A TREC-like data collection for evaluating approaches to retrieving tweets related to news articles.

Download

To get the dataset, fill in the form at this link:
https://goo.gl/forms/R9yYo3lQSQTUtnHc2

Format

Upon downloading the data, you get a single compressed file. You can uncompress it using unzip. Uncompressing yields a folder with 2 files:

Using the dataset

As in any TREC task, to use the dataset:
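  1. Use the topics file as input to your tweet retrieval approach. Your approach should return a ranked list of tweet IDs for each news article (topic), written in the TREC results file format. Let's call this file approach.result.
    Each line in the file should conform to the following:

    topic Q0 tweet-id rank score NAME

    Here topic is the ID of the news article, Q0 is a fixed literal, tweet-id is the ID of a retrieved tweet, rank is its position in the ranked list for that topic, score is its retrieval score, and NAME is a label for your run.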

    You can find the tweet collection used to build this dataset here.

  2. Use trec_eval to evaluate the effectiveness of your approach by running:
    trec_eval -q signal1m_tweets_qrels approach.result
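
The results file from step 1 could be produced along these lines. This is a minimal sketch, not part of the dataset: the function names format_run_lines and write_run, the in-memory layout of the rankings (a dict mapping each topic ID to a list of (tweet ID, score) pairs), and the four-decimal score formatting are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory layout: topic ID -> list of (tweet ID, score) pairs.
Rankings = Dict[str, List[Tuple[str, float]]]


def format_run_lines(ranked: Rankings, run_name: str = "NAME") -> List[str]:
    """Return one 'topic Q0 tweet-id rank score NAME' line per retrieved tweet."""
    lines = []
    for topic_id, scored in ranked.items():
        # Highest score first; ranks are assigned from 1 within each topic.
        ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
        for rank, (tweet_id, score) in enumerate(ordered, start=1):
            lines.append(f"{topic_id} Q0 {tweet_id} {rank} {score:.4f} {run_name}")
    return lines


def write_run(ranked: Rankings, path: str = "approach.result",
              run_name: str = "NAME") -> None:
    """Write the formatted lines to a results file, one per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in format_run_lines(ranked, run_name):
            handle.write(line + "\n")
```

Calling write_run(rankings) with your own rankings would then create approach.result, which you can pass to the trec_eval command above.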

Citing

This collection was described in a paper at ECIR 2018: A Data Collection for Evaluating the Retrieval of Related Tweets to News Articles.

@inproceedings{Signal1MRelatedTweetsRetrieval2018,
  author    = {Axel Suarez and Dyaa Albakour and David Corney and Miguel Martinez and Jose Esquivel},
  title     = {A Data Collection for Evaluating the Retrieval of Related Tweets to News Articles},
  booktitle = {40th European Conference on Information Retrieval Research {(ECIR 2018)}, Grenoble, France, March, 2018},
  year      = {2018},
  pages     = {780--786},
  url       = {https://link.springer.com/chapter/10.1007/978-3-319-76941-7_76}
}