Dataset: ODPtweets

ODPtweets is a large-scale Twitter dataset of nearly 25 million tweets categorized according to the structure of the Open Directory Project (ODP). The dataset was used in the WWW 2013 paper "Harnessing Web Page Directories for Large-Scale Classification of Tweets". The category of each tweet is inferred from the links it points to; for details, please see the paper.

The dataset contains tweets for two different timeframes: March 17 to 29, 2012 and April 12 to 24, 2012.

Format

The two files provided here contain one tweet per line, each line with the following tab-separated fields:

    tweet_id   username   md5_hash   odp_category

where tweet_id and username can be used to form the URL of the tweet, md5_hash makes it possible to validate the content of the tweet, and odp_category is the category assigned to the tweet according to the ODP structure.
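
As an illustration of how these fields might be used, here is a minimal Python sketch that parses one line, builds the tweet URL, and checks a downloaded text against the hash. The names Record, parse_line, tweet_url and text_matches are hypothetical helpers, not part of the dataset. The URL layout is the conventional twitter.com status URL. The sketch assumes that md5_hash is the hexadecimal MD5 digest of the UTF-8 encoded tweet text; this page does not say exactly what was hashed, so please verify that assumption on a few tweets first.

```python
import hashlib
from typing import NamedTuple


class Record(NamedTuple):
    """One dataset line; all fields are kept as strings."""
    tweet_id: str
    username: str
    md5_hash: str
    odp_category: str


def parse_line(line: str) -> Record:
    """Split one tab-separated dataset line into its four fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated fields, got {len(fields)}")
    return Record(*fields)


def tweet_url(record: Record) -> str:
    """Build the tweet URL from the username and the tweet id."""
    # Assumption: the conventional twitter.com status URL layout.
    return f"https://twitter.com/{record.username}/status/{record.tweet_id}"


def text_matches(record: Record, text: str) -> bool:
    """Compare a downloaded tweet text against the stored hash."""
    # Assumption: the hash is the hex MD5 digest of the UTF-8 encoded text.
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return digest == record.md5_hash.lower()
```

A mismatch reported by text_matches may mean either that the tweet has changed or that the hashing assumption above does not hold, so a failed check is not conclusive on its own.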

To respect Twitter's TOS, tweet texts are not redistributed; only tweet IDs and author screen names are provided. Tweet texts can be downloaded with any of the following tools:

SemEval-2013 Task 2 Download script (in Python)
http://www.cs.york.ac.uk/semeval-2013/task2/index.php?id=data
RepLab 2013 Twitter Texts Downloader (in Java)
http://nlp.uned.es/replab2013/replab2013_twitter_texts_downloader_v0.7.tar.gz
TREC Microblog Track (in Java)
https://github.com/lintool/twitter-tools

Reference

Please cite the following paper if you use this dataset in your research:

Arkaitz Zubiaga, Heng Ji.
Harnessing Web Page Directories for Large-Scale Classification of Tweets.
WWW 2013

Download

The dataset is provided in two separate files, one for each timeframe specified above:

ODPtweets-Mar17-29.tar.bz2 (368 MB)
ODPtweets-Apr12-24.tar.bz2 (428 MB)
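
The archives can be read without unpacking them to disk. The following is a minimal Python sketch, using only the standard tarfile module, that streams the rows of every regular file inside an archive. The function name iter_rows is hypothetical, and the sketch assumes the files inside the archives are UTF-8 text in the tab-separated format described above; the internal file names are not stated here, so the code simply visits every member.

```python
import io
import tarfile
from typing import Iterator, List


def iter_rows(archive_path: str) -> Iterator[List[str]]:
    """Yield the tab-separated fields of each non-empty line in the archive."""
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        for member in archive:
            # Skip directories and other non-file entries.
            if not member.isfile():
                continue
            raw = archive.extractfile(member)
            if raw is None:
                continue
            # Assumption: the member files are UTF-8 encoded text.
            with io.TextIOWrapper(raw, encoding="utf-8") as lines:
                for line in lines:
                    line = line.rstrip("\r\n")
                    if line:
                        yield line.split("\t")
```

For example, collections.Counter(row[3] for row in iter_rows("ODPtweets-Mar17-29.tar.bz2")) would count tweets per category, assuming every row has the four fields listed in the Format section.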