Dataset: Hurricane Sandy tweets

This dataset contains nearly 15 million tweets posted on Twitter while Hurricane Sandy was hitting the East Coast of the United States and during its aftermath. Tweets were collected from October 25, 2012, to November 4, 2012, using the keywords 'hurricane' and 'sandy', which also match hashtag variants such as #hurricane and #sandy. For more about the hurricane, see Hurricane Sandy.

Format

The file provided here contains one tweet per line, and each line has the following tab-separated fields:

tweet_id   username   md5_hash

where tweet_id and username can be used to form the URL of the tweet, and md5_hash can be used to validate the content of the tweet once it has been downloaded.
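
As an illustration of how these fields might be used, the following Python sketch parses one line, builds the tweet URL, and checks a downloaded text against the hash. The names TweetRecord, parse_line, tweet_url and matches_hash are hypothetical helpers, not part of the dataset. The URL pattern is the standard twitter.com status URL. The sketch also assumes that md5_hash is the hexadecimal MD5 digest of the UTF-8 encoded tweet text; this is not stated above, so the exact input to the hash (for example, whitespace handling) may differ.

```python
import hashlib
from typing import NamedTuple


class TweetRecord(NamedTuple):
    """One line of the dataset file (hypothetical helper type)."""
    tweet_id: str
    username: str
    md5_hash: str


def parse_line(line: str) -> TweetRecord:
    """Split one tab-separated line into its three fields."""
    tweet_id, username, md5_hash = line.rstrip("\n").split("\t")
    return TweetRecord(tweet_id, username, md5_hash)


def tweet_url(record: TweetRecord) -> str:
    """Build the tweet URL from the author's screen name and the tweet ID."""
    return f"https://twitter.com/{record.username}/status/{record.tweet_id}"


def matches_hash(record: TweetRecord, text: str) -> bool:
    """Compare a downloaded tweet text with the stored hash.

    Assumption: md5_hash is the hex MD5 digest of the UTF-8 tweet text.
    """
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return digest == record.md5_hash.lower()
```

A typical use would be to call parse_line on each line of the file, fetch the text at tweet_url(record) with one of the download tools, and keep the tweet only if matches_hash returns True.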

To comply with Twitter's terms of service (TOS), tweets are not redistributed; only tweet IDs and author screen names are provided. Tweet texts can be downloaded using any of the following tools:

  1. SemEval-2013 Task 2 Download script (in Python)
    http://www.cs.york.ac.uk/semeval-2013/task2/index.php?id=data
  2. RepLab 2013 Twitter Texts Downloader (in Java)
    http://nlp.uned.es/replab2013/replab2013_twitter_texts_downloader_v0.7.tar.gz
  3. TREC Microblog Track (in Java)
    https://github.com/lintool/twitter-tools

Reference

Please cite the following paper if you use this dataset in your research:

Arkaitz Zubiaga, Heng Ji.
Tweet, but Verify: Epistemic Study of Information Verification on Twitter.
Social Network Analysis and Mining. In press.
