Dataset: Sentiment1319

Sentiment1319 is a large-scale, longitudinal Twitter dataset in which each tweet is labelled for sentiment as positive or negative. It contains over 33 million tweets extracted from the Twitter Stream Internet Archive project. Sentiment-bearing tweets were selected following the methodology described by Go et al. (2009).

We followed these steps to generate this dataset:

  1. Data collection: we retrieved tweets from January 2013 to October 2019 through the Twitter Stream Grab project. Links to these files are listed in the twitter-archive-links.txt file.
  2. Sample sentiment tweets: we sampled the tweets that contain the sentiment-bearing smileys reported by Go et al. (2009). Positive smileys include [':)', ':-)', ': )', ':D', '=)'] and negative smileys include [':(', ':-(', ': (']. We kept tweets that contain either positive or negative smileys, but not both. For the final dataset, we removed those smileys from the text.
  3. Deduplication: we removed duplicate tweets, i.e. retweets of other tweets already in the dataset, since their content is the same.
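
Steps 2 and 3 above can be sketched in Python as follows. This is a minimal illustration, not the script used to build the dataset. The function names label_tweet and build_dataset are hypothetical, and so are the (tweet_id, text) input format, the whitespace normalisation, and the way retweets are detected (a leading "RT @user:" prefix followed by an exact match on the cleaned text). Only the smiley lists and the "either positive or negative, but not both" rule come from the description above.

```python
import re
from typing import Iterable, Iterator, Optional, Tuple

# Smiley lists as given in step 2.
POSITIVE = [":)", ":-)", ": )", ":D", "=)"]
NEGATIVE = [":(", ":-(", ": ("]

# Assumption: retweets are recognised by a leading "RT @user:" prefix.
RT_PREFIX = re.compile(r"^RT @\w+:\s*")


def label_tweet(text: str) -> Optional[Tuple[str, str]]:
    """Return (label, cleaned_text), or None if the tweet is not kept."""
    has_pos = any(s in text for s in POSITIVE)
    has_neg = any(s in text for s in NEGATIVE)
    # Keep tweets with exactly one polarity of smiley (neither or both: drop).
    if has_pos == has_neg:
        return None
    cleaned = text
    for smiley in (POSITIVE if has_pos else NEGATIVE):
        cleaned = cleaned.replace(smiley, "")
    # Assumption: collapse the whitespace left behind by removed smileys.
    cleaned = " ".join(cleaned.split())
    return ("positive" if has_pos else "negative", cleaned)


def build_dataset(
    tweets: Iterable[Tuple[str, str]]
) -> Iterator[Tuple[str, str, str]]:
    """Yield (tweet_id, label, cleaned_text), skipping repeated content."""
    seen = set()
    for tweet_id, text in tweets:
        labelled = label_tweet(RT_PREFIX.sub("", text))
        if labelled is None:
            continue
        label, cleaned = labelled
        # Assumption: a duplicate is a tweet whose cleaned text was already seen.
        if cleaned in seen:
            continue
        seen.add(cleaned)
        yield tweet_id, label, cleaned


sample = [
    ("1", "great day :)"),
    ("2", "RT @bob: great day :)"),  # retweet of tweet 1, dropped
    ("3", "so sad :("),
    ("4", "mixed feelings :) :("),   # both polarities, dropped
    ("5", "no smiley here"),         # no smiley, dropped
]
result = list(build_dataset(sample))
```

With the sample input, only tweets 1 and 3 survive. Holding every cleaned text in an in-memory set is fine for a sketch, but at the scale of tens of millions of tweets a hash of the text or an on-disk index would likely be needed.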

Reference

Please cite the following paper if you make use of this dataset for your research work:

Arkaitz Zubiaga.
Exploiting Class Labels to Boost Performance on Embedding-based Text Classification.
arXiv. 2020.

Download

The dataset is provided as a list of tweet IDs. The tweet content can be recovered by downloading the tweets from the Internet Archive, for which links are also provided. Please get in touch if you have any problems regenerating the dataset.