- New! Tweet geolocation 5m: A dataset of more than 5 million geolocated tweets. Each tweet is paired with fine-grained location information collected from OpenStreetMap using the reverse geocoding feature in Nominatim.
- New! PHEME rumour dataset: This is a dataset of conversations around rumours associated with 9 different breaking news stories, collected from Twitter. It was developed within the journalism use case of the PHEME FP7 project. Each tweet is annotated for support, certainty, and evidentiality.
- TweetMT: A dataset for machine translation of tweets.
- TweetLID: A dataset for tweet language identification, which includes 35k tweets with manually annotated language labels.
- Hurricane Sandy tweets: Nearly 15 million tweets posted while Hurricane Sandy was hitting the East Coast of the United States and during its aftermath.
- ODPtweets: A large-scale Twitter dataset with nearly 25 million tweets categorized in the structure of the Open Directory Project (ODP).
- tweet-norm_es: Spanish-language tweets annotated for lexical normalization. Created for the tweet normalization challenge at Tweet-Norm 2013.
- Trending topics: A dataset with 1,036 categorized trending topics, which we used in "Real-Time Classification of Twitter Trends".
Social tagging datasets
- SocialBM0311: A large-scale, long-term social tagging dataset collected from Delicious.com. It contains the complete bookmarking activity for 2 million users from the launch of the social bookmarking website in 2003 to the end of March 2011.
- Social-ODP-2k9: 12,616 unique URLs, with categories from the Open Directory Project (ODP/Dmoz) and a variety of social annotations (tags, notes, reviews, etc.) retrieved from Delicious and StumbleUpon.
- DeliciousT140: 144,574 unique URLs, with social tags retrieved from Delicious.
- Wiki10+: 20,764 English Wikipedia articles, with social tags retrieved from Delicious.