Harnessing Web Page Directories for Large-Scale Classification of Tweets

Arkaitz Zubiaga, Heng Ji

WWW. 2013.

Classification is paramount for an optimal processing of tweets, albeit performance of classifiers is hindered by the need of large sets of training data to encompass the diversity of contents one can find on Twitter. In this paper, we introduce an inexpensive way of labeling large sets of tweets, which can be easily regenerated or updated when needed. We use human-edited web page directories to infer categories from URLs contained in tweets. By experimenting with a large set of more than 5 million tweets categorized accordingly, we show that our proposed model for tweet classification can achieve 82% in accuracy, performing only 12.2% worse than for web page classification.

Download PDF file