
TweetLID: A Benchmark for Tweet Language Identification

Arkaitz Zubiaga, Iñaki San Vicente, Pablo Gamallo, José Ramom Pichel, Iñaki Alegria, Nora Aranberri, Aitzol Ezeiza, Víctor Fresno

Language Resources and Evaluation. 2016.

Language identification, the task of determining the language a given text is written in, has progressed substantially in recent decades. However, three main issues remain unresolved: (i) distinguishing similar languages, (ii) detecting multilingualism in a single document, and (iii) identifying the language of short texts. In this paper, we describe our work on the development of a benchmark to encourage further research in these three directions, set forth an evaluation framework suitable for the task, and make a dataset of annotated tweets publicly available for research purposes. We also describe the shared task we organized to validate and assess the evaluation framework and dataset with systems submitted by seven different participants, and analyze the performance of these systems. The evaluation of the results submitted by the participants of the shared task sheds light on the shortcomings of state-of-the-art language identification systems, and gives insight into the extent to which the brevity, multilingualism, and language similarity found in texts degrade the performance of language identifiers. Our dataset of nearly 35,000 tweets and the evaluation framework provide researchers and practitioners with suitable resources to further study these issues in language identification within a common setting that allows results to be compared with one another.
  title={{TweetLID}: a benchmark for tweet language identification},
  author={Zubiaga, Arkaitz and San Vicente, I{\~n}aki and Gamallo, Pablo and Pichel, Jos{\'e} Ramom and Alegria, I{\~n}aki and Aranberri, Nora and Ezeiza, Aitzol and Fresno, V{\'\i}ctor},
  journal={Language Resources and Evaluation},