TweetMT: A Parallel Microblog Corpus

Iñaki San Vicente, Iñaki Alegria, Nora Aranberri, Cristina España-Bonet, Pablo Gamallo, Hugo Gonçalo Oliveira, Eva Martinez Garcia, Antonio Toral, Arkaitz Zubiaga

LREC. 2016.

We introduce TweetMT, a parallel corpus of tweets in four language pairs that combine five languages (Spanish to and from Basque, Catalan, Galician and Portuguese), all of which have official status in the Iberian Peninsula. The corpus was created by combining automatic collection with crowdsourcing, and it is publicly available. It is intended for the development and testing of microtext machine translation systems. In this paper we describe the methodology followed to build the corpus and present the results of the shared task in which it was tested.
@inproceedings{san2016tweetmt,
  title={TweetMT: A Parallel Microblog Corpus},
  author={San Vicente, I{\~n}aki and Alegr{\'\i}a, I{\~n}aki and Espa{\~n}a-Bonet, Cristina and Gamallo, Pablo and Gon{\c{c}}alo Oliveira, Hugo and Mart{\'\i}nez Garcia, Eva and Toral, Antonio and Zubiaga, Arkaitz and Aranberri, Nora},
  booktitle={Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)},
  pages={2936--2941},
  year={2016}
}