
Threatening Language Detection and Target Identification in Urdu Tweets

Maaz Amjad, Noman Ashraf, Alisa Zhila, Grigori Sidorov, Arkaitz Zubiaga, Alexander Gelbukh

IEEE Access. 2021.

Automatic detection of threatening language is an important task; however, most existing studies have focused on English as the target language, with limited work on low-resource languages. In this paper, we introduce and release a new dataset for threatening language detection in Urdu tweets to further research in this language. The proposed dataset contains 3,564 tweets manually annotated by human experts as either threatening or non-threatening. The threatening tweets are further classified by target into one of two types: threatening to an individual person or threatening to a group. This research follows a two-step approach: (i) classify a given tweet as threatening or non-threatening and (ii) classify whether a threatening tweet is used to threaten an individual or a group. We compare three forms of text representation: two are count-based, representing the text as feature vectors of either character n-gram counts or word n-gram counts, and the third is based on fastText pre-trained word embeddings for Urdu. We perform several experiments using machine learning and deep learning classifiers, and our study shows that an MLP classifier with a combination of word n-gram features outperformed other classifiers in detecting threatening tweets. Further, an SVM classifier using fastText pre-trained word embeddings obtained the best results for the target identification task.
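The count-based representations and the two-step approach described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`word_ngrams`, `char_ngrams`, `two_step_predict`), the whitespace tokenization, and the label strings are all assumptions made for illustration, and the two classifiers are passed in as plain callables rather than trained MLP or SVM models.

```python
from collections import Counter
from typing import Callable, Optional, Tuple


def word_ngrams(text: str, n: int) -> Counter:
    """Count contiguous word n-grams (assumes whitespace tokenization)."""
    tokens = text.split()
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def char_ngrams(text: str, n: int) -> Counter:
    """Count contiguous character n-grams, including spaces."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def two_step_predict(
    tweet: str,
    is_threatening: Callable[[str], bool],
    target_of: Callable[[str], str],
) -> Tuple[str, Optional[str]]:
    """Hypothetical cascade: step (i) decides threatening vs. non-threatening;
    step (ii) runs only on threatening tweets and returns the target type."""
    if not is_threatening(tweet):
        return ("non-threatening", None)
    return ("threatening", target_of(tweet))
```

In a real pipeline, the n-gram counts would be turned into fixed-length feature vectors over a shared vocabulary, and the two callables would be replaced by trained classifiers. Passing them in as arguments keeps the cascade logic independent of the model choice, which matters here because the abstract reports a different best model for each step.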