Topic Identification of Noisy Arabic Texts Based on KNN

Abstract

The amount of text written in Romanized Arabic (Arabizi) is growing rapidly, and automatic methods for processing such text are increasingly needed. In this work, we address the identification of Algerian sub-dialects in social media comments written in Romanized Arabic, and we also consider the Arabizi-French code-switching phenomenon. To the best of our knowledge, this is the first work to address this problem on written documents. We therefore propose a new corpus (the DZDC12 corpus), together with general guidelines for collecting the texts. As a first attempt at Algerian sub-dialect identification, we use two state-of-the-art language identification tools (langid.py and LangDetect), as well as three classifiers (SVM, Multinomial NB, and Gaussian NB) based on a feature selection heuristic. The evaluation conducted on the DZDC12 corpus showed low performance and confirmed our expectation that the problem requires an extensive study to select a reliable feature set.
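To make the classifier setup more concrete, the following is a minimal sketch of one of the three classifier types named in the abstract, a Multinomial Naive Bayes model, written in plain Python. It is not the authors' implementation: the abstract does not specify the feature selection heuristic, so character trigrams are assumed here as the features, and the helper `char_ngrams`, the class `MultinomialNB`, the labels `label_a` and `label_b`, and the toy strings are all hypothetical placeholders that do not come from the DZDC12 corpus.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class MultinomialNB:
    """Multinomial Naive Bayes over character n-gram counts (additive smoothing).

    Assumption: character n-grams stand in for the paper's unspecified features.
    """

    def __init__(self, n=3, alpha=1.0):
        self.n = n
        self.alpha = alpha

    def fit(self, texts, labels):
        # Count documents per class and n-gram occurrences per class.
        self.class_counts = Counter(labels)
        self.ngram_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.ngram_counts[label].update(char_ngrams(text, self.n))
        self.vocab = {g for counts in self.ngram_counts.values() for g in counts}
        return self

    def log_scores(self, text):
        # Log prior plus the sum of smoothed log likelihoods for each class.
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, doc_count in self.class_counts.items():
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(doc_count / total_docs)
            for gram in char_ngrams(text, self.n):
                if gram in self.vocab:  # skip n-grams never seen in training
                    score += math.log((counts[gram] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented placeholder comments with arbitrary labels (not DZDC12 data).
train_texts = [
    "wach rak khoya",
    "wach rakom labas",
    "kifech halek sahbi",
    "kifech halkom lyoum",
]
train_labels = ["label_a", "label_a", "label_b", "label_b"]

model = MultinomialNB(n=3).fit(train_texts, train_labels)
print(model.predict("wach rak"))
```

Character n-grams are a common choice for short, noisy, non-standardized text because they tolerate spelling variation, but whether they suit sub-dialect identification is exactly the kind of question the abstract leaves open for further study.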

Publication
International Conference on Information and Communication Technology Research
Date