Neural Text Categorizer for Topic Identification of Noisy Arabic Texts

Abstract

This paper addresses the topic identification problem, which consists of recognizing the subject that a text is written about. Although several statistical and machine learning approaches address this problem, most of them assume relatively clean and long texts, and they fail on corrupted or short texts. Moreover, few works have been carried out on Arabic, which is a rich and comparatively complex language. For that reason, we investigate topic identification of noisy Arabic texts. To address this problem, we present the design and implementation of the Neural Text Categorizer (NTC), a novel neural network that differs from existing NNs in some concepts. Furthermore, we present and discuss a proposed improvement of the NTC (called NTCT), which is based on TF-IDF weights and consists of modifying the input vector and the classification formula. The empirical evaluation of the two algorithms was carried out on an in-house corpus (called ANTSIX) containing discussion forum texts. We also compared our best findings with the state of the art. We found that the proposed NTCT maintained consistently high performance and outperformed several algorithms in topic identification of noisy Arabic texts.
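
As a rough illustration of what a TF-IDF-weighted input vector looks like, the following minimal sketch computes standard TF-IDF weights over tokenized texts. It is not the NTCT algorithm itself: the abstract does not give the NTC architecture or its classification formula, so the function name `tfidf_vectors`, the length-normalized term frequency, and the unsmoothed logarithmic IDF are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of token lists.

    Hypothetical helper for illustration only; the weighting variant
    (relative term frequency times log(N / df)) is an assumption, not
    necessarily the one used by NTCT.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero on empty texts
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors


# Toy usage with placeholder tokens standing in for forum words.
vocab, vectors = tfidf_vectors([["sport", "match"], ["sport", "bank"]])
```

In this toy run the token shared by both texts receives weight zero, while tokens unique to one text receive a positive weight, which is the usual motivation for feeding TF-IDF weights rather than raw counts into a classifier.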

Publication
International Conference on Computer Systems and Applications
Date