Robust Language Identification of Noisy Texts: Proposal of Hybrid Approaches


This paper addresses automatic language identification of noisy texts, an important task in natural language processing. Several existing works in this field rely on statistical and machine learning approaches for different categories of texts. Unfortunately, most of the proposed methods work well on clean and/or long texts, but often fail when the text is corrupted or too short. In this work, we use a dataset of short texts collected from several discussion forums and containing several types of noise. The dataset covers 32 languages, some of which are quite different from one another while others are very close. We propose two types of methods to identify the language of a text: a term-based method and a character-based method. In addition, we propose two hybrid methods to improve the performance of these techniques. Experiments show that the proposed hybrid methods are promising and achieve good language identification performance on noisy texts.

International Workshop on Text-based Information Retrieval (TIR-DEXA)