N-Gram Language Modeling for Robust Multi-Lingual Document Classification

Jörg Steffen

In: Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), pages 731-734. ELRA, 2004.


Statistical n-gram language modeling is used in many domains such as speech recognition, language identification, machine translation, character recognition, and topic classification. Most language modeling approaches work on n-grams of terms. This paper reports on ongoing research in the MEMPHIS project, which employs models based on character-level n-grams instead of term n-grams. The models are used for the multi-lingual classification of documents according to the topics of the MEMPHIS domains. We present methods capable of dealing robustly with large vocabularies and informal, erroneous texts in different languages. We also report our results on using multi-lingual language models and on experimenting with different classification parameters such as smoothing techniques and n-gram lengths.
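To illustrate the general idea of character-level n-gram language models for topic classification, the following is a minimal sketch, not the paper's actual implementation: one smoothed character n-gram model per topic, with documents assigned to the topic whose model gives the highest log-probability. The class name, the boundary padding, and the use of simple add-one (Laplace) smoothing are assumptions for illustration; the paper itself experiments with different smoothing techniques and n-gram lengths.

```python
import math
from collections import Counter

def char_ngrams(text, n):
    """Extract overlapping character n-grams, padding word boundaries with '_'."""
    padded = "_" * (n - 1) + text + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class NgramTopicClassifier:
    """Hypothetical per-topic character n-gram models with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.models = {}   # topic -> Counter of n-gram counts
        self.totals = {}   # topic -> total n-gram count
        self.vocab = set() # all n-grams seen in training

    def train(self, topic, texts):
        counts = self.models.setdefault(topic, Counter())
        for text in texts:
            grams = char_ngrams(text.lower(), self.n)
            counts.update(grams)
            self.vocab.update(grams)
        self.totals[topic] = sum(counts.values())

    def log_prob(self, topic, text):
        """Smoothed log-probability of the text under one topic model."""
        counts = self.models[topic]
        total = self.totals[topic]
        v = len(self.vocab)
        return sum(math.log((counts[g] + 1) / (total + v))
                   for g in char_ngrams(text.lower(), self.n))

    def classify(self, text):
        """Return the topic whose model scores the text highest."""
        return max(self.models, key=lambda t: self.log_prob(t, text))

# Toy usage with made-up training snippets:
clf = NgramTopicClassifier(n=3)
clf.train("finance", ["stock markets fell sharply", "the bank raised interest rates"])
clf.train("weather", ["heavy rain and strong winds", "sunny skies expected tomorrow"])
print(clf.classify("rain showers and wind"))
```

Because the models operate on character sequences rather than a term vocabulary, unseen word forms and misspellings still share n-grams with the training data, which is what makes the approach robust to informal, erroneous text.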


German Research Center for Artificial Intelligence