DFKI-LT - N-Gram Language Modeling for Robust Multi-Lingual Document Classification

Jörg Steffen
N-Gram Language Modeling for Robust Multi-Lingual Document Classification
Proceedings of the 4th International Conference on Language Resources and Evaluation, Pages 731-734, ELRA, Paris, France, 2004
 
Statistical n-gram language modeling is used in many domains, such as speech recognition, language identification, machine translation, character recognition and topic classification. Most language modeling approaches work on n-grams of terms. This paper reports on ongoing research in the MEMPHIS project, which employs models based on character-level n-grams instead of term n-grams. The models are used for the multi-lingual classification of documents according to the topics of the MEMPHIS domains. We present methods capable of dealing robustly with large vocabularies and with informal, erroneous texts in different languages. We also report the results of using multi-lingual language models and of experimenting with different classification parameters, such as smoothing techniques and n-gram lengths.
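
To make the general idea concrete, the following is a minimal sketch of topic classification with character-level n-gram models. It is not the MEMPHIS implementation: the class name `CharNgramModel`, the helper functions, the add-one smoothing, the trigram length and the toy training texts are all assumptions chosen for illustration. One model is trained per topic, and a document is assigned to the topic whose model gives it the highest log-probability.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return one character n-gram per character, left-padded with spaces."""
    padded = " " * (n - 1) + text
    return [padded[i:i + n] for i in range(len(text))]


class CharNgramModel:
    """Character n-gram model with add-one smoothing (illustrative choice)."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = Counter()    # counts of full n-grams
        self.context_counts = Counter()  # counts of the (n-1)-character prefixes
        self.alphabet = set()

    def train(self, texts):
        for text in texts:
            self.alphabet.update(text)
            for gram in char_ngrams(text, self.n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def log_prob(self, text):
        # The extra +1 reserves probability mass for unseen characters.
        vocab_size = len(self.alphabet) + 1
        total = 0.0
        for gram in char_ngrams(text, self.n):
            numerator = self.ngram_counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + vocab_size
            total += math.log(numerator / denominator)
        return total


def classify(text, models):
    """Return the label whose model assigns the text the highest log-probability."""
    return max(models, key=lambda label: models[label].log_prob(text))


# Hypothetical toy data; real topic models need far more text per topic.
training = {
    "sports": ["the team won the match", "a goal in the final minute"],
    "finance": ["the bank raised interest rates", "stock prices fell sharply"],
}

models = {}
for label, texts in training.items():
    model = CharNgramModel(n=3)
    model.train(texts)
    models[label] = model

print(classify("the team won the match", models))
```

Because the model works on characters rather than terms, a misspelled or unseen word still shares most of its n-grams with known text, which is one plausible reading of the robustness the abstract refers to. The smoothing method and n-gram length are exactly the kind of parameters the paper reports experimenting with, so the values used here should be treated as placeholders.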
 
Files: BibTeX, lrec04-classification.pdf