Implementation and Evaluation of a Language Identification System for Mono- and Multi-lingual Texts

Olga Artemenko, Thomas Mandl, Margaryta Shramko, Christa Womser-Hacker

Abstract

Language identification is the task of classifying a text in an unknown language by comparing it against pre-defined language models. This paper presents the implementation of a language identification tool for mono- and multi-lingual documents. The tool includes four language identification algorithms. An evaluation covering eight languages, including Ukrainian and Russian, and various text lengths is presented. The results show that n-gram-based approaches outperform word-based algorithms on short texts, while performance on longer texts is comparable. The tool can also identify language changes within a single multi-lingual document.
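
To illustrate the general idea behind n-gram-based identification, the following is a minimal sketch and not the tool described in this paper. It assumes a rank-order ("out-of-place") comparison of character trigram profiles, which is one common n-gram technique; the abstract does not specify which four algorithms the tool implements. The function names, the `top_k` parameter, and the two toy training sentences are hypothetical and chosen only for illustration.

```python
from collections import Counter


def ngram_profile(text, n=3, top_k=300):
    """Return the top_k most frequent character n-grams of text, most frequent first."""
    # Normalise whitespace and pad so that word boundaries form n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    # Sort by descending count, then alphabetically, so the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the model get a fixed maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def identify(text, models):
    """Return the language whose profile is closest to the profile of text."""
    doc_profile = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(doc_profile, models[lang]))


# Hypothetical toy training data; a real model would be built from a large corpus.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat "
          "is in the house with the children",
    "de": "der schnelle braune fuchs springt über den faulen hund und "
          "die katze ist in dem haus mit den kindern",
}
MODELS = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    print(identify("the dog is in the house", MODELS))
    print(identify("der hund ist in dem haus", MODELS))
```

Because character n-grams capture sub-word regularities, even a short input yields many features to compare. This is consistent with the finding that n-gram-based approaches do better than word-based ones on short texts.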
