DFKI-LT - Scaling Character-Based Morphological Tagging to Fourteen Languages

Georg Heigold, Günter Neumann, Josef van Genabith
Scaling Character-Based Morphological Tagging to Fourteen Languages
Proceedings of the IEEE International Conference on Big Data, Washington, DC, USA, IEEE, 12/2016
 
This paper investigates neural character-based morphological tagging for languages with complex morphology and large tag sets. Character-based approaches are attractive because they handle rare and unseen words gracefully. Besides rich morphology, non-canonical language, language switching, and other linguistic variability can heavily degrade the accuracy of natural language processing on web and computer-mediated communication (CMC) data. We evaluate on 14 languages and observe consistent gains over a state-of-the-art morphological tagger for all languages except English and French, where we match the state of the art. The gains are clearly correlated with the amount of training data. We present supplementary experiments to explore whether, and to what extent, unsupervised data in the form of pre-trained word vectors can compensate for limited amounts of supervised data. Moreover, we show preliminary results on the effect of noisy input data, obtained by flipping characters at random.
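
The random character flipping mentioned in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: the abstract does not specify the noise model, so the function name `flip_characters`, the per-character probability `p`, the replacement alphabet, and the choice to leave whitespace untouched are all assumptions made here.

```python
import random


def flip_characters(text, p, alphabet, rng):
    """Replace each non-whitespace character with a random symbol from
    `alphabet` with probability `p` (hypothetical noise model).

    Whitespace is kept so that token boundaries survive, which is an
    assumption of this sketch. A replacement may coincide with the
    original character, so the effective change rate can be below `p`.
    """
    noisy = []
    for ch in text:
        if not ch.isspace() and rng.random() < p:
            noisy.append(rng.choice(alphabet))
        else:
            noisy.append(ch)
    return "".join(noisy)


# Example: corrupt roughly 10% of the characters of a short input.
rng = random.Random(0)  # fixed seed for reproducibility
alphabet = "abcdefghijklmnopqrstuvwxyz"
print(flip_characters("morphological tagging", 0.1, alphabet, rng))
```

Sweeping `p` over a few values and re-evaluating a trained tagger on the corrupted input would give one way to chart accuracy against noise level, under the assumptions stated above.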
 