Scaling Character-Based Morphological Tagging to Fourteen Languages

Georg Heigold, Günter Neumann, Josef van Genabith

In: Proceedings of the IEEE International Conference on Big Data (IEEE BigData-16), December 5-8, Washington, DC, United States. IEEE, 12/2016.


This paper investigates neural character-based morphological tagging for languages with complex morphology and large tag sets. Character-based approaches are attractive because they handle rare and unseen words gracefully. Besides rich morphology, non-canonical language, changes of language, and other linguistic variability can heavily degrade the accuracy of natural language processing on web and computer-mediated communication (CMC) data. We evaluate on 14 languages and observe consistent gains over a state-of-the-art morphological tagger across all languages except English and French, where we match the state of the art. The gains are clearly correlated with the amount of training data. We present supplementary experiments to explore whether, and to what extent, unsupervised data in the form of pre-trained word vectors can compensate for limited amounts of supervised data. Moreover, we show preliminary results on the effect of noisy input data, simulated by flipping characters at random.
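As an illustration of the noise condition mentioned at the end of the abstract, the following is a minimal sketch of random character flipping. It is not the authors' code: the function name `flip_characters`, the `rate` and `seed` parameters, and the choice of a lowercase ASCII replacement alphabet are all assumptions made for this example, and the abstract does not specify how the flipping was actually carried out.

```python
import random
import string


def flip_characters(text, rate, alphabet=string.ascii_lowercase, seed=None):
    """Return a noisy copy of `text`.

    Hypothetical helper: each non-whitespace character is replaced, with
    probability `rate`, by a symbol drawn uniformly from `alphabet`.
    Whitespace is kept so that token boundaries stay intact.
    """
    rng = random.Random(seed)  # local generator, so results are reproducible
    noisy = []
    for ch in text:
        if not ch.isspace() and rng.random() < rate:
            # The drawn symbol may coincide with the original character,
            # so the effective corruption rate is slightly below `rate`.
            noisy.append(rng.choice(alphabet))
        else:
            noisy.append(ch)
    return "".join(noisy)


if __name__ == "__main__":
    sentence = "wir sehen ein kleines haus"
    for rate in (0.0, 0.1, 0.3):
        print(rate, flip_characters(sentence, rate, seed=13))
```

Under these assumptions, a character-based tagger could be evaluated on the same test sentences at increasing values of `rate` to see how quickly accuracy degrades.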



Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence