Publication

Scaling Character-Based Morphological Tagging to Fourteen Languages

Georg Heigold, Günter Neumann, Josef van Genabith

In: Proceedings of the IEEE International Conference on Big Data (IEEE BigData-16), December 5-8, Washington, DC, United States. IEEE, 12/2016.

Abstract

This paper investigates neural character-based morphological tagging for languages with complex morphology and large tag sets. Character-based approaches are attractive because they can handle rare and unseen words gracefully. More specifically, besides rich morphology, non-canonical language, change of language, or other linguistic variability can heavily degrade the accuracy of natural language processing on web and computer-mediated communication (CMC) data. We evaluate on 14 languages and observe consistent gains over a state-of-the-art morphological tagger across all languages except English and French, where we match the state of the art. The gains are clearly correlated with the amount of training data. We present supplementary experiments to explore whether, and to what extent, unsupervised data in the form of pre-trained word vectors can compensate for limited amounts of supervised data. Moreover, we show preliminary results on the effect of noisy input data, simulated by flipping characters at random.
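The noise experiment mentioned at the end of the abstract can be illustrated with a minimal sketch. The abstract does not specify the noise model, so everything below is an assumption made for illustration: the function name `flip_characters`, the per-character flip probability `rate`, the replacement `alphabet`, and the choice to leave spaces untouched so that token boundaries stay intact.

```python
import random


def flip_characters(text, rate, alphabet, rng=None):
    """Return a noisy copy of `text`.

    Hypothetical noise model (not taken from the paper): each non-space
    character is replaced, with probability `rate`, by a different symbol
    drawn uniformly from `alphabet`. Spaces are kept so that the number
    of tokens does not change.
    """
    rng = rng or random.Random()
    noisy = []
    for ch in text:
        if ch != " " and rng.random() < rate:
            # Exclude the original character so a flip always changes it.
            candidates = [c for c in alphabet if c != ch]
            noisy.append(rng.choice(candidates) if candidates else ch)
        else:
            noisy.append(ch)
    return "".join(noisy)


if __name__ == "__main__":
    letters = "abcdefghijklmnopqrstuvwxyz"
    sentence = "the gains are correlated with the training data"
    # A fixed seed makes the corruption reproducible across runs.
    print(flip_characters(sentence, rate=0.1, alphabet=letters, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the corrupted data identical across runs, which matters when comparing taggers on the same noisy input.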

Projects

Big_NLP_2016_(7).pdf (pdf, 417 KB)

Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence