Using Related Languages to Enhance Statistical Language Models

Anna Currey, Alina Karakanta, Jon Dehdari

In: The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop. NAACL-HLT Student Research Workshop (NAACL-SRW-2016), June 12-17, San Diego, California, United States. Pages 24-31, ISBN 978-1-941643-81-5, Association for Computational Linguistics, 6/2016.


The success of many language modeling methods and applications relies heavily on the amount of data available. This dependence is even more pronounced in statistical machine translation, where parallel data in the source and target languages is required. However, large amounts of data are available for only a small number of languages; as a result, many language modeling techniques are inadequate for the vast majority of languages. In this paper, we attempt to mitigate the lack of training data for low-resource languages by adding data from related high-resource languages in three experiments. First, we interpolate language models trained on the target language and on the related language. Second, we select the related-language sentences most similar to the target language and add them to our training corpus. Finally, we integrate data from the related language into a translation model for a statistical machine translation application. Although we do not see many significant improvements over baselines trained on a small amount of target-language data, we discuss further experiments that could be attempted in order to augment language models and translation models with data from related languages.
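
The first experiment, interpolating a target-language model with a related-language model, can be illustrated with a minimal sketch. The abstract does not specify the interpolation scheme, the model order, or the smoothing used, so everything below is an assumption made for illustration: linear interpolation with a single weight `lam`, add-one smoothed unigram models, and toy sentences that are not taken from the paper. All function names are hypothetical.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab):
    """Add-one smoothed unigram probabilities over a fixed shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def interpolate(p_target, p_related, lam):
    """Linear interpolation: lam * P_target(w) + (1 - lam) * P_related(w)."""
    return {w: lam * p_target[w] + (1 - lam) * p_related[w] for w in p_target}


def perplexity(model, tokens):
    """Perplexity of a unigram model on held-out tokens."""
    log_prob = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))


# Toy data (hypothetical, not from the paper): a small target-language
# corpus and a larger corpus from a related language.
target_tokens = "la casa es grande".split()
related_tokens = "a casa é grande e a casa é bonita".split()
heldout_tokens = "la casa es bonita".split()

# Both models share one vocabulary so that they can be mixed word by word.
vocab = set(target_tokens) | set(related_tokens) | set(heldout_tokens)
p_target = train_unigram(target_tokens, vocab)
p_related = train_unigram(related_tokens, vocab)

# lam = 0.7 is an arbitrary assumed weight; in practice it would be tuned
# on held-out target-language data.
p_mixed = interpolate(p_target, p_related, lam=0.7)

print(perplexity(p_target, heldout_tokens))
print(perplexity(p_mixed, heldout_tokens))
```

Because both component models are proper distributions over the same vocabulary, the mixture is also a proper distribution for any `lam` between 0 and 1. Whether it lowers held-out perplexity depends on the data, which is consistent with the mixed results the abstract reports.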

Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence