DFKI-LT - Using Related Languages to Enhance Statistical Language Models

Anna Currey, Alina Karakanta, Jonathan Dehdari
Using Related Languages to Enhance Statistical Language Models
The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 24-31, San Diego, California, USA, Association for Computational Linguistics, 6/2016
 
The success of many language modeling methods and applications relies heavily on the amount of data available. This dependence is even more pronounced in statistical machine translation, which requires parallel data in the source and target languages. However, large amounts of data are available for only a small number of languages; as a result, many language modeling techniques are inadequate for the vast majority of languages. In this paper, we attempt to mitigate the lack of training data for low-resource languages by adding data from related high-resource languages in three experiments. First, we interpolate language models trained on the target language and on the related language. Second, we select the related-language sentences most similar to the target language and add them to our training corpus. Finally, we integrate data from the related language into a translation model for a statistical machine translation application. Although we do not see many significant improvements over baselines trained on a small amount of target-language data, we discuss further experiments that could be attempted to augment language models and translation models with data from related languages.
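
The first experiment, interpolating a target-language model with a related-language model, can be illustrated with a minimal sketch. The abstract does not specify the interpolation scheme or model order, so everything below is an assumption made for illustration: linear interpolation, add-one smoothed unigram models, a shared vocabulary, an arbitrary weight of 0.7, and invented toy sentences. The function names are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab):
    """Add-one smoothed unigram model over a fixed vocabulary (illustrative only)."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def interpolate(p_target, p_related, lam):
    """Linear interpolation: lam * P_target(w) + (1 - lam) * P_related(w)."""
    return {w: lam * p_target[w] + (1 - lam) * p_related[w] for w in p_target}


def perplexity(model, tokens):
    """Per-token perplexity of a unigram model on a token sequence."""
    log_prob = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))


# Invented toy data: a tiny "low-resource" corpus and a larger related one.
target_train = "o gato dorme".split()
related_train = "el gato duerme y el perro duerme".split()
held_out = "o perro dorme".split()

# Assumption: both models share one vocabulary so their distributions align.
vocab = set(target_train) | set(related_train) | set(held_out)
p_target = train_unigram(target_train, vocab)
p_related = train_unigram(related_train, vocab)

# Assumption: lam = 0.7 is arbitrary; in practice it would be tuned on held-out data.
mixed = interpolate(p_target, p_related, lam=0.7)

print("target only :", perplexity(p_target, held_out))
print("interpolated:", perplexity(mixed, held_out))
```

Because the weights sum to one and both component models are proper distributions over the same vocabulary, the interpolated model is also a proper distribution. Whether it lowers held-out perplexity depends on the data and the weight, which is consistent with the mixed results the abstract reports.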
 