DFKI-LT - Very large language models for machine translation

Christian Federmann
Very large language models for machine translation
Master's thesis, Universität des Saarlandes, Saarbrücken, Germany, 7/2007
Current state-of-the-art statistical machine translation relies on statistical language models that are based on n-grams and model language data using a Markov approach. The quality of an n-gram model depends on the n-gram order that is chosen when the model is trained. As machine translation is of increasing importance, we have investigated extensions to improve language model quality. This thesis presents a new type of language model that allows the integration of very large language models into the Moses MT framework. The approach creates an index from the complete n-gram data of a given language model and loads only this index data into memory. The actual n-gram data is retrieved dynamically from hard disk. The amount of memory required to store such an indexed language model can be controlled through the indexing parameters chosen when the index data is created. Further work done for this thesis included the creation of a standalone language model server. The current implementation of the Moses decoder is not able to keep language model data available in memory; instead, it is forced to reload this data each time the decoder application is started. Our new language model server moves language model handling into a dedicated process. This approach allows us to load n-gram data from a network or internet server, and it can also be used to export language model data to other applications using a simple communication protocol. We conclude the thesis work by creating a very large language model from the n-gram data contained in the Google 5-gram corpus released in 2006. Current limitations within the Moses MT framework hindered our evaluation efforts, hence no conclusive results can be reported. Instead, we identify further work on and improvements to the Moses decoder that are required before the full potential of very large language models can be efficiently exploited.
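The indexing idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation and does not use the Moses API. The class name `IndexedLM`, the function `build_index`, the parameter `prefix_len`, and the tab-separated file format (one `n-gram<TAB>log-probability` entry per line, sorted so that entries sharing a prefix are contiguous) are all assumptions made for illustration. Only the prefix-to-byte-range index is held in memory; each lookup reads the matching block from disk.

```python
import os
import tempfile


def build_index(path, prefix_len=1):
    """Map each prefix (the first prefix_len words) to a byte range in the file.

    Assumes the file is sorted so that all lines sharing a prefix are adjacent.
    """
    index = {}
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            ngram = line.split(b"\t", 1)[0]
            key = b" ".join(ngram.split()[:prefix_len])
            start, _ = index.get(key, (offset, offset))
            index[key] = (start, offset + len(line))
            offset += len(line)
    return index


class IndexedLM:
    """Hypothetical disk-backed n-gram table: index in memory, data on disk."""

    def __init__(self, path, prefix_len=1):
        self.path = path
        self.prefix_len = prefix_len  # larger value: bigger index, smaller reads
        self.index = build_index(path, prefix_len)

    def logprob(self, ngram):
        """Return the stored score for ngram, or None if it is not in the file."""
        words = ngram.split()
        key = " ".join(words[: self.prefix_len]).encode("utf-8")
        span = self.index.get(key)
        if span is None:
            return None
        start, end = span
        with open(self.path, "rb") as f:
            f.seek(start)                # jump straight to the relevant block
            block = f.read(end - start)  # read only that block from disk
        target = " ".join(words).encode("utf-8")
        for line in block.splitlines():
            text, _, score = line.partition(b"\t")
            if text == target:
                return float(score.decode("ascii"))
        return None


if __name__ == "__main__":
    # Toy data in the assumed format; the scores are made up for the demo.
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "lm.tsv")
        with open(demo_path, "wb") as out:
            out.write(b"a\t-1.0\na b\t-0.5\na c\t-0.7\nb\t-2.0\nb a\t-0.3\n")
        lm = IndexedLM(demo_path, prefix_len=1)
        print(len(lm.index), lm.logprob("a c"))
```

In this sketch `prefix_len` plays the role of an indexing parameter: a longer prefix produces more index entries, and therefore higher memory use, but each lookup reads a smaller block from disk.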
Files: BibTeX, thesis.pdf