Improving Machine Translation Performance Using Comparable Corpora

Andreas Eisele; Jia Xu

In: Serge Sharoff, Pierre Zweigenbaum, Reinhard Rapp (eds.). Proceedings of the 3rd Workshop on Building and Using Comparable Corpora. Workshop on Building and Using Comparable Corpora (BUCC-3), Applications of Parallel and Comparable Corpora in Natural Language Engineering and the Humanities, La Valletta, Malta, Pages 35-41, ISBN 2-9517408-6-7, European Language Resources Association (ELRA), 5/2010.


The overwhelming majority of the world's languages are spoken by fewer than 50 million native speakers, and automatic translation for many of them has received little attention because linguistic resources such as parallel corpora are lacking. In the ACCURAT project we will develop novel methods by which comparable corpora can compensate for this shortage and improve machine translation systems for under-resourced languages. Translation systems for eighteen European language pairs will be investigated, and methodologies in corpus linguistics will be advanced. We will explore the use of preliminary statistical machine translation (SMT) models to identify the parallel parts within comparable corpora, which will allow us to derive better SMT models via a bootstrapping loop.
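
The bootstrapping loop mentioned in the abstract can be illustrated with a minimal sketch. This is not the ACCURAT method: a toy word co-occurrence lexicon stands in for a real SMT model, and the names `train_lexicon`, `score`, `bootstrap`, the coverage-based score, and the `threshold` value are all hypothetical choices made only for illustration.

```python
from collections import defaultdict


def train_lexicon(pairs):
    """Toy stand-in for SMT training: record which words co-occur in paired sentences."""
    lexicon = defaultdict(set)
    for src, tgt in pairs:
        for s in src.split():
            lexicon[s].update(tgt.split())
    return lexicon


def score(lexicon, src, tgt):
    """Fraction of source words that have a known co-occurring word in the target."""
    src_words = src.split()
    tgt_words = set(tgt.split())
    if not src_words:
        return 0.0
    covered = sum(1 for s in src_words if lexicon.get(s, set()) & tgt_words)
    return covered / len(src_words)


def bootstrap(seed_pairs, candidates, threshold=0.5, max_rounds=5):
    """Alternate between training a model and mining new pairs it judges parallel."""
    parallel = list(seed_pairs)
    remaining = list(candidates)
    for _ in range(max_rounds):
        lexicon = train_lexicon(parallel)  # (re)train on everything accepted so far
        accepted = [p for p in remaining if score(lexicon, *p) >= threshold]
        if not accepted:
            break  # the model finds nothing new, so the loop has converged
        parallel.extend(accepted)
        remaining = [p for p in remaining if p not in accepted]
    return parallel, remaining


# Illustrative data (invented for this sketch, not taken from the paper).
seed = [("das haus", "the house"), ("ein hund", "a dog")]
candidates = [
    ("das haus ist alt", "the house is old"),
    ("ist alt", "is old"),
    ("guten morgen", "see you"),
]
parallel, rejected = bootstrap(seed, candidates)
```

In this toy run the pair ("ist alt", "is old") scores zero against the seed model and is accepted only after the first round has added ("das haus ist alt", "the house is old") to the training data, which is the effect a bootstrapping loop is meant to exploit. A real system would replace the lexicon with a trained SMT model and would need a more careful filter, since errors accepted early are reinforced in later rounds.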


Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence