Exploring cross-language statistical machine translation for closely related South Slavic languages

Maja Popovic; Nikola Ljubešić

In: EMNLP Workshop on Language Technologies for closely related languages and language variants. Conference on Empirical Methods in Natural Language Processing (EMNLP-14), October 25-29, Doha, Qatar, EMNLP, 10/2014.


This work investigates the use of cross-language resources for statistical machine translation (SMT) between English and two closely related South Slavic languages, Croatian and Serbian. The goal is to explore the effects of translating from and into one language using an SMT system trained on the other. For translation into English, the loss due to cross-translation is about 13% of BLEU, and for the other translation direction it is about 15%. The performance decrease for both languages in both translation directions is mainly due to lexical divergences. Several language adaptation methods are explored; very simple lexical transformations can already yield a small improvement, and the most promising adaptation method is a Croatian-Serbian SMT system trained on a very small corpus.
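The abstract does not spell out the lexical transformations themselves. As a rough illustration only, a word-level substitution from Croatian to Serbian could be sketched as below. The word list HR_TO_SR and the function adapt are hypothetical names invented for this sketch, and the four word pairs are just well-known lexical differences between the two languages, not the lexicon used in the paper.

```python
import re

# Hypothetical, tiny Croatian -> Serbian word list for illustration only;
# it is NOT the lexicon or rule set used in the paper.
HR_TO_SR = {
    "tko": "ko",         # "who"
    "kruh": "hleb",      # "bread"
    "vlak": "voz",       # "train"
    "tisuća": "hiljada", # "thousand"
}

# \w matches Unicode letters in Python 3, so words with diacritics stay whole.
TOKEN = re.compile(r"\w+")


def adapt(sentence, table=HR_TO_SR):
    """Replace each word found in `table`, keeping an initial capital letter."""
    def swap(match):
        word = match.group(0)
        replacement = table.get(word.lower())
        if replacement is None:
            return word  # unknown words pass through unchanged
        return replacement.capitalize() if word[0].isupper() else replacement

    return TOKEN.sub(swap, sentence)


print(adapt("Tko je kupio kruh?"))
```

A sentence adapted this way could then be passed to an SMT system trained on the other language. Such a lookup ignores inflection and context, which may be one reason a trained Croatian-Serbian SMT system is reported as the more promising option.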



Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence