Publication

Mining parallel resources for machine translation from comparable corpora

Santanu Pal, Partha Pakray, Alexander Gelbukh, Josef van Genabith

In: Alexander Gelbukh (Ed.). Computational Linguistics and Intelligent Text Processing. Pages 534-544, Lecture Notes in Computer Science (LNCS) 9041, ISBN 978-3-319-18110-3, Springer, 4/2015.

Abstract

Good performance in Statistical Machine Translation (SMT) is usually achieved with huge parallel bilingual training corpora, because the translations of words and phrases are computed from bilingual data. However, for low-resource language pairs such as English-Bengali, performance suffers from an insufficient amount of bilingual training data. Recently, comparable corpora have become widely regarded as valuable resources for machine translation. Although very few cases of sub-sentential parallelism are found between two comparable documents, comparable corpora still contain potentially parallel phrases. Mining parallel data from comparable corpora is therefore a promising way to collect more parallel training data for SMT. In this paper, we propose an approach for automatically aligning English-Bengali comparable sentences from comparable documents.

German Research Center for Artificial Intelligence