Parallel Corpus Refinement as an Outlier Detection Algorithm

Kaveh Taghipour, Shahram Khadivi, Jia Xu

In: MT Summit XIII. Machine Translation Summit (MT Summit-11), September 19-23, 2011, Xiamen, China.


Filtering noisy parallel corpora, or removing mistranslations from training sets, can improve the quality of a statistical machine translation system. Discriminative methods for filtering the corpora, such as a maximum entropy model, need properly labeled training data, which are usually unavailable. Generating all possible sentence pairs (the Cartesian product) to obtain labeled data produces an imbalanced training set that contains only a few correct translations and is thus inappropriate for training a classifier. To treat this problem effectively, unsupervised methods are used and the problem is modeled as an outlier detection procedure. The experiments show that a filtered corpus results in improved translation quality, even with some sentence pairs removed.
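
To make the outlier-detection framing concrete, the following is a minimal illustrative sketch and not the method of the paper, whose features and detector are not given in this abstract. It assumes a single hypothetical feature (the target-to-source length ratio in whitespace tokens) and flags pairs whose robust z-score, computed from the median and the median absolute deviation, exceeds an assumed threshold of 3.5. No labeled data is needed, which is the point of the unsupervised setting. For comparison, if each of n source sentences has exactly one correct translation, the Cartesian product yields n² candidate pairs of which only n are correct, so positives make up just 1/n of the labeled set.

```python
import statistics


def length_ratio(src, tgt):
    """Hypothetical feature: target length divided by source length (in tokens)."""
    return len(tgt.split()) / max(len(src.split()), 1)


def filter_outliers(pairs, threshold=3.5):
    """Keep sentence pairs whose length ratio is not an outlier.

    Uses a robust z-score based on the median and the median absolute
    deviation (MAD). The threshold is an assumed default, not a value
    taken from the paper.
    """
    pairs = list(pairs)
    if not pairs:
        return []
    ratios = [length_ratio(src, tgt) for src, tgt in pairs]
    med = statistics.median(ratios)
    mad = statistics.median([abs(r - med) for r in ratios])
    if mad == 0:
        # No spread in the feature: nothing can be called an outlier.
        return pairs
    kept = []
    for pair, ratio in zip(pairs, ratios):
        score = 0.6745 * abs(ratio - med) / mad
        if score <= threshold:
            kept.append(pair)
    return kept


# Toy corpus: the last pair has a wildly different length ratio.
corpus = [
    ("a b c d", "w x y z"),
    ("a b c", "w x y z"),
    ("a b c d e", "v w x y"),
    ("a b", "x y"),
    ("a b c d", "x y z"),
    ("a b", " ".join(["x"] * 20)),
]
clean = filter_outliers(corpus)
```

A real filter would presumably combine several features (for example lexical translation scores) rather than length alone; the single feature here only keeps the sketch short.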



Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence