DFKI-LT - Parallel Corpus Refinement as an Outlier Detection Algorithm
MT Summit XIII, Xiamen, China, 9/2011
Filtering noisy parallel corpora, or removing mistranslations from training sets, can improve the quality of statistical machine translation. Discriminative methods for filtering corpora, such as a maximum entropy model, need properly labeled training data, which are usually unavailable. Generating all possible sentence pairs (the Cartesian product) to obtain labeled data produces an imbalanced training set that contains only a few correct translations and is thus inappropriate for training a classifier. To treat this problem effectively, unsupervised methods are used and the problem is modeled as an outlier detection procedure. The experiments show that a filtered corpus results in improved translation quality, even though some sentence pairs have been removed.
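The idea of treating corpus filtering as unsupervised outlier detection can be illustrated with a minimal sketch. The abstract does not state which features or which detector the paper uses, so everything below is an assumption made for illustration: the single feature (target-to-source token-count ratio), the robust z-score based on the median absolute deviation, the cutoff of 3.5, and the names `length_ratio` and `filter_outliers` are all hypothetical.

```python
from statistics import median


def length_ratio(src: str, tgt: str) -> float:
    """Hypothetical feature: target/source token-count ratio."""
    return len(tgt.split()) / max(len(src.split()), 1)


def filter_outliers(pairs, threshold=3.5):
    """Keep sentence pairs whose feature is not an outlier.

    Uses a robust z-score built from the median and the median
    absolute deviation (MAD). No labeled data is required.
    The feature and the threshold are illustrative assumptions,
    not the method described in the paper.
    """
    feats = [length_ratio(s, t) for s, t in pairs]
    med = median(feats)
    mad = median([abs(f - med) for f in feats])
    if mad == 0:
        # No measurable spread: nothing can be flagged, keep everything.
        return list(pairs)
    kept = []
    for pair, f in zip(pairs, feats):
        # 0.6745 rescales the MAD to be comparable to a standard deviation.
        z = 0.6745 * abs(f - med) / mad
        if z <= threshold:
            kept.append(pair)
    return kept
```

A real filter would likely combine several features (for example lexical translation scores or alignment statistics) and a multivariate detector; the single-feature version only shows that suspicious pairs can be removed without any labeled examples.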
Files: BibTeX, corpusFiltering2.pdf