DFKI-LT - Parallel Corpus Refinement as an Outlier Detection Algorithm

Kaveh Taghipour, Shahram Khadivi, Jia Xu
Parallel Corpus Refinement as an Outlier Detection Algorithm
MT Summit XIII, Xiamen, China, 9/2011
 
Filtering noisy parallel corpora, or removing mistranslations from training sets, can improve the quality of a statistical machine translation system. Discriminative methods for filtering corpora, such as a maximum entropy model, need properly labeled training data, which are usually unavailable. Generating all possible sentence pairs (the Cartesian product) to obtain labeled data produces an imbalanced training set that contains only a few correct translations and is thus inappropriate for training a classifier. To treat this problem effectively, unsupervised methods are used and the problem is modeled as outlier detection. The experiments show that a filtered corpus results in improved translation quality, even with some sentence pairs removed.
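As a rough illustration of the two ideas in the abstract, the Python sketch below first counts how skewed the Cartesian-product labeling becomes, then filters sentence pairs with a simple unsupervised outlier test. Everything in it is an assumption made for illustration: the function names are hypothetical, the single feature (token-length ratio) and the median/MAD modified z-score with a 3.5 cutoff are stand-ins, and none of it is taken from the paper's actual feature set or detection method.

```python
from statistics import median


def cartesian_label_counts(n_pairs):
    """Return (positives, negatives) when every source sentence is paired
    with every target sentence. Assumes exactly one correct target per
    source sentence (an illustrative simplification)."""
    total = n_pairs * n_pairs
    positives = n_pairs
    return positives, total - positives


def filter_outliers(pairs, threshold=3.5):
    """Keep pairs whose token-length ratio is not an outlier.

    Hypothetical stand-in feature and detector: the ratio of source to
    target token counts, scored with a modified z-score based on the
    median and the median absolute deviation (MAD)."""
    ratios = [len(src.split()) / max(len(tgt.split()), 1) for src, tgt in pairs]
    med = median(ratios)
    mad = median(abs(r - med) for r in ratios)
    if mad == 0:
        # No spread to measure against, so nothing can be flagged.
        return list(pairs)
    kept = []
    for pair, ratio in zip(pairs, ratios):
        score = 0.6745 * abs(ratio - med) / mad
        if score <= threshold:
            kept.append(pair)
    return kept
```

With 1,000 aligned sentence pairs, `cartesian_label_counts(1000)` gives 1,000 correct pairs against 999,000 incorrect ones, which is the imbalance the abstract refers to. No labels are needed for `filter_outliers`; it only compares each pair against the bulk of the corpus.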
 
Files: BibTeX, corpusFiltering2.pdf