DFKI-LT - How can we measure machine translation quality?

Christian Federmann
How can we measure machine translation quality?
Proceedings of the Tralogy Conference, Paris, France, L’Institut de l’Information Scientifique et Technique, 12/2011.
Participant in panel discussion on MT evaluation with Philipp Koehn and John Moran, chaired by Josef van Genabith.

In this opinion paper, we describe our research on machine translation evaluation approaches that include mechanisms for human feedback and are designed to allow partial adaptation of the translation models being evaluated. While a plethora of automatic evaluation metrics for machine translation exists, their output (scores, distances, etc.) is often neither transparent to translators nor well correlated with manual evaluation by human experts. Even worse, tuning machine translation systems on these automatic metrics to a certain extent moves the research focus in the wrong direction, shifting it from « good » translations to those with « higher scores ». This further widens the gap between machine translation research and translation producers or users. We first describe several automatic metrics that are used in current machine translation research. Afterwards, we provide a brief overview of the manual evaluation techniques used in our machine translation group. As minimum error rate training for tuning (statistical) machine translation systems is an important part of the workflow, we think that a (semi-)automatic implementation of such evaluation tasks would be a helpful extension of current state-of-the-art machine translation systems. We conclude by describing the need to shift from automated metrics to consumer-oriented, semi-automatic evaluation, as this seems highly important if more advanced MT techniques are to see wider acceptance and usage in real-life applications.