DFKI-LT - An Awkward Disparity between BLEU / RIBES Scores and Human Judgements in Machine Translation

Li Ling Tan, Jonathan Dehdari, Josef van Genabith
An Awkward Disparity between BLEU / RIBES Scores and Human Judgements in Machine Translation
Proceedings of the Workshop on Asian Translation (WAT-2015), Pages 74-81, Kyoto, Japan, Association for Computational Linguistics, 2015
 
Automatic evaluation of machine translation (MT) quality is essential in developing high-quality MT systems. Despite previous criticisms, BLEU remains the most popular MT metric. Previous studies on the schism between BLEU and manual evaluation highlighted the poor correlation between the two, as seen in MT systems with low BLEU scores and high manual evaluation scores. The RIBES metric, which is more sensitive to reordering, has been shown to correlate better with human judgements, but in our experiments it also fails to correlate with them. In this paper we demonstrate, via our submission to the Workshop on Asian Translation 2015 (WAT 2015), a patent translation system with very high BLEU and RIBES scores but very poor human judgement scores.
 
Files: BibTeX, W15-5009.pdf