An Awkward Disparity between BLEU / RIBES Scores and Human Judgements in Machine Translation

Li Ling Tan, Jon Dehdari, Josef van Genabith

In: Proceedings of the Workshop on Asian Translation (WAT-2015). Workshop on Asian Translation (WAT-15), October 16, Kyoto, Japan. Pages 74-81. Association for Computational Linguistics, 2015.


Automatic evaluation of machine translation (MT) quality is essential in developing high-quality MT systems. Despite previous criticisms, BLEU remains the most popular machine translation metric. Previous studies on the schism between BLEU and manual evaluation highlighted the poor correlation between the two, pointing to MT systems with low BLEU scores and high manual evaluation scores. Alternatively, the RIBES metric, which is more sensitive to reordering, has been shown to correlate better with human judgements, but in our experiments it also fails to correlate with them. In this paper we demonstrate, via our submission to the Workshop on Asian Translation 2015 (WAT 2015), a patent translation system with very high BLEU and RIBES scores and very poor human judgement scores.

German Research Center for Artificial Intelligence