

Linguistically Motivated Evaluation of Machine Translation Metrics based on a Challenge Set

Eleftherios Avramidis; Vivien Macketanz
In: Proceedings of the Seventh Conference on Machine Translation. Conference on Machine Translation (WMT-2022), December 7-8, Abu Dhabi, United Arab Emirates, Association for Computational Linguistics, 12/2022.


We employ a linguistically motivated challenge set in order to evaluate the state-of-the-art machine translation metrics submitted to the Metrics Shared Task of the 7th Conference on Machine Translation. The challenge set includes about 20,000 items extracted from 145 MT systems for two language directions (German ↔ English), covering more than 100 linguistically motivated phenomena organized in a dozen categories. The best-performing metrics are YiSi-1, COMET-22 and BLEURT for German-English, and XL-DA for English-German, followed by BLEURT, COMET-22 and UniTE. In both directions, metrics perform worst on named entities and terminology. In German-English in particular, they are weak at detecting issues with punctuation, polar questions and idioms. In English-German, they perform worst on the future II progressive of intransitive verbs, focus particles and the present progressive of transitive verbs.
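To illustrate the general idea of challenge-set evaluation described above, here is a minimal sketch, assuming the common pairwise protocol in which each item pairs a correct translation with a contrastive erroneous one, and a metric passes the item if it scores the correct translation higher. The function names, the toy word-overlap metric, and the example item are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of pairwise challenge-set evaluation (assumed protocol):
# a metric "passes" an item if it scores the correct translation higher
# than the contrastive (erroneous) one.

def challenge_set_accuracy(items, metric):
    """items: list of (source, reference, good_mt, bad_mt) tuples.
    metric: callable(source, reference, hypothesis) -> float score."""
    passed = 0
    for src, ref, good, bad in items:
        if metric(src, ref, good) > metric(src, ref, bad):
            passed += 1
    return passed / len(items)

# Toy metric (word overlap with the reference), only to exercise the function;
# real submissions would be neural metrics such as COMET-22 or BLEURT.
def overlap_metric(src, ref, hyp):
    ref_words = set(ref.split())
    hyp_words = set(hyp.split())
    return len(ref_words & hyp_words) / max(len(hyp_words), 1)

# One invented German-English item with a lexical-choice error.
items = [
    ("Er ist Arzt.", "He is a doctor.", "He is a doctor.", "He is a nurse."),
]
print(challenge_set_accuracy(items, overlap_metric))  # 1.0 on this toy item
```

Per-phenomenon accuracies of this kind are what allow the comparisons reported above, e.g. that metrics score lowest on named entities and terminology.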