Publication
Findings of the WMT25 Multilingual Instruction Shared Task: Persistent Hurdles in Reasoning, Generation, and Evaluation
Tom Kocmi; Sweta Agrawal; Ekaterina Artemova; Eleftherios Avramidis; Eleftheria Briakou; Pinzhen Chen; Marzieh Fadaee; Markus Freitag; Roman Grundkiewicz; Yupeng Hou; Philipp Koehn; Julia Kreutzer; Saab Mansour; Stefano Perrella; Lorenzo Proietti; Parker Riley; Eduardo Sánchez; Patricia Schmidtova; Mariya Shmatova; Vilém Zouhar
In: Barry Haddow; Tom Kocmi; Philipp Koehn; Christof Monz (Eds.). Proceedings of the Tenth Conference on Machine Translation. Conference on Machine Translation (WMT-25), November 8-9, Suzhou, China, Pages 414-435, ISBN 979-8-89176-341-8, Association for Computational Linguistics, 11/2025.
Abstract
The WMT25 Multilingual Instruction Shared Task (MIST) introduces a benchmark to evaluate large language models (LLMs) across 30 languages. The benchmark covers five types of problems: machine translation, linguistic reasoning, open-ended generation, cross-lingual summarization, and LLM-as-a-judge. We provide automatic evaluation and collect human annotations, which highlight the limitations of automatic evaluation and allow further research into metric meta-evaluation. We run a diverse set of open- and closed-weight LLMs on our benchmark, providing a broad assessment of the multilingual capabilities of current LLMs. Results highlight substantial variation across sub-tasks and languages, revealing persistent challenges in reasoning, cross-lingual generation, and evaluation reliability. This work establishes a standardized framework for measuring future progress in multilingual LLM development.
