Appraise: An Open-Source Toolkit for Manual Phrase-Based Evaluation of Translations

Christian Federmann

In: Daniel Tapias, Mike Rosner, Stelios Piperidis, Jan Odijk, Joseph Mariani, Bente Maegaard, Khalid Choukri, Nicoletta Calzolari (Conference Chair) (eds.): Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC-10), May 19-21, 2010, Valletta, Malta. European Language Resources Association (ELRA), ISBN 2-9517408-6-7, 5/2010.


We describe a focused effort to investigate the performance of phrase-based, human evaluation of machine translation output achieving high annotator agreement. We define phrase-based evaluation and describe the implementation of Appraise, a toolkit that supports the manual evaluation of machine translation results. Phrase ranking can be done using either a fine-grained six-way scoring scheme that allows annotators to differentiate between "much better" and "slightly better", or a reduced subset of ranking choices. We then discuss kappa values for both scoring models from several experiments conducted with human annotators. Our results show that phrase-based evaluation can be used for fast evaluation, obtaining significant agreement among annotators. The granularity of ranking choices should, however, not be too fine-grained, as this seems to confuse annotators and thus reduces overall agreement. The work reported in this paper confirms previous work in the field and illustrates that the usage of human evaluation in machine translation should be reconsidered. The Appraise toolkit is available as open source and can be downloaded from the author's website.
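The abstract reports annotator agreement as kappa values. As a minimal illustration of the kind of statistic involved (not the paper's own implementation), Cohen's kappa for two annotators compares observed agreement against the agreement expected from each annotator's label frequencies alone; the label names below are invented for the example:

```python
from collections import Counter

def cohen_kappa(ann_a, ann_b):
    """Cohen's kappa for two annotators' parallel label sequences."""
    assert len(ann_a) == len(ann_b) and len(ann_a) > 0
    n = len(ann_a)
    # Observed agreement: fraction of items labelled identically.
    p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Expected chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(ann_a), Counter(ann_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical rankings of six phrases on a reduced three-way scheme.
a = ["better", "better", "worse", "equal", "better", "worse"]
b = ["better", "equal", "worse", "equal", "better", "better"]
print(round(cohen_kappa(a, b), 3))  # prints 0.478
```

A kappa near 1 indicates near-perfect agreement, while values near 0 indicate agreement no better than chance; a coarser ranking scheme tends to raise kappa by removing borderline distinctions, which matches the paper's finding that overly fine-grained choices reduce agreement.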


Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI) / German Research Center for Artificial Intelligence