

Best Practices for Crowd-based Evaluation of German Summarization: Comparing Crowd, Expert and Automatic Evaluation

Neslihan Iskender; Tim Polzehl; Sebastian Möller
In: Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems. ACL Workshop on Evaluation and Comparison of NLP Systems (Eval4NLP-2020), co-located with EMNLP 2020, November 12, Punta Cana, Dominican Republic, Pages 164-175, EMNLP | Eval4NLP, Association for Computational Linguistics (ACL), 2020.


One of the main challenges in the development of summarization tools is the evaluation of summarization quality. On the one hand, human assessment of summarization quality by linguistic experts is slow, expensive, and still not a standardized procedure. On the other hand, automatic assessment metrics are reported not to correlate highly enough with human quality ratings. As a solution, we propose crowdsourcing as a fast, scalable, and cost-effective alternative to expert evaluation for assessing the intrinsic and extrinsic quality of summarization. We compare crowd ratings with expert ratings and with automatic metrics such as ROUGE, BLEU, and BertScore on a German summarization data set. Our results provide a basis for best practices for crowd-based summarization evaluation regarding major influential factors: the best annotation aggregation method, the influence of readability and reading effort on summarization evaluation, and the optimal number of crowd workers needed to achieve results comparable to those of experts. This holds especially when determining factors such as overall quality, grammaticality, referential clarity, focus, structure & coherence, summary usefulness, and summary informativeness.
