MultiUN: A Multilingual Corpus from United Nation Documents

Andreas Eisele, Yu Chen

In: Daniel Tapias, Mike Rosner, Stelios Piperidis, Jan Odijk, Joseph Mariani, Bente Maegaard, Khalid Choukri, Nicoletta Calzolari (Conference Chair) (editors). Proceedings of the Seventh conference on International Language Resources and Evaluation. International Conference on Language Resources and Evaluation (LREC-2010), May 19-21, La Valletta, Malta. Pages 2868-2872. ISBN 2-9517408-6-7. European Language Resources Association (ELRA), 5/2010.


This paper describes the acquisition, preparation, and properties of a corpus extracted from the official documents of the United Nations (UN). The corpus is available in all six official languages of the UN and comprises around 300 million words per language. We describe the methods we used for crawling, document formatting, and sentence alignment. The corpus also includes a common test set for machine translation. We present the results of a French-Chinese machine translation experiment performed on this corpus.

German Research Center for Artificial Intelligence (Deutsches Forschungszentrum für Künstliche Intelligenz)