OCAS: Ontology-Based Corpus and Annotation Scheme. Towards an OBIE Gold Standard that Contains even Implicit Facts

Alexander Grothkast; Benjamin Adrian; Kinga Schumacher; Andreas Dengel
In: Sebastian Blohm; Ulf Brefeld; Felix Jungermann; Roman Yangarber (Eds.). Proceedings of the High-level Information Extraction Workshop 2008. High-level Information Extraction Workshop (HLIE-2008), located at ECML PKDD 2008, September 15-19, Antwerpen, Belgium, Pages 25-35, ECML PKDD 2008, 2008.


This paper presents strategies and lessons learned from the creation of a corpus, and proposes a gold standard for evaluating ontology-based information extraction (OBIE) systems. This OBIE gold standard is called OCAS2008 and consists of: (i) an OBIE layer cake for comparing OBIE systems by subtask, (ii) a corpus of 121 documents with 31,000 words about a closed domain, (iii) a compact domain ontology including more than 40,000 instances, (iv) two annotation scenarios that extend traditional template-based evaluations, (v) an annotation set that contains typed annotations according to the ontology and the OBIE layer cake, (vi) annotations that cover text phrases, symbols, instances, explicitly written facts, and implicit facts, and (vii) human-created annotations that follow predefined specifications. We claim that the use of OCAS2008 provides a basis for comparable and significant evaluations of OBIE systems.



