

Annotation of Error Types for a German News Corpus

Andrew Bredenkamp; Judith Klein; Berthold Crysmann
In: Journées d'étude de l'ATALA sur les Corpus Annotés pour la Syntaxe (Treebanks), June 18-19, Paris, France, 1999.


This paper discusses the corpus annotation effort in the FLAG project and its application to the development of controlled language and grammar checking tools. The main aim of the German government-funded FLAG project is to develop technologies for controlled language (CL) and grammar checking applications for German. The project work has therefore been divided into two separate but complementary streams of activity. Firstly, the aim was to develop a modular NLP software architecture for quickly building different kinds of CL and grammar checking applications. Secondly, to validate the first activity, it was seen as important to build up an empirical base for testing and formally evaluating checking components.

Given the lack of existing annotated corpora of errors for German (or indeed for any language, as far as the authors know), the construction of such a corpus was a high-priority task. This would enable us not only to perform quantitative tests, but also to derive an empirically based typology of errors which the project could use for orientation. The corpus was particularly important given the approach which the FLAG project was taking to the task of grammar and controlled language checking, which relies on a phenomenon-oriented approach to the problem of identifying errors, using shallow processing techniques. In order to fine-tune the heuristics which are central to such an approach, i.e. one based on identifying "candidate errors" of increasing probability, it is essential to have good test suites annotated with respect to the phenomena under investigation. The annotation of the corpus was to be carried out in such a way that we could easily access and quantify snapshots of the data, both for producing test suites for testing purposes and for producing statistics on the frequency of particular error types.
Not only did the research community lack an annotated corpus of errors; there was also no existing ontology of errors which could be easily translated into an annotation schema. The definition of such a schema, based on traditional descriptions of errors (such as Luik, 1993a; Luik, 1993b), thus formed the first major workpackage. Fortunately, tools for the annotation of corpora, and for the management thereof, are becoming increasingly sophisticated; it was therefore necessary to evaluate a number of tools in the light of our specific needs.
