Publication

Preprocessing and Tokenisation Standards in DELPH-IN Tools

Ben Waldron; Ann Copestake; Ulrich Schäfer; Bernd Kiefer
In: Proceedings of the 5th International Conference on Language Resources and Evaluation LREC-2006. International Conference on Language Resources and Evaluation (LREC), Genoa, Italy, 5/2006.

Abstract

We discuss preprocessing and tokenisation standards within DELPH-IN, a large-scale open-source collaboration providing multiple independent multilingual shallow and deep processors. We describe (i) a component-specific XML interface format which has been used for some time to interface preprocessor results to the PET parser, and (ii) our implementation of a more generic XML interface format influenced heavily by the (ISO working draft) Morphosyntactic Annotation Framework (MAF). Our generic format encapsulates the information which may be passed from the preprocessing stage to a parser: it uses standoff annotation, a lattice for the representation of structural ambiguity, and intra-annotation dependencies, and it allows for highly structured annotation content. This work builds on the existing Heart of Gold middleware system, and previous work on Robust Minimal Recursion Semantics (RMRS) as part of an inter-component interface. We give examples of usage with a number of the DELPH-IN processing components and deep grammars.
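
The ideas named in the abstract (standoff annotation, a token lattice for structural ambiguity, and dependencies between annotations) can be illustrated with a minimal Python sketch. This is not the XML format described in the paper and it does not follow MAF syntax; the class `TokenEdge`, its field names, the helper functions, and the example sentence are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenEdge:
    """Hypothetical lattice edge: one token hypothesis between two lattice states."""
    edge_id: str
    source: int        # lattice state where the token starts
    target: int        # lattice state where the token ends
    char_from: int     # standoff pointer: start character offset in the raw text
    char_to: int       # standoff pointer: end character offset (exclusive)
    deps: tuple = ()   # ids of other edges this annotation depends on


def surface(text, edge):
    """Recover the token string from the raw text via its standoff offsets."""
    return text[edge.char_from:edge.char_to]


def paths(edges, start, end):
    """Enumerate every edge sequence from state `start` to state `end`.

    Assumes the lattice is acyclic, so the recursion terminates.
    """
    if start == end:
        return [[]]
    found = []
    for edge in edges:
        if edge.source == start:
            for rest in paths(edges, edge.target, end):
                found.append([edge] + rest)
    return found


def tokenisations(text, edges, start, end):
    """List the alternative tokenisations encoded by the lattice."""
    return [[surface(text, e) for e in path] for path in paths(edges, start, end)]


# The raw text is never modified; every annotation only points into it.
TEXT = "don't stop"
EDGES = [
    TokenEdge("t1", 0, 2, 0, 5),                # "don't" as a single token
    TokenEdge("t2", 0, 1, 0, 2),                # "do"
    TokenEdge("t3", 1, 2, 2, 5, deps=("t2",)),  # "n't", dependent on "do"
    TokenEdge("t4", 2, 3, 6, 10),               # "stop"
]

if __name__ == "__main__":
    for alternative in tokenisations(TEXT, EDGES, 0, 3):
        print(alternative)
```

In this sketch the ambiguity between one token and two tokens for "don't" is kept as two competing paths through the lattice, so a downstream parser could choose between them rather than having the preprocessor commit early.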