MATE Deliverable D1.1

Supported Coding Schemes

1. Overview



Introduction
MATE's Aims
Scheme Evaluation Guidelines
Further Content of this Document


 

Introduction

In recent years, corpus-based approaches have gained significant importance in the field of natural language processing (NLP). Large corpora for many different languages are currently being collected all over the world. To reuse this data for training and testing NLP systems, the corpora must be annotated in various ways [Carletta et al. 1997]. Such annotation assumes an underlying coding scheme. The design of a scheme depends on the task and on the linguistic phenomena on which its developers focus, and the author's own style also affects the scheme. For these reasons, reusing annotated corpora is extremely difficult.

The Discourse Resource Initiative (DRI) was started as an effort to assemble discourse resources in support of discourse research and applications. The goal of this initiative is to develop a standard for semantic/pragmatic and discourse features of annotated corpora [Carletta et al. 1997]. Another project, LE-EAGLES, aims to provide preliminary guidelines for the representation or annotation of dialogue resources for language engineering [Leech et al. 1998]. These guidelines cover orthographic transcription as well as morpho-syntactic, syntactic, prosodic, and pragmatic annotation. Instead of developing a standard, however, they describe the most widely used schemes, mark-up languages and annotation systems.
 

MATE's Aims

MATE aims to develop a preliminary standard for annotation schemes at the levels of prosody, morpho-syntax, co-reference, dialogue acts, and communication problems, as well as for the interaction between these levels.

MATE's annotation standard is meant to be closely related to the standardisation efforts in the US, Europe and Japan and will thus build on the work of DRI and EAGLES, mentioned above.

The annotation standard will allow multi-linguality and the co-existence of a multitude of coding schemes. This report provides the basis for deciding which existing coding schemes MATE should support. It gives a broad overview of current schemes and covers all levels under consideration.

The information in this report was collected from the web, from recent proceedings and through personal contact. In the future we will continue to search for schemes that were inadvertently left out of this report. A web version of this report will be available and regularly updated even after the deadline of deliverable D1.1.

The results of this report will feed into the implementation of the MATE workbench (WP3), a tool box in support of the MATE standard, and will form the basis for the definition of level mark-up (WP2).
 

Scheme Evaluation Guidelines

A great deal of research has been done in the field of annotation schemes. Each scheme must therefore be examined carefully to decide whether it will be supported in the MATE project or whether it does not seem reliable enough and hence has to be omitted. To ease this decision, guidelines have been used, which are listed below. All examined schemes were tested against these guidelines. A detailed listing of all schemes can be found in the Appendices.
 

Further Content of this Document

Chapters 2 to 7 cover the levels of communication problems, co-reference, cross-level issues, dialogue acts, (morpho-)syntax and prosody. The examined schemes, which can be found in the Annexes, are compared at each level. At the end of each chapter, conclusions about which schemes to support are drawn from these comparisons.

Chapter 8 summarises the results of the research on the different levels and outlines further work that could be done in this field.

Last Modification: 28.8.1998 by Marion Klein