DFKI-LT - Dissertation Series

Vol. VI

Thorsten Brants: Tagging and Parsing with Cascaded Markov Models - Automation of Corpus Annotation

ISBN: 3-933218-05-5
174 pages
price: € 13


The methods presented in this thesis aim at automating corpus annotation and the processing of large corpora. Automation enables the efficient creation of linguistically interpreted corpora, which are, on the one hand, a prerequisite for theoretical linguistic investigations and the development of grammatical processing models. On the other hand, they are the basis for the further development of corpus-based taggers and parsers and thereby take part in a bootstrapping process.

The presented methods are based on Markov Models, which treat spoken or written utterances as probabilistic sequences. For written language processing, their most prominent application is probably part-of-speech tagging, i.e., the assignment of morpho-syntactic categories to words. We show that the technique used for part-of-speech tagging carries over to higher levels of linguistic annotation: Markov Models are suitable for a broader class of labeling tasks and for the generation of hierarchical structures.
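As a toy illustration of this kind of model, the sketch below decodes the most probable tag sequence for a short sentence with the Viterbi algorithm. All words, tags, and probabilities are invented for illustration; they are not taken from the thesis.

```python
# Toy HMM part-of-speech tagger: hidden states are tags, emissions are words.
# All probabilities below are invented for illustration.

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` (Viterbi algorithm)."""
    V = [{t: (start_p[t] * emit_p[t].get(words[0], 1e-6), [t]) for t in tags}]
    for w in words[1:]:
        row = {}
        for t in tags:
            p, path = max(
                (V[-1][s][0] * trans_p[s][t] * emit_p[t].get(w, 1e-6), V[-1][s][1])
                for s in tags
            )
            row[t] = (p, path + [t])
        V.append(row)
    prob, path = max(V[-1].values())
    return path

tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET":  {"DET": 0.05, "NOUN": 0.9,  "VERB": 0.05},
    "NOUN": {"DET": 0.1,  "NOUN": 0.3,  "VERB": 0.6},
    "VERB": {"DET": 0.5,  "NOUN": 0.3,  "VERB": 0.2},
}
emit_p = {
    "DET":  {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"barks": 0.6},
}
print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
# -> ['DET', 'NOUN', 'VERB']
```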

While part-of-speech tagging assigns a category to each word, the presented method of tagging grammatical functions assigns a function to each word/tag pair. Going up in the hierarchy, Markov Models determine phrase categories for a given structural element.
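The shift in the observation alphabet can be sketched as follows: the hidden states become grammatical functions and the emissions become (word, tag) pairs. The sketch uses a simple greedy decode rather than the full Viterbi search of the actual model, and the function labels (SB for subject, HD for head) merely mimic an annotation scheme; all probabilities are invented.

```python
# Function tagging sketch: observations are (word, tag) pairs, hidden states
# are grammatical functions. Probabilities are invented for illustration.

emit_p = {
    "SB": {("dog", "NOUN"): 0.4},
    "HD": {("barks", "VERB"): 0.7},
}
trans_p = {
    "<s>": {"SB": 0.5, "HD": 0.2},
    "SB":  {"SB": 0.1, "HD": 0.8},
    "HD":  {"SB": 0.3, "HD": 0.1},
}

def greedy_functions(pairs):
    """Greedily pick the most probable function for each (word, tag) pair."""
    prev, out = "<s>", []
    for pair in pairs:
        f = max(emit_p, key=lambda s: trans_p[prev].get(s, 0) * emit_p[s].get(pair, 1e-6))
        out.append(f)
        prev = f
    return out

print(greedy_functions([("dog", "NOUN"), ("barks", "VERB")]))
# -> ['SB', 'HD']
```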

The technique is further extended to implement a shallow parsing model. Instead of a single word or a single symbol, each state of the proposed Markov Models emits context-free partial parse trees. Each layer of the resulting structure is represented by its own Markov Model, hence the name Cascaded Markov Models. The output of each layer of the cascades is a probability distribution over possible bracketings and labelings for that layer. This output forms a lattice and is passed as input to the next layer.
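The cascade idea can be sketched in miniature: each layer consumes a weighted lattice of hypotheses from the layer below and emits a new weighted lattice. The rewrite rules and their scores below are invented stand-ins; the actual model scores bracketings and labelings with a Markov Model per layer.

```python
# Minimal cascade sketch: each layer maps a weighted lattice of hypotheses to a
# new weighted lattice. Rules and scores are invented; hypotheses that match no
# rule are dropped here for brevity.

def layer(hypotheses, rules):
    """Apply one cascade layer: rewrite each hypothesis by every matching rule."""
    out = []
    for seq, p in hypotheses:
        for pattern, label, q in rules:
            n = len(pattern)
            for i in range(len(seq) - n + 1):
                if tuple(seq[i:i + n]) == pattern:
                    out.append((seq[:i] + [label] + seq[i + n:], p * q))
    # keep the most probable hypotheses first
    return sorted(out, key=lambda h: -h[1])

# Layer 1 groups DET+NOUN into an NP; layer 2 groups NP+VERB into an S.
lattice0 = [(["DET", "NOUN", "VERB"], 1.0)]
lattice1 = layer(lattice0, [(("DET", "NOUN"), "NP", 0.9)])
lattice2 = layer(lattice1, [(("NP", "VERB"), "S", 0.8)])
print(lattice2[0][0])   # best hypothesis spans the sentence as a single S
```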

After presenting the methods, we investigate two applications of Cascaded Markov Models: creation of resources in corpus annotation and partial parsing as pre-processing for other applications.

During corpus annotation, an instance of the model and a human annotator interact. Cascaded Markov Models create the syntactic structure of a sentence layer by layer, so that the human annotator can follow and, if necessary, correct the automatic output. The result is very efficient corpus annotation. Additionally, we exploit a feature that is particular to probabilistic models: the alternative assignments and their probabilities provide important information about the reliability of the automatic annotation. Unreliable assignments can be identified automatically and may trigger additional actions in order to achieve high accuracy.
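One simple way to operationalize such a reliability check is to compare the best and second-best hypotheses: when their probabilities are close, the assignment is flagged for the human annotator. The threshold, labels, and scores below are invented for illustration, not the thesis's actual criterion.

```python
# Reliability sketch: flag an assignment when the best hypothesis is not
# clearly more probable than the runner-up. Threshold and scores are invented.

def flag_unreliable(alternatives, ratio=2.0):
    """Return (best label, True if the best is < `ratio` times the second)."""
    ranked = sorted(alternatives, key=lambda a: -a[1])
    best, second = ranked[0], ranked[1]
    return best[0], best[1] < ratio * second[1]

print(flag_unreliable([("NP", 0.48), ("PP", 0.45), ("AP", 0.07)]))
# -> ('NP', True)   close call: ask the annotator
print(flag_unreliable([("NP", 0.95), ("PP", 0.05)]))
# -> ('NP', False)  confident: annotate automatically
```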

The second application uses Cascaded Markov Models without human supervision. The possibly ambiguous output of a lower layer is directly passed to the next layer. This type of processing is well suited for partial parsing (chunking), e.g., the recognition of noun phrases, prepositional phrases, and their constituents. Partial parsing delivers less information than deep parsing, but with much higher accuracy and speed. Both are important features for processing large corpora and for the use in applications like message extraction and information retrieval.
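To illustrate what partial parsing delivers, the sketch below marks flat, non-recursive noun-phrase spans over a tag sequence with a hand-written pattern (optional determiner, adjectives, nouns). The pattern is an invented simplification; the actual chunker is probabilistic.

```python
# Chunking sketch: recover flat NP spans (DET? ADJ* NOUN+) from a tag sequence.
# The pattern is an invented simplification of what a chunker recognizes.

def np_chunks(tags):
    """Return (start, end) spans of simple noun phrases: DET? ADJ* NOUN+."""
    spans, i = [], 0
    while i < len(tags):
        j = i
        if tags[j] == "DET":
            j += 1
        while j < len(tags) and tags[j] == "ADJ":
            j += 1
        k = j
        while k < len(tags) and tags[k] == "NOUN":
            k += 1
        if k > j:                # at least one noun: a chunk was found
            spans.append((i, k))
            i = k
        else:
            i += 1
    return spans

print(np_chunks(["DET", "ADJ", "NOUN", "VERB", "PREP", "DET", "NOUN"]))
# -> [(0, 3), (5, 7)]
```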

We evaluate the proposed methods on German and English corpora representing the domains of newspaper text and transliterated spoken dialogues. In addition to standard measures like accuracy, precision, and recall, we present learning curves obtained with different amounts of training data, and take selected alternative assignments into account. For part-of-speech tagging and chunking of German and English corpora, our results (96.3% - 97.7% accuracy for tagging; 85% - 91% recall and 88% - 94% precision for chunking) are on a par with state-of-the-art results reported in the literature. For the assignment of grammatical functions and phrase labels and for the interactive annotation task, our results are the first published.
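The chunking scores quoted above are computed by comparing predicted chunk spans against gold-standard spans; precision is the fraction of predicted chunks that are correct, recall the fraction of gold chunks that were found. The spans below are invented toy data.

```python
# Precision/recall over chunk spans, as used for chunking evaluation.
# The predicted and gold spans are invented toy data.

def precision_recall(predicted, gold):
    """Exact-match precision and recall over (start, end) chunk spans."""
    pred, ref = set(predicted), set(gold)
    correct = len(pred & ref)
    return correct / len(pred), correct / len(ref)

p, r = precision_recall(
    predicted=[(0, 3), (5, 7), (8, 9)],
    gold=[(0, 3), (4, 7), (8, 9)],
)
print(f"precision={p:.2f} recall={r:.2f}")
# -> precision=0.67 recall=0.67
```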

The efficient annotation of the NEGRA corpus was the first practical application of the presented methods. They are now being successfully used for the annotation of several other corpora in different languages and domains, with different annotation schemes.