DFKI-LT - Dissertation Series

Vol. IX

Sonja Müller-Landmann: Corpus-based Parse Pruning: Applying Empirical Data to Symbolic Knowledge

ISBN: 3-933218-09-8
173 pages
price: € 13


In natural language parsing, the number of syntactically ambiguous situations inevitably grows with the coverage of the grammar. Most broad-coverage applications therefore use some supplementary mechanism to estimate the relative probabilities of competing (partial) analyses. In this thesis, I propose corpus-based parse pruning: a database of probabilistically weighted, multi-level constituent structures is generated from a stratificational German corpus and used as a backbone for a broad-coverage dependency grammar (Slot Grammar). This pruning approach yields high-quality parsing results. An extensive evaluation of the syntactic variety in the training corpus, together with a series of experiments on the quantity and quality of the constituent structures used for pruning, gives further insight into the criteria that help a language model become representative and dynamically adaptable: corpus size, a multi-purpose annotation scheme, and a wide variety of authors.
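
The general idea of pruning with corpus-derived weights can be illustrated with a minimal sketch. It is not the implementation described in the thesis: the toy data, the relative-frequency weighting, the threshold, and the names `observed`, `build_database`, and `prune` are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical toy "corpus": each entry is one observed constituent structure,
# written as (category, tuple of child categories). Invented for illustration.
observed = [
    ("NP", ("ART", "NN")),
    ("NP", ("ART", "NN")),
    ("NP", ("ART", "ADJA", "NN")),
    ("PP", ("APPR", "NP")),
]


def build_database(structures):
    """Weight each structure by its relative frequency within its category.

    Assumption: plain relative frequency stands in for whatever weighting
    the thesis actually uses.
    """
    counts = Counter(structures)
    totals = Counter(category for category, _ in structures)
    return {s: n / totals[s[0]] for s, n in counts.items()}


def prune(candidates, database, threshold=0.1):
    """Drop candidates weighted below the threshold; return the rest, best first."""
    scored = [(database.get(c, 0.0), c) for c in candidates]
    kept = [(w, c) for w, c in scored if w >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in kept]


database = build_database(observed)
candidates = [
    ("NP", ("ART", "ADJA", "NN")),
    ("NP", ("NN", "ART")),  # never observed, so its weight is 0.0
    ("NP", ("ART", "NN")),
]
print(prune(candidates, database))
```

In this sketch, a candidate structure that never occurs in the toy corpus receives weight 0.0 and is discarded, while the remaining candidates are ranked by how often they were observed for their category.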