DFKI-LT - Dissertation Series


Hagen Fürstenau: Semi-supervised Semantic Role Labeling via Graph Alignment

ISBN: 978-3-933218-31-5
208 pages
price: € 17

Semantic roles, which constitute a shallow form of meaning representation, have attracted increasing interest in recent years. Various applications have been shown to benefit from this level of semantic analysis, and a large number of publications have addressed the problem of semantic role labeling, i.e., the task of automatically identifying semantic roles in arbitrary sentences. A major limiting factor for these approaches, however, is the need for large manually labeled semantic resources to train semantic role labeling systems in the supervised learning paradigm. Consequently, the application of such systems is still limited to the small number of languages and domains for which sufficiently large semantic resources are available.

This thesis addresses the knowledge acquisition problem of semantic role labeling, i.e., the substantial annotation effort required for the creation of semantic resources that can be used to train state-of-the-art semantic role labeling systems.

Our main contribution is to formulate a semi-supervised approach to semantic role labeling, which requires only a small manually labeled corpus of role-annotated sentences. This initial seed corpus is augmented with annotation instances generated automatically from a large unlabeled corpus. The augmented corpus is used as training data for a supervised role labeler, to improve labeling performance over what can be attained when training on the manually labeled sentences alone. Our approach therefore reduces the annotation effort required to attain satisfactory performance and thus alleviates the knowledge acquisition problem, especially for languages and domains where the cost of annotating large semantic resources is prohibitive.

The key idea of our semi-supervised approach is to measure the similarity between labeled sentences from the manually annotated resource and sentences from a large unlabeled corpus. Similarity is conceptualized in terms of optimal graph alignments, which are employed to project annotations from labeled to unlabeled sentences. To select a set of novel training instances, similarity is operationalized as a measure of confidence, allowing us to limit the adverse effect of erroneous annotations. The optimization problem is formulated as an integer linear program and solved efficiently.
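
As a rough illustration of the alignment-and-projection idea, the following toy sketch aligns a labeled dependency graph with an unlabeled one and copies the role labels across. It is not the formulation used in the thesis: the scoring function, the weight `alpha`, the similarity table `lex_sim`, the threshold, and the example sentences are all assumptions made for this illustration, and an exhaustive search over injective node mappings stands in for the integer linear program.

```python
from itertools import permutations


def best_alignment(src_nodes, src_edges, tgt_nodes, tgt_edges, lex_sim, alpha=0.5):
    """Return (alignment, score) for the highest-scoring injective node mapping.

    Score = sum of lexical similarities of aligned node pairs
            + alpha * number of source edges preserved under the mapping.
    Brute force, so only feasible for toy graphs. Returns (None, -inf)
    when the target graph has fewer nodes than the source graph.
    """
    best_map, best_score = None, float("-inf")
    for image in permutations(tgt_nodes, len(src_nodes)):
        sigma = dict(zip(src_nodes, image))
        lexical = sum(lex_sim.get((s, t), 0.0) for s, t in sigma.items())
        structural = sum(
            1
            for head, dep, rel in src_edges
            if (sigma[head], sigma[dep], rel) in tgt_edges
        )
        candidate = lexical + alpha * structural
        if candidate > best_score:
            best_map, best_score = sigma, candidate
    return best_map, best_score


def project_roles(roles, sigma):
    """Copy each role label from a labeled node to the node it is aligned with."""
    return {sigma[node]: role for node, role in roles.items()}


# Labeled seed sentence (hypothetical): "Peter bought car"
src_nodes = ["Peter", "bought", "car"]
src_edges = {("bought", "Peter", "subj"), ("bought", "car", "obj")}
roles = {"Peter": "Buyer", "car": "Goods"}

# Unlabeled sentence (hypothetical): "Mary purchased bike yesterday"
tgt_nodes = ["Mary", "purchased", "bike", "yesterday"]
tgt_edges = {
    ("purchased", "Mary", "subj"),
    ("purchased", "bike", "obj"),
    ("purchased", "yesterday", "mod"),
}

# Made-up lexical similarities; unlisted pairs count as 0.0.
lex_sim = {
    ("bought", "purchased"): 1.0,
    ("Peter", "Mary"): 0.5,
    ("car", "bike"): 0.5,
}

sigma, score = best_alignment(src_nodes, src_edges, tgt_nodes, tgt_edges, lex_sim)

# Use the alignment score as a confidence value: keep the projected
# annotation only if it clears an (assumed) threshold.
THRESHOLD = 2.5
projected = project_roles(roles, sigma) if score >= THRESHOLD else {}
print(sigma, score, projected)
```

In this toy setting the alignment score doubles as the confidence measure: an unlabeled sentence whose best alignment falls below the threshold contributes no new training instance, which limits the number of erroneous projected annotations.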

The thesis broadly consists of two parts. In the theoretical part, our semi-supervised approach to semantic role labeling is described in detail. The empirical part then evaluates the effect of this method on various corpora extracted from existing semantic resources for English and German. These experiments show that the additional training data generated by our method can indeed improve the performance of a semantic role labeler and thus reduce annotation effort in practice.