DFKI-LT - Dissertation Series

Vol. XXIV

Fei-Yu Xu : Bootstrapping Relation Extraction from Semantic Seeds

ISBN: 978-3-933218-23-0
159 pages
price: € 15


Information Extraction (IE) is a technology for localizing and classifying pieces of relevant information in unstructured natural language texts and for detecting relevant relations among them. This thesis deals with one of the central tasks of IE, namely relation extraction. The goal is to provide a general framework that automatically learns mappings between linguistic analyses and target semantic relations with minimal human intervention. Furthermore, this framework is intended to support adaptation to new application domains and to new relations of varying complexity.

The central result is a new approach to relation extraction based on a minimally supervised method for automatically learning extraction grammars from a large collection of parsed texts. Learning is initialized by a few instances of the target relation, called the semantic seed. Owing to the semantic seed approach, the framework can accommodate new relation types and domains with minimal effort. It supports relations of different arity as well as their projections. Furthermore, the framework is general enough to employ any linguistic analysis tool that provides the required type and depth of analysis.
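The general idea of seed-driven bootstrapping can be illustrated with a minimal sketch. It is not the DARE system: it works on plain strings instead of parsed texts, handles only binary relations, and treats the text between two arguments as a "rule". The function name `bootstrap`, the regular expression `ARG`, and the example sentences are assumptions made for illustration only.

```python
import re

# Assumption: an argument is a sequence of capitalized words.
ARG = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

def bootstrap(sentences, seeds, max_iterations=5):
    """Toy bootstrapping loop: seed instances -> patterns -> new instances."""
    known = set(seeds)
    patterns = set()
    for _ in range(max_iterations):
        # Step 1: learn patterns from sentences that mention a known instance.
        for sentence in sentences:
            for first, second in known:
                match = re.search(
                    re.escape(first) + "(.+?)" + re.escape(second), sentence
                )
                if match:
                    patterns.add(match.group(1))
        # Step 2: apply all patterns to extract further instances.
        extracted = set()
        for sentence in sentences:
            for middle in patterns:
                regex = "(" + ARG + ")" + re.escape(middle) + "(" + ARG + ")"
                for match in re.finditer(regex, sentence):
                    extracted.add((match.group(1), match.group(2)))
        # Stop once an iteration yields nothing new.
        if extracted <= known:
            break
        known |= extracted
    return known, patterns

sentences = [
    "Marie Curie won the Nobel Prize in 1903.",
    "Albert Einstein won the Nobel Prize in 1921.",
    "Albert Einstein was awarded the Nobel Prize.",
    "Toni Morrison was awarded the Pulitzer Prize.",
]
instances, patterns = bootstrap(sentences, {("Marie Curie", "Nobel Prize")})
```

Starting from a single seed, the first iteration learns one pattern and finds a second instance; that instance then yields a second pattern, which in turn finds a third instance. A realistic system would also need to rank and filter the acquired rules and instances, which this sketch omits.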

The adaptability and scalability of the framework are facilitated by the DARE rule representation model, which is recursive and compositional. In comparison to other IE rule representation models, e.g., Stevenson and Greenwood (2006), the DARE rule representation model is expressive enough to achieve good coverage of the linguistic constructions in which mentions of the target relation occur. DARE rules are constructed via a bottom-up, compositional rule discovery strategy driven by the semantic seed. The quality of newly acquired knowledge during the bootstrapping process is controlled through a ranking and filtering strategy that takes two aspects into account: domain relevance and the trustworthiness of the origin. A special algorithm is developed for the induction and generalization of the DARE rules. Since DARE also takes the projections of the target relation and the interactions among them into account, it opens new perspectives for improving the recall and reusability of the learned rules.

Various evaluations are conducted that provide insights into the applicability, the potential and the limitations of the DARE framework. A comparison of different data setups, varying the size of the semantic seed, the data size and the data source, shows that data properties play an important role in the success of DARE. Furthermore, the evaluation confirms our earlier findings on the influence of proper seed construction on system performance. The detailed qualitative analysis of the DARE system output encourages us to integrate richer, high-quality linguistic processing, including discourse analysis.