Knowledge-Intensive, High-Performance Relation Extraction

Sebastian Krause
PhD thesis, TU Berlin, 1/2018.


Research on information extraction (IE) from texts has attracted much attention for at least the past two decades. This is not surprising given its significance for applications such as personal digital assistants. Information extraction and its subtask relation extraction play a central role in data processing pipelines that make hidden knowledge, such as the content of news articles, available to downstream users. This thesis presents four main contributions to important questions of this research field. The first two contributions deal with various aspects of the automatic discovery of linguistic patterns, which we use for the detection of relations. We initially look at scenarios with predefined relations of interest. Here, state-of-the-art methods employ simplistic assumptions at training time, which has a drastic negative effect on both precision and coverage. We propose methods for the production and filtering of patterns that mitigate this shortcoming by leveraging existing knowledge about the target domains. Next, we address scenarios without a priori relation definitions. Here, the produced linguistic patterns need to be disambiguated to resolve their meaning. This is particularly hard for patterns in the long tail, which tend to be misinterpreted. Our proposed solution is a global model that can generalize over many pattern occurrences and thus manages to handle rare patterns as well. The third contribution of this thesis focuses on the versatility of linguistic patterns beyond their designated use for extraction purposes. The patterns convey interesting information about the actual usage of language expressions, which is exactly what is missing in the current landscape of IE-relevant resources. More specifically, the relational information from world-knowledge graphs is not at all grounded in the language information present in lexical-semantic resources.
We aim to remedy this deficit by proposing a construction methodology for a new kind of resource that is created by transforming many linguistic patterns into a single graph of language expressions. Finally, in the fourth contribution, we consider a fundamental shortcoming in the construction of systems for relation extraction, be they based on linguistic patterns or a different methodology. This flaw is the invalid premise that relational information is mostly contained within the boundaries of individual sentences. We initially address this problem with an analysis of its severity and follow up by designing an approach that can easily be used to post-process the output of existing extraction systems, allowing them to produce cross-sentence relation mentions and thereby resolving the design flaw.


