DFKI-LT - Dissertation Series

Vol. XXIII

Gerhard Fliedner: Linguistically Informed Question Answering

ISBN: 3-933218-22-3
462 pages
price: € 24

Question Answering has recently received lively interest from researchers in information sciences. In particular, the establishment of shared evaluation competitions has attracted a lot of attention (Voorhees and Dang, 2006; Magnini et al., 2006).

A question answering (QA) system takes questions in natural language as its input, searches for suitable answers in large document collections and presents a comprehensive answer to the user (Maybury, 2004b).

Question answering is often seen as a special case of information retrieval and most QA systems are based upon methods from information retrieval and use only little linguistic information (Hirschman and Gaizauskas, 2001).

Question Answering systems therefore often have difficulties in reliably pinpointing answers. These difficulties typically lead to low answer precision (i.e., a high proportion of the system's answers are wrong) and a dependence on answer redundancy (i.e., the systems must hit upon an answer several times in different formulations).

We suggest that looking at QA as a problem of linguistics, and especially computational linguistics, opens up interesting new perspectives for usable QA systems and provides solutions for the problems mentioned above.

We accordingly take work on questions and answers in linguistics as a starting point and explore phenomena associated with questions and answers in human interaction. We find that syntactic, semantic and pragmatic aspects play a role in describing answerhood (i.e., the relation that holds between a question and its answer). We note that work on questions and answers in linguistics is concerned with describing and explaining the relation of answerhood between a given question and a given answer. For finding answers in document collections, this view is not well-suited, as it is often necessary to derive the answer from the text through a number of additional inferencing steps, specifically when there are differences in wording between question and text. From the linguistic concept of indirect answerhood (Higginbotham and May, 1981), we derive indirect answerhood as a key concept for linguistically informed QA: Indirect answerhood is a relation between a text and a question that holds exactly if an answer to the question can be inferred from the text.

One obvious approach to automatic question answering would therefore be to derive semantic representations from all texts in the document collection and from the users' questions and employ automated reasoning methods to infer answers from the information in the document collection. However, this approach fails as a basis for practical applications for a number of reasons. The two most important issues are that it is currently not possible to derive full semantic representations from general texts and that knowledge bases that would be needed as a source of inferences do not provide adequate coverage (cf. 3.3).

We therefore specify a more shallow approximation of indirect answerhood. This approximation is based on the matching of syntactic dependency tree structures extended with lexical semantic information: If a match can be found between a question representation and an answer representation, answerhood is assumed to hold between the respective question and answer. We complement this matching with inference rules, expressed as relabellings of the linguistic structures.
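The matching idea described above can be illustrated with a minimal sketch. It is not the system's implementation: the `Node` class, the `"?"` placeholder for the question word, the `RELABEL` table and the example sentence are all hypothetical and stand in for dependency structures and for inference rules derived from lexical resources.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    lemma: str
    rel: str = "root"  # dependency relation to the parent node
    children: List["Node"] = field(default_factory=list)


# Hypothetical relabelling rules: question lemma -> lemmas accepted in the text.
RELABEL = {"buy": {"buy", "purchase", "acquire"}}


def labels_match(q: Node, t: Node) -> bool:
    """Compare relation and lemma, allowing relabelled lemmas."""
    if q.rel != t.rel:
        return False
    if q.lemma == "?":  # the question word matches any lemma
        return True
    return t.lemma in RELABEL.get(q.lemma, {q.lemma})


def match(q: Node, t: Node) -> Optional[dict]:
    """Return bindings for "?" if q matches at t, otherwise None."""
    if not labels_match(q, t):
        return None
    bindings = {"?": t.lemma} if q.lemma == "?" else {}
    return assign(q.children, t.children, bindings)


def assign(q_kids: List[Node], t_kids: List[Node], bindings: dict) -> Optional[dict]:
    """Map each question child to a distinct text child (simple backtracking)."""
    if not q_kids:
        return bindings
    first, rest = q_kids[0], q_kids[1:]
    for i, cand in enumerate(t_kids):
        sub = match(first, cand)
        if sub is not None:
            remaining = t_kids[:i] + t_kids[i + 1:]
            result = assign(rest, remaining, {**bindings, **sub})
            if result is not None:
                return result
    return None


# Text: "Lee purchased a car yesterday."  Question: "Who bought a car?"
text = Node("purchase", children=[
    Node("Lee", "subj"), Node("car", "obj"), Node("yesterday", "mod")])
question = Node("buy", children=[Node("?", "subj"), Node("car", "obj")])
answer = match(question, text)
```

In this sketch the text may contain material that the question does not mention (here the modifier), while every node of the question must find a counterpart in the text. The relabelling table is the only place where lexical knowledge enters, which mirrors the modular role of the inference rules.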

This specification of the inference rules makes it easy to integrate information from arbitrary sources in a modular fashion. We currently draw chiefly on two linguistic resources, namely GermaNet (the German version of WordNet) and FrameNet, as sources from which inference rules are derived.

In order to arrive at a practically usable search algorithm from the notion of tree matching, we develop an algorithm that is based on an efficient algorithm for unordered path inclusion (Kilpeläinen, 1992, O(n^(5/2))). We modify the algorithm so that it combines search and inferences in one interleaved process.
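As a rough illustration of unordered path inclusion, the following sketch checks whether a pattern tree can be embedded in a text tree so that labels and parent-child edges are preserved and sibling order is ignored. It is a simplified stand-in, not the cited algorithm: trees are plain `(label, children)` tuples, the helper names are hypothetical, and the child assignment uses simple augmenting-path bipartite matching without any attempt to reach the cited complexity bound or to interleave inferences.

```python
def tree(label, *kids):
    """Build a tree as a (label, children) tuple."""
    return (label, kids)


def embeds_at(p, t) -> bool:
    """True if pattern p maps onto t root-to-root with children mapped injectively."""
    p_label, p_kids = p
    t_label, t_kids = t
    if p_label != t_label or len(p_kids) > len(t_kids):
        return False

    # Bipartite graph: pattern child i may be mapped to any text child in compatible[i].
    compatible = [
        [j for j, tk in enumerate(t_kids) if embeds_at(pk, tk)]
        for pk in p_kids
    ]
    matched_to = {}  # text child index -> pattern child index

    def augment(i, seen):
        """Try to place pattern child i, re-routing earlier placements if needed."""
        for j in compatible[i]:
            if j in seen:
                continue
            seen.add(j)
            if j not in matched_to or augment(matched_to[j], seen):
                matched_to[j] = i
                return True
        return False

    return all(augment(i, set()) for i in range(len(p_kids)))


def path_included(p, t) -> bool:
    """True if p embeds at t or at any descendant of t."""
    if embeds_at(p, t):
        return True
    return any(path_included(p, kid) for kid in t[1])


pattern = tree("a", tree("b"), tree("b", tree("c")))
text = tree("s", tree("a", tree("b", tree("c")), tree("b"), tree("d")))
found = path_included(pattern, text)
```

The example is chosen so that a greedy left-to-right assignment of children would fail: the bare `b` of the pattern first claims the text's `b` with child `c`, and the matching step has to re-route it to the other `b`.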

We further adapt the algorithm so that it can be used with a standard relational database system to allow storage of large document collections and to speed up retrieval.
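How a structure lookup might be delegated to a relational database can be sketched with Python's built-in `sqlite3` module. The schema, the index choice, the `who` helper and the sample rows are assumptions made for illustration only; the abstract does not specify the database layout used by the system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE node (
           id INTEGER PRIMARY KEY,
           sentence_id INTEGER,
           parent_id INTEGER,   -- NULL for the sentence root
           rel TEXT,            -- dependency relation to the parent
           lemma TEXT)"""
)
# Indexes let the database narrow down candidate nodes before joining.
conn.execute("CREATE INDEX node_lemma ON node (lemma)")
conn.execute("CREATE INDEX node_parent ON node (parent_id, rel)")

# Hypothetical sample data: "Lee purchased a car." / "Kim sold a car."
rows = [
    (1, 1, None, "root", "purchase"),
    (2, 1, 1, "subj", "Lee"),
    (3, 1, 1, "obj", "car"),
    (4, 2, None, "root", "sell"),
    (5, 2, 4, "subj", "Kim"),
    (6, 2, 4, "obj", "car"),
]
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?, ?)", rows)


def who(verb_lemmas, object_lemma):
    """Return subjects of verbs (any of verb_lemmas) whose object is object_lemma."""
    placeholders = ", ".join("?" for _ in verb_lemmas)
    sql = f"""
        SELECT subj.lemma
        FROM node AS verb
        JOIN node AS subj ON subj.parent_id = verb.id AND subj.rel = 'subj'
        JOIN node AS obj  ON obj.parent_id  = verb.id AND obj.rel  = 'obj'
        WHERE verb.lemma IN ({placeholders}) AND obj.lemma = ?
        ORDER BY subj.id"""
    return [row[0] for row in conn.execute(sql, [*verb_lemmas, object_lemma])]


# "Who bought a car?" with "buy" expanded to an alternative lemma.
buyers = who(["buy", "purchase"], "car")
```

Each parent-child edge of the question becomes one self-join on the node table, and alternative lemmas produced by relabelling become an `IN` list, so the database indexes do most of the filtering.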

To show the practicability of our approach, we have implemented SQUIGGLI, a prototype QA system for German. The core of the system is a Natural Language Processing chain that derives syntactic dependency structures from German texts, enriches them with lexical semantic information and resolves anaphoric references. Using this processing chain, text representations are derived from a given document collection and stored in a relational database. The QA system itself uses the processing chain to translate questions into representations in the same format, searches for answers in the database and uses a generation module to present answers generated from the retrieved representations to the user. Additional features, like focussed answer justification and anaphora resolution in questions, provide a basis for interactive QA.

The results of an end-to-end evaluation showed that the system reaches an interesting level of combined performance: it correctly answered over 33% of the questions, with an answer precision of over 66%.

The search algorithm is quite efficient: The majority of the answers were found within a few seconds, even though searching is done using structured linguistic information over the representations of the full document collection.

We aim to show that linguistically informed QA forms an interesting new approach to QA and that it can advantageously be used as a basis of building user-friendly, interactive QA systems.