Fundamentally, question answering systems are designed for
automatically responding to queries posed by users in natural language.
The first step in the answering process is query analysis, whose
goal is to classify the query according to a set of pre-specified
types. Traditionally, these classes include factoid, definition, and list.
Systems then choose the answering method in accordance with
the class recognised in this early phase.
This thesis focuses exclusively on strategies
to tackle definition questions
(e.g., ``Who is Ben Bernanke?'').
This sort of question has become especially interesting in recent years,
owing to the large number of such questions submitted to search engines.
Most advances in definition question answering have been made
under the umbrella of the Text REtrieval Conference (TREC).
More precisely, TREC provides a framework
for testing systems operating on a collection of news articles.
Chapter one describes this framework
and presents additional introductory aspects of
definition question answering, including:
(a) how definition questions are prompted by individuals;
(b) the different conceptions of definition, and thus of answers;
and (c) the various metrics exploited for assessing systems.
Since the inception of TREC, systems have put manifold approaches
to discovering answers to the test, shedding light on
several key aspects of this problem.
Accordingly, chapter four reviews
a selection of notable TREC systems.
This selection is not aimed at completeness, but rather
at highlighting the leading features of these systems.
For the most part, systems benefit from knowledge bases (e.g., Wikipedia) for
obtaining descriptions of the concept being
defined (a.k.a. definiendum).
These descriptions are thereafter
projected onto the array of candidate answers
as a means of discerning the correct answer. In other words, these
knowledge bases play the role of annotated resources, and most systems
search the collection of news articles
for the answer candidates
that are most similar to these descriptions.
The cornerstone of this thesis is the assumption that it is plausible
to devise competitive, and hopefully better, systems without the
need for annotated resources. Although this descriptive
knowledge is helpful, the author believes that projection-based
approaches are built on two wrong premises:
First, it is debatable whether the senses or contexts related to the
definiendum across knowledge bases are the same
senses or contexts as those of its instances across the array of answer candidates.
This observation also extends to the fact that not all descriptions
within the group of putative answers are necessarily covered by
knowledge bases, even though they might refer to the same contexts or senses.
Second, finding an efficient projection strategy does not necessarily
entail a good procedure for discerning descriptive knowledge, because
it shifts the goal of the task to finding ``more like this set'' instead of
analysing whether or not each candidate bears the characteristics of a
description.
In other words,
the coverage given by knowledge bases for a specific
definiendum is not wide enough for systems to learn all the
characteristics that typify its descriptions
and thereby identify all answers within
the set of candidates. From another angle, a conventional projection methodology can be seen
as a finder of lexical analogies.
All in all, this thesis investigates models that disregard
this kind of annotated resource and projection strategy.
In effect, the author believes
that a robust technique of this sort can be integrated with
projection methodologies, thereby bringing about an enhancement in performance.
The major contributions of this thesis are presented in chapters five, six and seven.
There are several ways of understanding this structure.
For example, chapter five
presents a general framework for answering
definition questions in several languages.
The primary goal of this study is to design a lightweight
definition question answering system operating on web-snippets
in two languages: English and Spanish.
The idea is to utilise web-snippets as a source of descriptive information
in several languages, and the high degree of language independence
is achieved by relying on as
little linguistic knowledge as possible. More precisely,
this system draws on statistical methods and
a list of stop-words, as well as a set of
language-dependent definition patterns.
In detail, chapter five branches into two more specific studies.
The first study is essentially aimed at
capitalising on redundancy for detecting answers (e.g., word frequency counts across answer candidates).
Although this type of feature has been widely used by TREC
systems, this study focuses on its impact on different languages, and its
benefits when applied to web-snippets instead of a collection of news documents.
An additional motivation behind targeting web-snippets is the hope of studying
systems working on more heterogeneous corpora, without the need
to download full documents.
For instance, on the Internet, the number of distinct senses for
the definiendum increases considerably,
thus making it necessary to consider a sense discrimination technique.
For this purpose, the system presented in this chapter takes advantage of an
unsupervised approach premised
on Latent Semantic Analysis.
Although the outcome of this study shows that
sense discrimination is hard to achieve when operating solely on
web-snippets, it also reveals that they are a fruitful source of descriptive
knowledge, and that their extraction poses exciting challenges.
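
The redundancy idea behind the first study (word frequency counts across answer candidates) can be illustrated with a minimal sketch. It is not the system described in chapter five: the stop-word list, the tokeniser, and the names `tokenize` and `rank_by_redundancy` are assumptions made only for illustration, and the score is simply the average number of candidates that share each content word.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "in", "was", "to"}


def tokenize(text):
    """Lower-case the text and keep alphanumeric tokens that are not stop-words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def rank_by_redundancy(candidates):
    """Rank candidates by how often their content words recur across all candidates."""
    # Count each word once per candidate, so long candidates do not dominate.
    candidate_freq = Counter()
    for candidate in candidates:
        candidate_freq.update(set(tokenize(candidate)))

    def score(candidate):
        words = set(tokenize(candidate))
        if not words:
            return 0.0
        # Average, over the candidate's words, of the number of candidates containing them.
        return sum(candidate_freq[w] for w in words) / len(words)

    return sorted(candidates, key=score, reverse=True)
```

Under this scoring, a snippet whose vocabulary is echoed by other snippets about the same definiendum rises to the top, whereas an off-topic snippet sharing no content words with the rest falls to the bottom.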
The second branch extends this first study by exploiting multilingual
knowledge bases (i.e., Wikipedia) for ranking putative answers.
Generally speaking, it makes use of word association norms deduced from
sentences that match definition patterns across Wikipedia.
In order to adhere to the premise of not profiting from articles
related to a specific definiendum, these sentences are anonymised
by replacing the concept with a placeholder, and the word norms
are learnt from all training sentences, instead of only from
the Wikipedia page about the particular definiendum.
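
The anonymisation step and the learning of association norms from all training sentences can be sketched as follows. This is a hedged illustration rather than the method of chapter five: the placeholder token, the sentence-level co-occurrence window, and the use of pointwise mutual information as the association measure are all assumptions, as are the function names.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Hypothetical placeholder token standing in for any definiendum.
PLACEHOLDER = "__concept__"


def anonymise(sentence, definiendum):
    """Replace every mention of the definiendum with the placeholder."""
    return re.sub(re.escape(definiendum), PLACEHOLDER, sentence, flags=re.IGNORECASE)


def tokenize(sentence):
    """Lower-case the sentence and split it into word tokens (underscores kept)."""
    return re.findall(r"[a-z0-9_]+", sentence.lower())


def association_norms(sentences):
    """Pointwise mutual information (base 2) for word pairs co-occurring in a sentence."""
    word_count = Counter()
    pair_count = Counter()
    for sentence in sentences:
        words = sorted(set(tokenize(sentence)))
        word_count.update(words)
        pair_count.update(combinations(words, 2))  # pairs are alphabetically ordered
    n = len(sentences)
    return {
        (w1, w2): math.log2((c / n) / ((word_count[w1] / n) * (word_count[w2] / n)))
        for (w1, w2), c in pair_count.items()
    }
```

Because the statistics are pooled over sentences about many different concepts, the resulting norms describe how definitions are worded in general and not what is known about one particular definiendum.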
The results of this study indicate that
this use of these resources can also be beneficial; in particular,
they reveal that word association norms are a cost-efficient solution.
However, the size of the corpus markedly decreases for languages other than
English, thus indicating that it is insufficient for designing models for other
languages.
Chapter six then narrows the focus to
the ranking of answer candidates in English.
The reason for abandoning Spanish is
the sparseness observed
in both the redundancy on the Internet and
the training material mined from Wikipedia.
This sparseness is considerably greater than in the case of English,
and it makes learning powerful statistical models more difficult.
This chapter presents a novel way of modelling
definitions grounded on n-gram language models
inferred from the lexicalised dependency tree representation
of the training material acquired in the study of chapter five.
These models are contextual in the sense that they are built in relation
to the semantics of the sentence.
Generally, these semantics can be perceived as
the distinct types of definienda
(e.g., footballer, language, artist, disease, and tree).
This study, in addition, investigates the effect of some features on these context models
(i.e., named entities, and part-of-speech tags).
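
To make the notion of a context language model more concrete, the sketch below trains one bigram model per context over word sequences that are assumed to be paths already extracted from lexicalised dependency trees by some parser. The bigram order, the add-one smoothing, the boundary symbols, and the function names are illustrative assumptions, not the configuration used in chapter six.

```python
import math
from collections import Counter

# Hypothetical boundary symbols marking the start and end of a path.
START, END = "<s>", "</s>"


def train_bigram_model(paths):
    """Collect bigram statistics from dependency paths given as lists of words."""
    history_count = Counter()
    bigram_count = Counter()
    vocabulary = {END}
    for path in paths:
        sequence = [START] + list(path) + [END]
        vocabulary.update(path)
        for previous, current in zip(sequence, sequence[1:]):
            history_count[previous] += 1
            bigram_count[(previous, current)] += 1
    return history_count, bigram_count, len(vocabulary)


def log_prob(path, model):
    """Log-probability of a path under the model, with add-one smoothing."""
    history_count, bigram_count, vocabulary_size = model
    sequence = [START] + list(path) + [END]
    return sum(
        math.log((bigram_count[(p, c)] + 1) / (history_count[p] + vocabulary_size))
        for p, c in zip(sequence, sequence[1:])
    )


def train_context_models(paths_by_context):
    """Train one model per context, e.g. one per type of definiendum."""
    return {ctx: train_bigram_model(paths) for ctx, paths in paths_by_context.items()}
```

A candidate sentence would then be scored with the model of the context it is assigned to, so that wording typical of, say, footballers is not judged against statistics learnt from descriptions of diseases.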
Overall, the results obtained by this approach are encouraging, in particular
in terms of increasing the accuracy of the pattern matching. However,
it was experimentally observed that a training corpus
comprising only positive examples (descriptions) is
not enough to achieve perfect accuracy, because these models
cannot deduce the characteristics that typify
More importantly, as future work, context models offer the chance to study how different
contexts can be amalgamated (smoothed) according to their
semantic similarities in order to improve performance.
Subsequently, chapter seven becomes even more specific and searches
for the set of properties that can aid in discriminating
descriptions from other kinds of texts.
Note that this study considers all kinds of descriptions, including
those that do not match definition patterns.
To this end, Maximum Entropy models are constructed
on top of an automatically acquired large-scale training corpus,
which encompasses descriptions from Wikipedia and
non-descriptions from the Internet.
Roughly speaking, different models are constructed as a means of
studying the impact of assorted properties: surface features,
named entities, part-of-speech tags, chunks, and, more interestingly,
attributes derived from the lexicalised dependency graphs.
In general, the results corroborate the effectiveness of features taken from
dependency graphs, especially the root node and n-gram paths. Experiments
conducted on test sets with various characteristics suggest that it is also
plausible to find attributes that carry over to other corpora.
Chapters two and three are supplementary chapters.
The former examines different strategies to trawl the Web for
descriptive knowledge. In essence, this chapter touches on several
strategies geared towards boosting the recall of descriptive sentences across
web snippets, especially sentences that match widespread definition patterns.
This is a side study, yet it is instrumental to the core of this thesis, as
systems targeted at the Internet need to
develop effective crawling techniques. In contrast,
chapter three has two goals: (a) presenting some components used by
the strategies outlined in the last three chapters, so that
those chapters can focus on the key aspects of their ranking methodologies;
and (b) fleshing out some characteristics that
make it difficult to separate the genuine from the misleading answer candidates,
particularly across sentences matching definition patterns. Chapter three is helpful
for understanding part of the linguistic phenomena that the later chapters
deal with.
On a final note about the organisation of this thesis,
since there is a myriad of techniques,
chapters six and seven
begin by dissecting the related work closest to each strategy.
The main contributions of these chapters begin at sections 6.5 and 7.6, respectively.
These two sections start with a discussion and comparison between the proposed
methods and the related work presented in their corresponding preceding sections.
This organisation is directed at facilitating the contextualisation of the
proposed approaches as there are different question answering systems with