In: KDML 2010: Knowledge Discovery, Data Mining, and Machine Learning. GI-Workshop-Tage "Lernen, Wissen, Adaption" (LWA-2010), located at LWA, October 4-6, Kassel, Germany, Universität Kassel, 2010.
Abstract
The identification of noun groups in text is a well-researched
task and serves as a preprocessing step for other
natural language processing tasks, such as the extraction
of keyphrases or technical terms. We
present a first version of a noun group chunker
that, given an unannotated text corpus, adapts itself
to the domain at hand in an unsupervised
way. Our approach is inspired by findings from
cognitive linguistics, in particular the division of
language into open-class elements and closed-class
elements. Our system extracts noun groups
using lists of closed-class elements and one linguistically
inspired seed extraction rule for each
open class. Supplied with raw text, the system
creates an initial validation set for each open
class based on the seed rules and applies a bootstrapping
procedure to mutually expand the set of
extraction rules and the validation sets. Possibly
domain-dependent information about open-class
elements, as provided for example by a part-of-speech
lexicon, is not used by the system in order
to ensure the domain independence of the approach.
Instead, the system adapts itself automatically
to the domain of the input text by bootstrapping
domain-specific validation lists. An
evaluation of our system on the Wall Street Journal
training corpus used for the CoNLL-2000
shared task on chunking shows that our bootstrapping
approach can be successfully applied
to the task of noun group chunking.
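
The bootstrapping loop described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the closed-class list, the seed rule (an open-class word directly between a determiner and another closed-class word counts as a noun candidate), the rule form (a closed-class word as left context), and the `min_support` threshold are all assumptions chosen for illustration, and the sketch covers a single open class only.

```python
from collections import Counter

# Hypothetical closed-class lists; the system's real lists are not given here.
DETERMINERS = {"the", "a", "an"}
CLOSED_CLASS = DETERMINERS | {"of", "in", "on", "and", "is", "was", "to"}


def seed_nouns(tokens):
    """Assumed seed rule: determiner + open-class word + closed-class word."""
    found = set()
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        if prev in DETERMINERS and word not in CLOSED_CLASS and nxt in CLOSED_CLASS:
            found.add(word)
    return found


def bootstrap(tokens, rounds=5, min_support=2):
    """Mutually expand extraction rules and the validation set.

    A rule here is simply a closed-class word that precedes validated
    words at least `min_support` times (an illustrative simplification).
    """
    validation = seed_nouns(tokens)  # initial validation set from the seed rule
    rules = set()
    pairs = list(zip(tokens, tokens[1:]))
    for _ in range(rounds):
        # Step 1: learn rules from contexts of already validated words.
        support = Counter(
            prev for prev, word in pairs
            if word in validation and prev in CLOSED_CLASS
        )
        new_rules = {ctx for ctx, count in support.items() if count >= min_support}
        # Step 2: apply the rules to collect further open-class candidates.
        new_words = {
            word for prev, word in pairs
            if prev in new_rules and word not in CLOSED_CLASS
        }
        if new_rules <= rules and new_words <= validation:
            break  # nothing new was learned in this round
        rules |= new_rules
        validation |= new_words
    return rules, validation


text = ("the oil is up and the price of oil was in the news "
        "and the cost of oil is on the rise of gas")
rules, validation = bootstrap(text.split())
```

In this toy input the seed rule validates only words that follow a determiner. The loop then learns that "of" also precedes validated words and uses that context to add "gas", which the seed rule alone would miss.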