The project aims to design, implement, investigate, and evaluate a new system architecture that facilitates the combination of different language technologies (LT) for a range of practical applications. Language technologies offer numerous means for the partial analysis of texts, which can be employed for information retrieval, information extraction, language checking, and many other applications. Processing methods and tools differ along several dimensions, e.g., the level of linguistic description, the depth of analysis, or the way knowledge of language is derived (linguistically or statistically). Methods often overlap in functionality but differ in their strengths and weaknesses. Finding optimal combinations of heterogeneous techniques and processing components is one of the most difficult tasks in language processing, and it is the challenge the WHITEBOARD project addresses.

The novel architecture to be developed and explored in WHITEBOARD is based on the concept of an annotated text. The different LT components enrich an XML-encoded text with layers of new meta-information that are also represented in XML. Each component can exploit or disregard previously assigned annotations. The WHITEBOARD architecture has a single shared data structure that serves at once as the input, the working representation, and the output of the system. The envisaged architecture permits the pragmatic combination of different processing approaches, most notably novel combinations of shallow and deep methods.
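The annotated-text idea can be illustrated with a minimal sketch in Python. It is not the WHITEBOARD implementation: the element names (`document`, `text`, `layer`, `span`), the layer names, and the two toy components are hypothetical and chosen only for illustration. The sketch assumes stand-off annotation, where each layer stores character offsets into the unchanged base text, and it shows one component adding a layer while a second component either exploits that layer or does nothing when it is absent.

```python
import xml.etree.ElementTree as ET


def new_whiteboard(text):
    """Create the shared XML document: the base text, with no layers yet."""
    root = ET.Element("document")
    ET.SubElement(root, "text").text = text
    return root


def add_layer(root, name):
    """Append an empty, named annotation layer to the shared document."""
    return ET.SubElement(root, "layer", {"name": name})


def tokenizer(root):
    """Toy shallow component: adds a 'token' layer of character spans."""
    text = root.find("text").text
    layer = add_layer(root, "token")
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        ET.SubElement(layer, "span", {"start": str(start), "end": str(end)})
        pos = end


def capitalized_marker(root):
    """Toy second component: reads the 'token' layer and adds its own layer.

    If no token layer exists, it leaves the document untouched.
    """
    text = root.find("text").text
    tokens = root.find("layer[@name='token']")
    if tokens is None:
        return
    layer = add_layer(root, "capitalized")
    for span in tokens.findall("span"):
        start = int(span.get("start"))
        if text[start].isupper():
            # Copy the token's offsets into the new layer.
            ET.SubElement(layer, "span", dict(span.attrib))


if __name__ == "__main__":
    doc = new_whiteboard("WHITEBOARD combines shallow and deep methods")
    tokenizer(doc)
    capitalized_marker(doc)
    print(ET.tostring(doc, encoding="unicode"))
```

Because every component reads from and writes to the same document, the order and selection of components can be varied without changing any interface between them, which is the property the architecture relies on for combining shallow and deep processing.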
- WHITEBOARD will be built on top of existing DFKI LT components: the morphological processing system MORPHIX, the tagger TnT and the phrase parser Chunkie, the information extraction system SMES, the efficient HPSG parsing system PET, HPSG grammars for German, English (Stanford's LinGO grammar), and Japanese, and the controlled language checking system FLAG.
- Two applications are realized to evaluate and demonstrate the results. The first application is information extraction (IE). As the automatic understanding of entire texts will remain out of reach for quite some time, the strategy for approaching this goal is the gradual improvement of our IE technology.
- The second application is controlled language checking. Here again, we cannot expect today's technology to deliver a comprehensive and correct analysis of an entire text. We might be able, however, to specialize our deep analysis so that it can be applied with sufficient precision in certain environments that are critical for the correct diagnosis and correction of errors.