DFKI-LT - A Shallow Text Processing Core Engine

Günter Neumann, Jakub Piskorski
A Shallow Text Processing Core Engine
Journal of Computational Intelligence, volume 18, number 3, 2002
 
In this paper we present SPPC, a high-performance system for intelligent extraction of structured data from free text documents. SPPC consists of a set of domain-adaptive shallow core components that are realized by means of cascaded weighted finite state machines and generic dynamic tries. The system has been fully implemented for German; it includes morphological and on-line compound analysis, efficient POS-filtering, high-performance named entity recognition, and chunk parsing based on a novel divide-and-conquer strategy. The whole approach proved to be very useful for processing free word order languages like German. SPPC performs well (more than 6000 words per second on standard PC environments) and achieves high linguistic coverage, especially for the divide-and-conquer parsing strategy, where we obtained an f-measure of 87.14% on unseen data.
 
Files: BibTeX, compintell.pdf