DFKI-LT - From UBGs to CFGs A Practical Corpus-Driven Approach
From UBGs to CFGs A Practical Corpus-Driven Approach
1 Natural Language Engineering volume 13 number 4,
Published online in April 2006
We present a simple and intuitive unsound corpus-driven approximation method for turning unification-based grammars, such as HPSG, CLE, or PATR-II into context-free grammars (CFGs). Our research is motivated by the idea that we can exploit (large-scale), hand-written unification grammars not only for the purpose of describing natural language and obtaining a syntactic structure (and perhaps a semantic form), but also to address several other very practical topics. Firstly, to speed up deep parsing by having a cheap recognition pre-filter (the approximated CFG). Secondly, to obtain an indirect stochastic parsing model for the unification grammar through a trained PCFG, obtained from the approximated CFG. This gives us an efficient disambiguation model for the unification-based grammar. Thirdly, to generate domain-specific subgrammars for application areas such as information extraction or question answering. And finally, to compile context-free language models which assist the acoustic model of a speech recognizer. The approximation method is unsound in that it does not generate a CFG whose language is a true superset of the language accepted by the original unification-based grammar. It is a corpus-driven method in that it relies on a corpus of parsed sentences and generates broader CFGs when given more input samples. Our open approach can be fine-tuned in different directions, allowing us to monotonically come close to the original parse trees by shifting more information into the context-free symbols. The approach has been fully implemented in JAVA.