Software project: Data-driven web-based Question Answering
Summer semester 2007, supervised by PD Dr. Günter Neumann


NEWS: Tiger-based POS-tagger & Chunker for German using opennlp tools (see here for more details)

The goal of the software project is to develop a hybrid Named Entity Recognizer (NER) that consists of a set of different machine-learning-based NER engines and a META-NER. The task of the META-NER is to combine the results of the individual NER engines. More precisely, the goals of the software project are:

Here is a blueprint of the META-NER engine:



The input is a set of text documents and the output is a set of corresponding documents in which all recognized Named Entities are marked with the tag ENAMEX. The META-NER receives the identified NEs from the following specific NER engines:

- GNR: a seed-based recognizer for NEs, which was developed at DFKI as part of a Master's thesis
- BiQue-NER: a NER based on Collins and Singer, which was also developed at DFKI.
- CRF-based NER: a NER based on Conditional Random Fields.
- OpenNLP NER: the NER engine of the opennlp project
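
The output format described above (documents in which recognized Named Entities are marked with ENAMEX) can be illustrated with a minimal sketch. It assumes that an engine reports entities as character offsets plus a type, and that the type is written into a MUC-style TYPE attribute; the function name tag_enamex and the type labels are hypothetical and not part of any of the engines listed above.

```python
from xml.sax.saxutils import escape


def tag_enamex(text, entities):
    """Wrap each (start, end, ne_type) character span in an ENAMEX element.

    Assumption: spans are end-exclusive character offsets into `text`.
    Spans that overlap an earlier span are skipped.
    """
    out = []
    pos = 0
    for start, end, ne_type in sorted(entities):
        if start < pos:
            continue  # overlaps a span that was already emitted
        out.append(escape(text[pos:start]))
        out.append('<ENAMEX TYPE="%s">%s</ENAMEX>'
                   % (ne_type, escape(text[start:end])))
        pos = end
    out.append(escape(text[pos:]))
    return "".join(out)


if __name__ == "__main__":
    sentence = "Christian Spurk developed GNR at DFKI."
    spans = [(0, 15, "PERSON"), (33, 37, "ORGANIZATION")]
    print(tag_enamex(sentence, spans))
```

Text outside the entity spans is XML-escaped so that the tagged document stays well-formed even if the input contains characters such as & or <.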

During the software project, these NER engines have to be extended and adapted to run on English and German, and on the foreseen domains.

Additionally, we will integrate:

Corpora to be used:

Now, some more information, links and references about the specific NER engines:

GNR:

A NER for recognizing generalized Named Entities, which was developed by Christian Spurk as part of his Master's thesis.
Web page of GNR (includes: software package, API, relevant literature, etc.)
Christian's Master's thesis (in German)
Christian's slide presentation ("How to work with GNR?")

Tasks to do: adapt to new corpora formats, adapt to German; training and testing
Responsible students: Urbain Atsague Tchino and Youness El Ouafy

BiQue-NER:

A NER based on the bootstrapping version of Collins and Singer. BiQue-NER has already been extended in the following directions:
Tasks to do: adapt to new corpora formats; adapt to chunker of opennlp tools; adapt to German; performance check of training module
Responsible students: Margarita Kovacheva and Lourdes Lara Tapia


CRF-based NER

A NER based on Conditional Random Fields (CRFs) using the machine learning toolkit Mallet. CRFs are already widely used for NER, but the purpose of the software project is to learn how to develop, test and evaluate advanced self-made software. Note that the CRF-based NER will be a supervised one, so careful checking of the corpus format and annotation is important.

Tasks to do: adapt to the used corpora formats; adapt to POS tagger or chunker of the Mallet tools or opennlp tools; adapt to German.
Responsible students: Benjamin Roth and Joo-Eun Feit

OpenNLP NER

The starting point here is the NER of the opennlp tools. This NER already comes with a model for English.

Tasks to do: adapt to the used corpora formats; adapt to German.
Responsible students: Ekaterina Biehl, Kirill Slutsky and Alexander Volokh

Tiger-based POS-tagger available

The group has developed a first version of a POS tagger & Chunker for German based on the Tiger corpus. It can be downloaded from http://alexander.volokh.de/maps/projekt.zip
Note that the performance of both tools can still be improved and will be, so check regularly for new updates.


META-NER:
This will probably not be created during this software project, but by one of my Master's students. We will probably use a stacking approach, cf. "Issues in Stacked Generalization" by Ting and Witten, 1998.
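
To give a rough idea of what stacking could mean for the META-NER, here is a deliberately simplified sketch. It is not the planned implementation: the level-1 learner is just a lookup table that maps each combination of base-engine labels to the gold label seen most often with it, and it falls back to a majority vote for unseen combinations. The function names, the label set and the toy data are all hypothetical. In a real stacking setup, the level-1 training data would come from predictions the base engines make on held-out data, and a stronger level-1 learner over class probabilities would be a natural choice.

```python
from collections import Counter, defaultdict


def train_meta(base_predictions, gold_labels):
    """Learn a level-1 table: tuple of base-engine labels -> most frequent gold label."""
    counts = defaultdict(Counter)
    for preds, gold in zip(base_predictions, gold_labels):
        counts[tuple(preds)][gold] += 1
    return {preds: c.most_common(1)[0][0] for preds, c in counts.items()}


def majority_vote(preds):
    """Fallback combiner: the label proposed by most base engines."""
    return Counter(preds).most_common(1)[0][0]


def predict_meta(model, preds):
    """Use the learned table if the combination was seen, else a majority vote."""
    key = tuple(preds)
    if key in model:
        return model[key]
    return majority_vote(preds)


if __name__ == "__main__":
    # One row per token; columns are the labels from four base engines (toy data).
    held_out_preds = [
        ("PER", "O", "PER", "PER"),
        ("O", "ORG", "O", "O"),
        ("O", "ORG", "O", "O"),
        ("O", "O", "O", "O"),
    ]
    gold = ["PER", "ORG", "ORG", "O"]
    model = train_meta(held_out_preds, gold)
    # The table has learned to trust the second engine here, against the majority.
    print(predict_meta(model, ("O", "ORG", "O", "O")))
```

The point of the sketch is that a learned combiner can pick up which engine is reliable in which situation, which a plain majority vote cannot.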


Some more information about Machine Learning for Named Entity Recognition: