NEWS: Tiger-based POS-tagger & Chunker for German using
opennlp tools (see here for
The goal of the software project is to develop a Named Entity
Recognizer (NER) that consists of a set of different
machine-learning-based NER engines and a META-NER. The task
of the META-NER is to combine the results of the individual
NER engines. More precisely, the goals of the software project are:
- Common platform for different
- Multilingual (DE and EN)
- Different corpora
- BioInformatics (Gene names, …)
- Meta control engine
- Combination of results from different
- Automatic error/bidding analysis as
basis for active learning
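As an illustration of how disagreement between several NER engines could feed active learning, here is a minimal sketch based on vote entropy. This is an assumption for illustration only: the project description does not say which selection criterion is used, and the function names `vote_entropy` and `rank_for_annotation` are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def vote_entropy(votes: Sequence[str]) -> float:
    """Entropy (in bits) of the labels that the engines assigned to one token."""
    counts = Counter(votes)
    n = len(votes)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def rank_for_annotation(token_votes: List[Sequence[str]]) -> List[int]:
    """Return token indices, most contested first (ties keep input order)."""
    return sorted(
        range(len(token_votes)),
        key=lambda i: vote_entropy(token_votes[i]),
        reverse=True,
    )


# Hypothetical votes of four engines on three tokens.
example_votes = [
    ("O", "O", "O", "O"),        # all engines agree
    ("PER", "PER", "O", "O"),    # evenly split
    ("PER", "O", "O", "O"),      # one dissenting engine
]
ranking = rank_for_annotation(example_votes)
```

Tokens at the front of `ranking` are those on which the engines disagree most, so they would be the first candidates for manual annotation.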
Here is a blueprint of the META-NER engine:
The input is a set of text documents and the output is a set of
corresponding documents where all recognized Named Entities are
tagged with the tag ENAMEX. The META-NER receives identified NE from
the following specific NER engines:
- GNR: a seed-based recognizer for NEs, which was developed at
DFKI as part of a Master's thesis
- BiQue-NER: an NER engine based on Collins and Singer, which was
also developed at DFKI.
- CRF-based NER: an NER engine based on Conditional Random Fields.
- OpenNLP NER: the NER engine of the opennlp project
During the software project, these NER engines have to be extended and
adapted to run for English and German, and for the foreseen domains.
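To make the input/output contract concrete, here is a minimal sketch of how recognized entities could be written back into a document as ENAMEX tags. The span representation `(start, end, type)`, the function name `tag_enamex`, and the `TYPE` attribute (borrowed from the MUC annotation convention) are assumptions for illustration; the project's actual output format may differ. The sketch does not escape XML special characters and rejects overlapping spans.

```python
from typing import List, Tuple

# Hypothetical span representation: (start, end, entity_type), end exclusive.
Span = Tuple[int, int, str]


def tag_enamex(text: str, spans: List[Span]) -> str:
    """Wrap each non-overlapping span of `text` in an ENAMEX element."""
    pieces = []
    pos = 0
    for start, end, etype in sorted(spans):
        if start < pos:
            raise ValueError("overlapping spans are not supported in this sketch")
        pieces.append(text[pos:start])  # untouched text before the entity
        pieces.append('<ENAMEX TYPE="%s">%s</ENAMEX>' % (etype, text[start:end]))
        pos = end
    pieces.append(text[pos:])  # remaining text after the last entity
    return "".join(pieces)


tagged = tag_enamex(
    "Christian Spurk works at DFKI.",
    [(25, 29, "ORGANIZATION"), (0, 15, "PERSON")],
)
```

Working with character offsets keeps the original text untouched, which makes it easier to compare the spans proposed by different engines before anything is written out.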
Additionally, we will also integrate:
- LingPipe: a widely used open source NER engine. We use
this one as our reference system. At DFKI we have already trained NE
models for English and German, so we will integrate LingPipe just as it is.
Now, some more
information, links and references about the specific NER engines
GNR is an NER engine for recognizing generalized
Named Entities, developed by
Christian Spurk as part of his Master's thesis.
Tasks to do: adapt to new corpora formats; adapt to German;
training and testing
Responsible students: Urbain Atsague Tchino and Youness El Ouafy
BiQue-NER is an NER engine based on the bootstrapping approach of
Collins and Singer. BiQue-NER has already been extended in the
following ways:
- Typed gazetteers
- Chunk parsing instead of full parsing
- Software package: is placed
ask me for current version
- Documentation about the BiQueNER:
Tasks to do: adapt
to new corpora formats; adapt to the chunker of the opennlp tools;
adapt to German; performance check of the training module
Responsible students: Margarita Kovacheva and Lourdes Lara
The CRF-based NER is based on Conditional
Random Fields (CRF) and uses the machine learning
toolkit Mallet. CRFs are already widely used for NER, but the
purpose of the software project is to learn how to develop, test and
evaluate advanced self-made software. Note that the CRF-based NER will
be a supervised one, so careful checking of the corpora format and
annotation is important.
Tasks to do: adapt
to the used corpora formats; adapt to the POS tagger or chunker
of the Mallet tools or opennlp
Responsible students: Benjamin Roth and Joo-Eun Feit
The starting point here is the NER of the opennlp
tools. This NER already comes with a model for English.
Tasks to do: adapt to the used corpora
Responsible students: Ekaterina Biehl and Kirill Slutsky and
The group has developed a first version of a POS tagger &
Chunker for German based on the
Tiger corpus. It can be downloaded from http://alexander.volokh.de/maps/projekt.zip
Note that the performance of both tools can still be improved, and
it will be, so check regularly for updates.
The META-NER will probably not be created during this software project,
but by one of my Master's students. We will probably use a stacking
approach, cf. "Issues in Stacked Generalization" by Ting
and Witten, 1998.
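To illustrate the idea of stacking, here is a deliberately simple sketch in which the level-1 (meta) learner is a lookup table from the base engines' label pattern to the most frequent gold label. This is an assumption for illustration, not the planned META-NER: the names `train_meta` and `predict_meta` are hypothetical, and a real stacked system would use a proper classifier and would train the meta level on held-out (for example cross-validated) predictions of the base engines.

```python
from collections import Counter, defaultdict
from typing import Dict, Sequence, Tuple

# One label per base engine for a single token, in a fixed engine order.
Votes = Tuple[str, ...]


def train_meta(base_preds: Sequence[Votes], gold: Sequence[str]) -> Dict[Votes, str]:
    """Level-1 learner: map each observed vote pattern to its most frequent gold label."""
    counts: Dict[Votes, Counter] = defaultdict(Counter)
    for votes, label in zip(base_preds, gold):
        counts[votes][label] += 1
    return {votes: c.most_common(1)[0][0] for votes, c in counts.items()}


def predict_meta(model: Dict[Votes, str], votes: Votes) -> str:
    """Use the learned pattern if it was seen; otherwise fall back to a majority vote."""
    if votes in model:
        return model[votes]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical held-out predictions of three engines and the gold labels.
held_out = [
    ("PER", "O", "PER"),
    ("PER", "O", "PER"),
    ("O", "ORG", "O"),
    ("O", "ORG", "O"),
    ("O", "O", "O"),
]
gold_labels = ["PER", "PER", "ORG", "ORG", "O"]
meta_model = train_meta(held_out, gold_labels)
```

In this toy data the second engine is the only one that finds organizations, and the meta level learns to trust it in that situation even though a plain majority vote would output `O`.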
Some more information about Machine Learning for Named