Information Extraction


Computational Linguistics and Computer Science
Master Programme
Winter Semester 2010



Note for the CS students of my course: the signed certificates ("benoteter Schein") for the exam (not yet for the re-exam) can be picked up at the examination office ("Prüfungssekretariat") of the CL department.


General Information

Lecturer: Günter Neumann


In this course we will consider methods and strategies for Information Extraction (IE) from unstructured (text-based) and semi-structured (Web-based) sources. We will start by presenting an overview of IE. We will then present recent methods for Named Entity identification and recognition, relation extraction, mining meaning from Wikipedia, and Machine Reading with a focus on Machine Learning approaches.

Time and location: Each Friday, 10 am to 12 noon, room 2.11, building C7.2.

Lecture Language: English

Available Certificate Modalities:

Placement in Study Programme:



Session Number
Introduction to the field, NL as normalization, standard definition of information
Architecture and Task Definition
Problem definition, gazetteers, rule-based NER; Machine Learning, CoNLL shared task for NER, feature extraction for NER, HMM-based NER
MEM-based NER, Co-training
No lecture (project review meeting)
Generalized names, algorithm NOMEN, induction of context patterns, trigger words, learning k-reversible grammar
Closed class vs. open class words, noun group recognition, algorithm ccChunk, technical term extraction, seed context rules, bootstrapping, positive and negative validation set, recognition of protein names, AIMED corpus
Wikipedia as a resource for mining meaning, NLP tasks in which Wikipedia has been used, Semantic Relatedness, Wikirelate, Explicit Semantic Analysis (ESA), Wikipedia hyperlinks, Word Sense Disambiguation, Wikification, Learning to link in Wikipedia, Learning to detect links
Creating large NE gazetteers from Wikipedia; Cosine similarity; Linking WordNet and Wikipedia; Relation Extraction; Seed-based Relation Extraction from free text with SNOWBALL
PORE: Positive-Only Relation Extraction from Wikipedia Text; Unsupervised Relation Extraction by Mining Wikipedia Texts Using Information from the Web; Patterns from Dependency Parsing: Clustering
Automatic extraction of seeds for REE; WordNet; Infoboxes; system Kylin; Introduction to ontology extraction from Wikipedia
Yago, DBPedia, EMLR; shallow parsing of Wikipedia's category labels, SPARQL, extracting relations from category system
Machine Reading, Open IE, Self-supervised learning, TextRunner, PMI
Final lecture


E. Agichtein, L. Gravano: Snowball: extracting relations from large plain-text collections. ACM DL 2000: 85-94.
Keywords: Relation extraction from free text, seed-based approach, Clustering of patterns, automatic pattern evaluation, automatic tuple evaluation and filtering, estimate precision by sampling.

D. Appelt and D. Israel (1999). Introduction to Information Extraction. Script for a tutorial held at the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), Stockholm, Sweden.
Keywords: information extraction, natural language tasks, Message Understanding Conference (MUC), rule-based IE systems, finite-state parsing, FASTUS system, coreference in IE, template merging.

M. Banko, M. Cafarella, S. Soderland, M. Broadhead and O. Etzioni (2007). Open Information Extraction from the Web. Twentieth International Joint Conference on Artificial Intelligence (IJCAI-07), Hyderabad, India.
Keywords: Open Information Extraction, TextRunner, self-supervised learning.

D. M. Bikel, R. Schwartz and R. M. Weischedel (1999). An Algorithm that Learns What's in a Name. Machine Learning 34(1-3): 211-231.
Keywords: named entity extraction, hidden Markov models (HMM).

A. Borthwick (1999). A Maximum Entropy Approach to Named Entity Recognition. Ph.D. thesis, New York University.
Keywords: named entity extraction, Maximum Entropy Modeling (MEM).

C. Chang, M. Kayed, M. Girgis and K. Shaalan (2006). A Survey of Web Information Extraction Systems. IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 10, pp. 1411-1428.
Keywords: semi-structured documents, Web Mining, Wrapper, Wrapper induction, classification of IE systems.

M. Collins and Y. Singer (1999). Unsupervised Models for Named Entity Classification. EMNLP/VLC-99.
Keywords: Co-training algorithm for NER, bootstrapping, seed-based NER, spelling and context features.

J. Cowie and Y. Wilks (2000). Information Extraction. R. Dale, H. Moisl and H. Somers (eds.) Handbook of Natural Language Processing. New York: Marcel Dekker.
Keywords: MUC-systems, evaluation and template design, generic IE system.

K. Eichler, H. Hemsen, and G. Neumann (2009). Unsupervised and domain-independent extraction of technical terms from scientific articles in digital libraries. In proceedings of the Workshop Information Retrieval 2009 organized as part of LWA, Darmstadt.
Keywords: closed-class context patterns for noun group recognition, ccChunk version 1, identification and extraction of technical terms, search engine frequency counts, PMI/IR-based approach, DBPedia for automatic categorization.

K. Eichler and G. Neumann (2010). Bootstrapping Noun Groups Using Closed-Class Elements Only. In KDML 2010: Knowledge Discovery, Data Mining, and Machine Learning, Kassel, Germany.
Keywords: ccChunk version 2, automatic domain adaptation, seed context patterns, positive and negative examples, bootstrapping domain-specific validation lists, CoNLL 2000 evaluation data.

W. Lin, R. Yangarber, R. Grishman (2003). Bootstrapped Learning of Semantic Classes from Positive and Negative Examples. 20th International Conference on Machine Learning: ICML 2003 Workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining.
Keywords: Algorithm NOMEN, positive and negative examples.

L. Floridi (2005). Is Information Meaningful Data? Philosophy and Phenomenological Research, 2005, 70.2, 351-370.
Keywords: standard definition of information, pseudo information, GATE system.

P. McNamee and H. T. Dang (NIST) (2009). Overview of the TAC 2009 Knowledge Base Population Track. Text Analysis Conference 2009.

O. Medelyan, D. Milne, C. Legg, I. H. Witten (2008). Mining Meaning From Wikipedia. Working Paper: 11/2008, September 2008, University of Waikato, New Zealand.
Keywords: Introduction of Wikipedia as a knowledge extraction resource; Yago, DBPedia, EMLR.

G. Neumann and J. Piskorski (2002). A Shallow Text Processing Core Engine. Computational Intelligence, Volume 18, Number 3, pages 451-476.
Keywords: German IE system, Natural Language as normalization, divide-and-conquer parsing.

G. Neumann (2009). Text-basiertes Informationsmanagement. In K.-U. Carstensen, C. Ebert, C. Endriss, S. Jekat, R. Klabunde and H. Langer (eds.), Computerlinguistik und Sprachtechnologie. Eine Einführung. 3rd revised and extended edition, Heidelberg: Spektrum Akademischer Verlag, pages 576-615.
Keywords: Information Retrieval, Named Entity recognition, relation extraction, question answering, text summarization.

D. Nadeau and S. Sekine (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1).

P. P. Talukdar, T. Brants, M. Liberman, and F. Pereira (2006). A Context Pattern Induction Method for Named Entity Extraction. Tenth Conference on Computational Natural Language Learning (CoNLL-X).
Keywords: context pattern induction, trigger words, learning k-reversible grammars.

Y. Yan, N. Okazaki, Y. Matsuo, Z. Yang, M. Ishizuka (2009). Unsupervised Relation Extraction by Mining Wikipedia Texts Using Information from the Web. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009.
Keywords: unsupervised relation extraction from Wikipedia, dependency patterns, clustering.

G. Wang, Y. Yu, and H. Zhu (2007). PORE: Positive-Only Relation Extraction from Wikipedia Text. Lecture Notes in Computer Science, 2007, Volume 4825/2007, 580-594, DOI: 10.1007/978-3-540-76298-0_42.
Keywords: relation extraction from Wikipedia, system PORE, Positive-Only Relation Extraction, bootstrapping.

F. Wu, R. Hoffmann, and D. S. Weld (2008). Information Extraction from Wikipedia: Moving Down the Long Tail. 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-08), Las Vegas, NV, August 24-27, 2008.
Keywords: system Kylin, self-supervision, computing positive training examples from Infoboxes.