Information Extraction


Computational Linguistics and Computer Science
Master Programme
Winter Semester 2011



The re-exam will take place on 23 April from 12:00 to 14:00 (noon to 2 pm) in the seminar room of the Coli building (C72, ground floor).

Note also that only registered students can participate. The registration deadline is 31 March 2012. Students who cannot register via the HISPOS system because they have no account, i.e., Erasmus students, must send me an email or a written letter by 31 March 2012 stating that they will bindingly participate in the exam. Otherwise, participation is not possible!

The certificates for the exam (for CS students only) can be picked up at the office of Prof. Uszkoreit, room 3.09, building C72. The office is open Monday to Thursday in the morning.


General Information

Reader: Günter Neumann


In this course we will consider methods and strategies for Information Extraction (IE) from unstructured (text-based) and semi-structured (Web-based) sources. We will mainly consider Machine Learning (ML) based approaches to IE, with a focus on semi-supervised and unsupervised methods. We will start with an overview of IE and discuss the notion of information in the context of IE. We then present recent methods for named entity recognition and entity set expansion using different ML algorithms (supervised and semi-supervised) and different granularities of linguistic features (from parsing to character-level features). Finally, we introduce the terms Machine Reading and Open Information Extraction (OpenIE), and present different approaches to unsupervised relation extraction in the area of OpenIE.

Time and location: Each Thursday, 12 noon to 2 pm, room 1.17, building C74.

Lecture Language: English

Available Certificate Modalities:

Placement in Study Programme:



Session Number

1. Organizational matters and introduction to the field
2. Information Extraction
   Architecture and Task Definition, Linguistic Feature Extraction
3. Named Entity Extraction (NER)
   problem definition, gazetteers, rule-based NER; Machine Learning, CoNLL shared task for NER, feature extraction for NER
4. Named Entity Extraction (NER)
   Supervised approaches to NE recognition, HMM, MEM; overview of bootstrapping of NE lists
5. Co-Training for NER
   Co-training algorithms, decision lists, spelling rules, context rules
6. Generalized Names Recognition and Context Pattern Induction Method
   Algorithm Nomen, generalized names, positive and negative seed examples, ranking patterns and entities; context patterns, context trigger words, k-reversible grammars
7. Entity Set Expansion
   entity set expansion, co-occurrence statistics, learning membership function, vector space model, identification of enumerations, Wikipedia-based evaluation
8. Entity Set Refinement and Semantic Labeling of Entity Sets
   removing errors from expanded sets, user relevance feedback, distributional hypothesis, methods SIM and FMM, approximate cosine similarity, automatic labeling of clusters, TFIDF-based strategy for semantic labeling
9. Machine Reading and Open Information Extraction
   Machine Reading, traditional IE versus Open IE; self-supervised learning, TextRunner, relation extraction, synonymy
10. Open IE: TextRunner
   system TextRunner, extraction of binary relations, self-training, single-pass, estimating the correctness of extracted facts, relation synonymy
11. Open IE: WOE
   system WOE, self-supervised training on Wikipedia, infoboxes, shallow features, dependency-parse features, corePath, generalized-corePath
12. Open IE: ReVerb and Hybrid Information Extraction
   system ReVerb, incoherent and uninformative extraction, lexical constraints, relation-driven extraction, comparison of TextRunner, WOE and ReVerb; Hybrid Information Extraction; examples for NER, REE


D. Appelt and D. Israel (1999). Introduction to Information Extraction. Script for a tutorial held at the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), Stockholm, Sweden.
Keywords: information extraction, natural language tasks, Message Understanding Conference (MUC), rule-based IE systems, finite-state parsing, FASTUS system, coreference in IE, template merging.
Relevant session: 2

M. Banko, M. Cafarella, S. Soderland, M. Broadhead and O. Etzioni (2007). Open Information Extraction from the Web. Twentieth International Joint Conference on Artificial Intelligence (IJCAI-07), Hyderabad, India.
Keywords: Open Information Extraction, TextRunner, self-supervised learning.
Relevant session: 9, 10

D. M. Bikel, R. Schwartz and R. M. Weischedel (1999). An Algorithm that Learns What's in a Name. Machine Learning 34(1-3): 211-231.
Keywords: named entity extraction, hidden Markov models (HMM).
Relevant session: 4

A. Borthwick (1999). A Maximum Entropy Approach to Named Entity Recognition. Ph.D. thesis, New York University.
Keywords: named entity extraction, Maximum Entropy Modeling (MEM).
Relevant session: 4

Colin de la Higuera (2006). Learning K-Reversible Languages. Slides.
Keywords: k-reversible languages, learning.
Relevant session: 6

C. Chang, M. Kayed, M. Girgis and K. Shaalan (2006). A Survey of Web Information Extraction Systems. IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 10, pp. 1411-1428.
Keywords: semi-structured documents, Web Mining, Wrapper, Wrapper induction, classification of IE systems.
Relevant session: 1, 2

M. Collins and Y. Singer (1999). Unsupervised Models for Named Entity Classification. EMNLP/VLC-99.
Keywords: Co-training algorithm for NER, bootstrapping, seed-based NER, spelling and context rules.
Relevant session: 5

J. Cowie and Y. Wilks (2000). Information Extraction. R. Dale, H. Moisl and H. Somers (eds.) Handbook of Natural Language Processing. New York: Marcel Dekker.
Keywords: MUC-systems, evaluation and template design, generic IE system.
Relevant session: 2

Benjamin Van Durme and Marius Pasca (2008). Finding Cars, Goddesses and Enzymes: Parametrizable Acquisition of Labeled Instances for Open-Domain Information Extraction. In Proceedings of the 23rd Annual Conference on Artificial Intelligence (AAAI-2008).
Keywords: automatic labeling of entity groups, filtering of is-a extraction pairs through distributionally similar terms, TFIDF-based strategy.
Relevant session: 8

W. Lin, R. Yangarber, R. Grishman (2003). Bootstrapped Learning of Semantic Classes from Positive and Negative Examples. 20th International Conference on Machine Learning: ICML 2003 Workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining.
Keywords: Algorithm NOMEN, positive and negative examples.
Relevant session: 6

A. Fader, S. Soderland and O. Etzioni (2011). Identifying Relations for Open Information Extraction. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-11).
Keywords: system ReVerb, incoherent and uninformative extraction, lexical constraints, relation-driven extraction, comparison of TextRunner, WOE and ReVerb.
Relevant session: 12

L. Floridi (2005). Is Information Meaningful Data? Philosophy and Phenomenological Research, 2005, 70.2, 351-370.
Keywords: standard definition of information, pseudo information, GATE system.
Relevant session: 1

P. McNamee and H. T. Dang (NIST) (2009). Overview of the TAC 2009 Knowledge Base Population Track. Text Analysis Conference 2009.
Relevant session: 2

G. Neumann and J. Piskorski (2002). A Shallow Text Processing Core Engine. Journal Computational Intelligence, Volume 18, Number 3, 2002, pages 451-476.
Keywords: German IE system, Natural Language as normalization, divide-and-conquer parsing.
Relevant session: 3

G. Neumann (2006). A Hybrid Machine Learning Approach for Information Extraction from Free Texts. From Data and Information Analysis to Knowledge Engineering, Springer series: Studies in Classification, Data Analysis, and Knowledge Organization, pages 390-397, Springer-Verlag Berlin, Heidelberg, New-York.
Keywords: Hybrid Information Extraction; combining MEM and DOP (data-oriented parsing).
Relevant session: 12

G. Neumann (2009). Text-basiertes Informationsmanagement. In K.-U. Carstensen, C. Ebert, C. Endriss, S. Jekat, R. Klabunde and H. Langer (eds.) (2009). Computerlinguistik und Sprachtechnologie. Eine Einführung. 3rd revised and extended edition, Heidelberg: Spektrum Akademischer Verlag, pages 576-615.
Keywords: Information Retrieval, Named Entity recognition, relation extraction, question answering, text summarization.
Relevant session: 1, 2

D. Nadeau and S. Sekine (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1).
Relevant session: 3

L. Sarmento, V. Jijkoun, M. de Rijke and E. Oliveira (2007). "More like these": growing entity classes from seeds. In Proceedings of CIKM-07, pp. 959-962. Lisbon, Portugal.
Keywords: entity set expansion, co-occurrence statistics, learning membership function, vector space model, identification of enumerations, Wikipedia-based evaluation.
Relevant session: 7

G. Sigletos, G. Paliouras, C. D. Spyropoulos and M. Hatzopoulos (2005). Combining Information Extraction Systems Using Voting and Stacked Generalization. Journal of Machine Learning Research, Vol. 6 (2005), pp. 1751-1782.
Keywords: Hybrid Information Extraction.
Relevant session: 12

P. P. Talukdar, T. Brants, M. Liberman, and F. Pereira (2006). A Context Pattern Induction Method for Named Entity Extraction. Tenth Conference on Computational Natural Language Learning (CoNLL-X).
Keywords: context pattern induction, trigger words, learning k-reversible grammars.
Relevant session: 6

A. Yates, M. Banko, M. Broadhead, M. J. Cafarella, O. Etzioni and S. Soderland (2007). TextRunner: Open Information Extraction on the Web. HLT-NAACL (Demonstrations) 2007.
Keywords: unsupervised relation extraction from Wikipedia, dependency patterns, clustering.
Relevant session: 9, 10

Vishnu Vyas and Patrick Pantel (2009). Semi-Automatic Entity Set Refinement. In Proceedings of the North American Chapter of the Association for Computational Linguistics / Human Language Technology (NAACL/HLT-09).
Keywords: refinement of expanded sets, user relevance feedback, learning to recognize false positives, SIM and FMM methods, approximate cosine similarity.
Relevant session: 8

F. Wu, R. Hoffmann and D. S. Weld (2008). Information Extraction from Wikipedia: Moving Down the Long Tail. 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-08), Las Vegas, NV, August 24-27, 2008.
Keywords: system Kylin, self-supervision, computing positive training examples from Infoboxes.
Relevant session: 11

F. Wu and D. S. Weld (2010). Open Information Extraction using Wikipedia. The Annual Meeting of the Association for Computational Linguistics (ACL-10), 2010.
Keywords: system WOE, novel self-supervised training method, using and comparing shallow linguistic features and dependency-parse features, comparison with TextRunner.
Relevant session: 11