Information Extraction

Lecture
Computational Linguistics and Computer Science
Master Programme
Winter Semester 2011

News

The re-exam will take place on 23 April from 12:00 to 14:00 (noon to 2 pm) in the seminar room of the Coli building (C72, ground floor).

General Information

In this course we will consider methods and strategies for Information Extraction (IE) from unstructured (text-based) and semi-structured (Web-based) sources. We will mainly consider Machine Learning (ML) based approaches to IE, with a focus on semi-supervised and unsupervised methods. We will start with an overview of IE and discuss the notion of information in the context of IE. We then present recent methods for named entity recognition and entity set expansion that use different ML algorithms (supervised and semi-supervised) and linguistic features of different granularity (from parsing to character-level features). Finally, we introduce the terms Machine Reading and Open Information Extraction (OpenIE) and present different approaches to unsupervised relation extraction in the area of OpenIE.

Schedule

Session Number | Date | Topic | Comments
1 | 27.10.2011 | Overview | Organizational matters and introduction to the field
2 | 03.11.2011 | Information Extraction | Architecture and task definition, linguistic feature extraction
3 | 10.11.2011 | Named Entity Recognition (NER) | Problem definition, gazetteers, rule-based NER; machine learning, CoNLL shared task for NER, feature extraction for NER
4 | 17.11.2011 | Named Entity Recognition (NER) | Supervised approaches to NE recognition, HMM, MEM; overview of bootstrapping of NE lists
5 | 24.11.2011 | Co-Training for NER | Co-training algorithms, decision lists, spelling rules, context rules
6 | 01.12.2011 | Generalized Names Recognition and Context Pattern Induction Method | NOMEN algorithm, generalized names, positive and negative seed examples, ranking patterns and entities; context patterns, context trigger words, k-reversible grammars
7 | 08.12.2011 | Entity Set Expansion | Entity set expansion, co-occurrence statistics, learning membership function, vector space model, identification of enumerations, Wikipedia-based evaluation
8 | 15.12.2011 | Entity Set Refinement and Semantic Labeling of Entity Sets | Removing errors from expanded sets, user relevance feedback, distributional hypothesis, SIM and FMM methods, approximate cosine similarity, automatic labeling of clusters, TF-IDF-based strategy for semantic labeling
9 | 22.12.2011 | Machine Reading and Open Information Extraction | Machine Reading, traditional IE versus Open IE; self-supervised learning, TextRunner, relation extraction, synonymy
10 | 12.01.2012 | Open IE: TextRunner | TextRunner system, extraction of binary relations, self-training, single-pass extraction, estimating the correctness of extracted facts, relation synonymy
11 | 19.01.2012 | Open IE: WOE | WOE system, self-supervised training with Wikipedia infoboxes, shallow features, dependency-parse features, corePath, generalized-corePath
12 | 26.01.2012 | Open IE: ReVerb and Hybrid Information Extraction | ReVerb system, incoherent and uninformative extractions, lexical constraints, relation-driven extraction, comparison of TextRunner, WOE and ReVerb; Hybrid Information Extraction; examples for NER, REE
13 | 02.02.2012 | Summary | Summary

References

M. Banko, M. Cafarella, S. Soderland, M. Broadhead and O. Etzioni (2007). Open Information Extraction from the Web. Twentieth International Joint Conference on Artificial Intelligence (IJCAI-07), Hyderabad, India.
Keywords: Open Information Extraction, TextRunner, self-supervised learning.
Relevant session: 9, 10

D. M. Bikel, R. Schwartz and R. M. Weischedel (1999). An Algorithm that Learns What's in a Name. Machine Learning 34(1-3): 211-231.
Keywords: named entity extraction, hidden Markov models (HMM).
Relevant session: 4

Colin de la Higuera (2006). Learning K-Reversible Languages. Slides.
Keywords: k-reversible languages, learning.
Relevant session: 6

C. Chang, M. Kayed, M. Girgis and K. Shaalan (2006). A Survey of Web Information Extraction Systems. IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 10, pp. 1411-1428.
Keywords: semi-structured documents, Web Mining, Wrapper, Wrapper induction, classification of IE systems.
Relevant session: 1, 2

M. Collins and Y. Singer (1999). Unsupervised Models for Named Entity Classification. EMNLP/VLC-99.
Keywords: Co-training algorithm for NER, bootstrapping, seed-based NER, spelling and context rules.
Relevant session: 5

J. Cowie and Y. Wilks (2000). Information Extraction. In R. Dale, H. Moisl and H. Somers (eds.), Handbook of Natural Language Processing. New York: Marcel Dekker.
Keywords: MUC-systems, evaluation and template design, generic IE system.
Relevant session: 2

W. Lin, R. Yangarber, R. Grishman (2003). Bootstrapped Learning of Semantic Classes from Positive and Negative Examples. 20th International Conference on Machine Learning: ICML 2003 Workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining.
Keywords: Algorithm NOMEN, positive and negative examples.
Relevant session: 6

A. Fader, S. Soderland and O. Etzioni (2011). Identifying Relations for Open Information Extraction. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-11).
Keywords: system ReVerb, incoherent and uninformative extraction, lexical constraints, relation-driven extraction, comparison of TextRunner, WOE and ReVerb.
Relevant session: 12

L. Floridi (2005). Is Information Meaningful Data? Philosophy and Phenomenological Research, 2005, 70.2, 351-370.
Keywords: standard definition of information, pseudo information, GATE system.
Relevant session: 1

G. Neumann and J. Piskorski (2002). A Shallow Text Processing Core Engine. Computational Intelligence, Volume 18, Number 3, pages 451-476.
Keywords: German IE system, Natural Language as normalization, divide-and-conquer parsing.
Relevant session: 3

G. Neumann (2006). A Hybrid Machine Learning Approach for Information Extraction from Free Texts. From Data and Information Analysis to Knowledge Engineering, Springer series: Studies in Classification, Data Analysis, and Knowledge Organization, pages 390-397, Springer-Verlag Berlin, Heidelberg, New-York.
Keywords: Hybrid Information Extraction; combining MEM and DOP (data-oriented parsing).
Relevant session: 12

G. Neumann (2009). Text-basiertes Informationsmanagement. In K.-U. Carstensen, C. Ebert, C. Endriss, S. Jekat, R. Klabunde and H. Langer (eds.), Computerlinguistik und Sprachtechnologie. Eine Einführung. 3rd revised and extended edition, Heidelberg: Spektrum Akademischer Verlag, pages 576-615.
Keywords: Information Retrieval, Named Entity recognition, relation extraction, question answering, text summarization.
Relevant session: 1, 2

L. Sarmento, V. Jijkoun, M. de Rijke and E. Oliveira (2007). "More like these": growing entity classes from seeds. In Proceedings of CIKM-07, pp. 959-962. Lisbon, Portugal.
Keywords: entity set expansion, co-occurrence statistics, learning membership function, vector space model, identification of enumerations, Wikipedia-based evaluation.
Relevant session: 7

P. P. Talukdar, T. Brants, M. Liberman, and F. Pereira (2006). A Context Pattern Induction Method for Named Entity Extraction. Tenth Conference on Computational Natural Language Learning (CoNLL-X).
Keywords: context pattern induction, trigger words, learning k-reversible grammars.
Relevant session: 6

A. Yates, M. Banko, M. Broadhead, M. J. Cafarella, O. Etzioni and S. Soderland (2007). TextRunner: Open Information Extraction on the Web. HLT-NAACL (Demonstrations) 2007.
Keywords: unsupervised relation extraction from Wikipedia, dependency patterns, clustering.
Relevant session: 9, 10

Vishnu Vyas and Patrick Pantel (2009). Semi-Automatic Entity Set Refinement. In Proceedings of the North American Chapter of the Association for Computational Linguistics / Human Language Technology (NAACL/HLT-09).
Keywords: refinement of expanded sets, user relevance feedback, learning to recognize false positives, SIM and FMM methods, approximate cosine similarity.
Relevant session: 8

F. Wu, R. Hoffmann, and D. S. Weld (2008). Information Extraction from Wikipedia: Moving Down the Long Tail. 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-08), Las Vegas, NV, August 24-27, 2008.
Keywords: system Kylin, self-supervision, computing positive training examples from Infoboxes.
Relevant session: 11

F. Wu and D. S. Weld (2010). Open Information Extraction using Wikipedia. The Annual Meeting of the Association for Computational Linguistics (ACL-10).
Keywords: system WOE, novel self-supervised training method, using and comparing shallow linguistic features and dependency-parse features, comparison with TextRunner.
Relevant session: 11