Publications
- Noon Pokaratsiri, Saadullah Amin, and Günter Neumann (2024) Towards Understanding Attention-based Reasoning through Graph Structures in Medical Codes Classification. Proceedings of TextGraphs-17: Graph-based Methods for Natural Language Processing at ACL 2024 (TextGraphs 2024), Bangkok, Thailand, 2024.
A common approach to automatically assigning diagnostic and procedural clinical codes to health records is to solve the task as a multi-label classification problem. Difficulties associated with this task stem from domain knowledge requirements, long document texts, and a large and imbalanced label space reflecting the breadth of and dependencies between medical diagnoses and procedures. Decisions in the healthcare domain also need to demonstrate sound reasoning, both when they are correct and when they are erroneous. Existing works address some of these challenges by incorporating external knowledge, which can be encoded into a graph-structured format. Incorporating graph structures on the output label space or between the input document and output label spaces has shown promising results in medical codes classification. Limited focus has been put on utilizing graph-based representations of the input document space. To partially bridge this gap, we represent clinical texts as graph-structured data through the UMLS Metathesaurus; we explore implicit graph representation through pre-trained knowledge graph embeddings and explicit domain-knowledge-guided encoding of document concepts and relational information through graph neural networks. Our findings highlight the benefits of pre-trained knowledge graph embeddings in understanding the model's attention-based reasoning. In contrast, transparent domain knowledge guidance in graph encoder approaches is overshadowed by performance loss. Our qualitative analysis identifies limitations that contribute to prediction errors.
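To make the attention mechanism concrete, here is a minimal, hypothetical sketch of label-wise attention over document representations, where each position's vector could be a token embedding or a token embedding enriched with a pre-trained UMLS knowledge graph embedding; the names, dimensions, and architecture are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class LabelWiseAttention(nn.Module):
    """Per-code attention over document position representations; each position
    may be a token embedding or a token embedding concatenated with a pre-trained
    UMLS concept (knowledge graph) embedding -- an assumption for illustration."""

    def __init__(self, hidden_size, num_codes):
        super().__init__()
        self.query = nn.Linear(hidden_size, num_codes, bias=False)  # one attention query per code
        self.output = nn.Linear(hidden_size, 1)                     # shared scoring layer

    def forward(self, states):                                   # states: (batch, seq_len, hidden)
        attn = torch.softmax(self.query(states), dim=1)          # (batch, seq_len, num_codes)
        code_contexts = attn.transpose(1, 2) @ states            # (batch, num_codes, hidden)
        logits = self.output(code_contexts).squeeze(-1)          # (batch, num_codes)
        return logits, attn                                      # attn is what gets inspected

model = LabelWiseAttention(hidden_size=64, num_codes=10)
logits, attention = model(torch.randn(2, 50, 64))                # 2 documents, 50 positions each
```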
- Stalin Varanasi, Muhammad Umer Butt, and Günter Neumann (2023) AutoQIR: Auto-Encoding Questions with Retrieval Augmented Decoding for Unsupervised Passage Retrieval and Zero-shot Question Generation, Proceedings of Recent Advances in Natural Language Processing (RANLP-2023), Bulgaria, 2023.
Dense passage retrieval models have become state-of-the-art for information retrieval on many Open-domain Question Answering (ODQA) datasets. However, most of these models rely on supervision obtained from the ODQA datasets, which hinders their performance in a low-resource setting. Recently, retrieval-augmented language models have been proposed to improve both zero-shot and supervised information retrieval. However, these models have pre-training tasks that are agnostic to the target task of passage retrieval. In this work, we propose Retrieval Augmented Auto-encoding of Questions for zero-shot dense information retrieval. Unlike other pre-training methods, our pre-training method is built for the target information retrieval task, thereby making the pre-training more efficient. Our method consists of a dense IR model for encoding questions and retrieving documents during training and a conditional language model that maximizes the question's likelihood by marginalizing over retrieved documents. As a by-product, we can use this conditional language model for zero-shot question generation from documents. We show that the IR model obtained through our method improves the current state-of-the-art of zero-shot dense information retrieval, and we improve the results even further by training on a synthetic corpus created by zero-shot question generation.
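A minimal sketch of the marginalized training objective described in the abstract, assuming a per-document question log-likelihood from the decoder and unnormalized retrieval scores from the dense retriever; the function and variable names are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def marginal_question_nll(question_logprob_per_doc, retrieval_scores):
    """Negative log-likelihood of a question marginalized over retrieved documents.

    question_logprob_per_doc: (num_docs,) log p(question | doc_i) from the decoder
    retrieval_scores: (num_docs,) unnormalized similarity scores from the dense retriever
    """
    log_p_doc = F.log_softmax(retrieval_scores, dim=0)            # log p(doc_i | question)
    # log p(question) = log sum_i p(doc_i | question) * p(question | doc_i)
    log_marginal = torch.logsumexp(log_p_doc + question_logprob_per_doc, dim=0)
    return -log_marginal

# toy example with 3 retrieved documents
nll = marginal_question_nll(torch.tensor([-12.3, -9.8, -15.1]),
                            torch.tensor([0.2, 1.4, -0.5]))
```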
- Saadullah Amin, Pasquale Minervini, David Chang, Pontus Stenetorp, and Günter Neumann (2022) MedDistant19: Towards an Accurate Benchmark for Broad-Coverage Biomedical Relation Extraction, Proceedings of The 29th International Conference on Computational Linguistics (Coling-2022), October 12-17, 2022, Gyeongju, Republic of Korea
Relation extraction in the biomedical domain is challenging due to the lack of labeled data and high annotation costs, which require domain experts. Distant supervision is commonly used to tackle the scarcity of annotated data by automatically pairing knowledge graph relationships with raw texts. Such a pipeline is prone to noise and has added challenges to scale for covering a large number of biomedical concepts. We investigated existing broad-coverage distantly supervised biomedical relation extraction benchmarks and found a significant overlap between training and test relationships, ranging from 26% to 86%. Furthermore, we noticed several inconsistencies in the data construction process of these benchmarks, and where there is no train-test leakage, the focus is on interactions between narrower entity types. This work presents MedDistant19, a more accurate benchmark for broad-coverage distantly supervised biomedical relation extraction that addresses these shortcomings and is obtained by aligning MEDLINE abstracts with the widely used SNOMED Clinical Terms knowledge base. In the absence of thorough evaluations with domain-specific language models, we also conduct experiments validating general-domain relation extraction findings for biomedical relation extraction.
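A small sketch of the kind of train-test leakage check described above, under the assumption that a "relationship" is an entity pair appearing in a triple; the benchmark's exact overlap definition may differ.

```python
def relation_pair_overlap(train_triples, test_triples):
    """Fraction of test (head, tail) relationships already seen in training.

    Each triple is a (head_id, relation, tail_id) tuple; this check ignores the
    relation label and treats pairs as undirected, one plausible reading of
    "overlap between training and test relationships"."""
    train_pairs = {frozenset((h, t)) for h, _, t in train_triples}
    leaked = sum(1 for h, _, t in test_triples if frozenset((h, t)) in train_pairs)
    return leaked / max(len(test_triples), 1)

train = [("C001", "finding_site_of", "C002"), ("C003", "cause_of", "C004")]
test = [("C002", "associated_with", "C001"), ("C005", "cause_of", "C006")]
print(relation_pair_overlap(train, test))  # 0.5
```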
- Ioannis Dikeoulias, Saadullah Amin, and Günter Neumann (2022) Temporal Knowledge Graph Reasoning with Low-rank and Model-agnostic Representations. Proceedings of the 7th Workshop on Representation Learning for NLP (RepL4NLP-2022) at ACL 2022, pages 111-120, May 2022.
Temporal knowledge graph completion (TKGC) has become a popular approach for reasoning over event and temporal knowledge graphs, targeting the completion of knowledge with accurate but missing information. In this context, tensor decomposition has successfully modeled interactions between entities and relations. Its effectiveness in static knowledge graph completion motivates us to introduce Time-LowFER, a family of parameter-efficient and time-aware extensions of the low-rank tensor factorization model LowFER. Noting several limitations in current approaches to represent time, we propose a cycle-aware time-encoding scheme for time features, which is model-agnostic and offers a more generalized representation of time. We implement our methods in a unified temporal knowledge graph embedding framework, focusing on time-sensitive data processing. The experiments show that our proposed methods perform on par with or better than the state-of-the-art semantic matching models on two benchmarks.
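A toy sketch of a cycle-aware time encoding in the spirit described above, mapping each cyclical unit onto the unit circle so that neighboring points in the cycle stay close; the paper's actual feature set, granularity, and normalization are likely different.

```python
import numpy as np

def cyclic_time_features(year, month, day):
    """Encode each cyclical time unit as a point on the unit circle so that,
    e.g., December and January end up close together. Periods and the year
    normalization are illustrative placeholders."""
    def circle(value, period):
        angle = 2.0 * np.pi * value / period
        return [np.sin(angle), np.cos(angle)]
    return np.array(circle(month, 12) + circle(day, 31) + [year / 2022.0])

print(cyclic_time_features(2014, 12, 31))
print(cyclic_time_features(2015, 1, 1))   # nearby in the cyclic dimensions
```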
- Saadullah Amin, Noon Pokaratsiri, Morgan Wixted, Alejandro García-Rudolph, Catalina Martínez-Costa, and Günter Neumann (2022) Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts. Proceedings of the 21st Workshop on Biomedical Language Processing (BioNLP-2022) at ACL 2022, pages 200-211, May 2022.
Despite the advances in digital healthcare systems offering curated structured knowledge, much of the critical information still lies in large volumes of unlabeled and unstructured clinical texts. These texts, which often contain protected health information (PHI), are exposed to information extraction tools for downstream applications, risking patient identification. Existing works in de-identification rely on using large-scale annotated corpora in English, which often are not suitable in real-world multilingual settings. Pre-trained language models (LM) have shown great potential for cross-lingual transfer in low-resource settings. In this work, we empirically show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to solve a low-resource and real-world challenge of code-mixed (Spanish-Catalan) clinical notes de-identification in the stroke domain. We annotate a gold evaluation dataset to assess few-shot setting performance where we only use a few hundred labeled examples for training. Our model improves the zero-shot F1-score from 73.7% to 91.2% on the gold evaluation set when adapting Multilingual BERT (mBERT) from the MEDDOCAN corpus with our few-shot cross-lingual target corpus. When generalized to an out-of-sample test set, the best model achieves a human-evaluation F1-score of 97.2%.
- Stalin Varanasi, Saadullah Amin and Günter Neumann (2021) AutoEQA: Auto-Encoding Questions for Extractive Question Answering, The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP-2021), Nov. 2021.
There has been significant progress in the field of extractive question answering (EQA) in recent years. However, most approaches rely on annotations of answer spans in the corresponding passages. In this work, we address the problem of EQA when no annotations are present for the answer span, i.e., when the dataset contains only questions and corresponding passages. Our method is based on auto-encoding of the question, which performs a question answering (QA) task during encoding and a question generation (QG) task during decoding. Our method performs well in a zero-shot setting and can provide an additional loss that boosts performance for EQA.
- Saadullah Amin and Günter Neumann (2021) T2NER: Transformers based Transfer Learning Framework for Named Entity Recognition, The 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL), demo session, 2021.
Recent advances in deep transformer models have achieved state-of-the-art results in several natural language processing (NLP) tasks, whereas named entity recognition (NER) has traditionally benefited from long short-term memory (LSTM) networks. In this work, we present a Transformers based Transfer Learning framework for Named Entity Recognition (T2NER) created in PyTorch for the task of NER with deep transformer models. The framework is built upon the Transformers library as the core modeling engine and supports several transfer learning scenarios from sequential transfer to domain adaptation, multi-task learning, and semi-supervised learning. It aims to bridge the gap between the algorithmic advances in these areas by combining them with the state-of-the-art in transformer models to provide a unified platform that is readily extensible and can be used both for transfer learning research in NER and for real-world applications. The framework is available at: https://github.com/suamin/t2ner.
- Ekaterina Loginova, Stalin Varanasi and Günter Neumann (2021) Towards End-to-End Multilingual Question Answering. In Information Systems Frontiers, 23(1): 227-241 (2021).
Multilingual question answering (MLQA) is a critical part of an accessible natural language interface. However, current solutions demonstrate performance far below that of monolingual systems. We believe that deep learning approaches are likely to improve performance in MLQA drastically. This work aims to discuss the current state-of-the-art and remaining challenges. We outline requirements and suggestions for practical parallel data collection and describe existing methods, benchmarks and datasets. We also demonstrate that a simple translation of texts can be inadequate in case of Arabic, English and German languages (on InsuranceQA and SemEval datasets), and thus more sophisticated models are required. We hope that our overview will re-ignite interest in multilingual question answering, especially with regard to neural approaches.
- Stalin Varanasi, Saadullah Amin, and Günter Neumann (2020) CopyBERT: A Unified Approach to Question Generation with Self-Attention. Proceedings of the 2nd Workshop on NLP for Conversational AI, ACL workshop, 2020.
Contextualized word embeddings provide better initialization for neural networks that deal with various natural language understanding (NLU) tasks, including Question Answering (QA) and, more recently, Question Generation (QG). Apart from providing meaningful word representations, pre-trained transformer models such as BERT also provide self-attentions which encode syntactic information that can be probed for dependency parsing and POS tagging. In this paper, we show that the information from the self-attentions of BERT is useful for language modeling of questions conditioned on paragraph and answer phrases. To control the attention span, we use a semi-diagonal mask and utilize a shared model for encoding and decoding, unlike sequence-to-sequence models. We further employ a copy mechanism over self-attentions to achieve state-of-the-art results for Question Generation on the SQuAD dataset.
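A rough sketch of how a semi-diagonal attention mask could be constructed for a shared encoder/decoder as described above: context (paragraph + answer) tokens attend bidirectionally, while question tokens attend to the full context plus only earlier question tokens; this is an illustrative reading, not the paper's exact mask.

```python
import torch

def semi_diagonal_mask(num_context_tokens, num_question_tokens):
    """Attention mask for a single BERT model used as both encoder and decoder:
    context tokens attend to each other bidirectionally, question tokens attend
    to the full context and only to earlier question tokens (causal part),
    which yields the 'semi-diagonal' shape. 1 = attention allowed, 0 = blocked."""
    n = num_context_tokens + num_question_tokens
    mask = torch.zeros(n, n)
    mask[:, :num_context_tokens] = 1                       # every token sees the context
    causal = torch.tril(torch.ones(num_question_tokens, num_question_tokens))
    mask[num_context_tokens:, num_context_tokens:] = causal
    return mask

print(semi_diagonal_mask(3, 2))
```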
- Saadullah Amin, Stalin Varanasi, Katherine Dunfield and Günter Neumann (2020) LowFER: Low-rank Bilinear Pooling for Link Prediction. Proceedings of the 37th International Conference on Machine Learning (ICML-2020), 2020.
Knowledge graphs are incomplete by nature, with only a limited number of observed facts from the world knowledge being represented as structured relations between entities. To partly address this issue, an important task in statistical relational learning is that of link prediction or knowledge graph completion. Both linear and non-linear models have been proposed to solve the problem. Bilinear models, while expressive, are prone to overfitting and lead to quadratic growth of parameters in the number of relations. Simpler models have become more standard, with certain constraints on the bilinear map as relation parameters. In this work, we propose a factorized bilinear pooling model, commonly used in multi-modal learning, for better fusion of entities and relations, leading to an efficient and constraint-free model. We prove that our model is fully expressive, providing bounds on the embedding dimensionality and factorization rank. Our model naturally generalizes the Tucker decomposition based TuckER model, which has been shown to generalize other models, as an efficient low-rank approximation without substantially compromising the performance. Due to the low-rank approximation, the model complexity can be controlled by the factorization rank, avoiding the possible cubic growth of TuckER. Empirically, we evaluate on real-world datasets, reaching on par or state-of-the-art performance. At extremely low ranks, the model preserves the performance while staying parameter efficient.
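An illustrative sketch of a factorized bilinear pooling score in the style described above, with low-rank factor matrices and sum pooling over the factorization rank k; the exact pooling order, scaling, and regularization used in LowFER may differ.

```python
import torch

def factorized_bilinear_score(e_s, e_r, e_o, U, V, k):
    """Score for a triple (s, r, o) via low-rank factorized bilinear pooling.

    e_s, e_o: entity embeddings of size d_e; e_r: relation embedding of size d_r.
    U: (d_e, k * d_e) and V: (d_r, k * d_e) are the low-rank factor matrices;
    k is the factorization rank. Sum pooling over k fuses subject and relation."""
    d_e = e_s.shape[0]
    fused = (U.t() @ e_s) * (V.t() @ e_r)           # element-wise product, size k * d_e
    pooled = fused.view(k, d_e).sum(dim=0) / k      # non-overlapping sum pooling over k
    return pooled @ e_o                             # match against the object embedding

d_e, d_r, k = 8, 6, 3
score = factorized_bilinear_score(torch.randn(d_e), torch.randn(d_r), torch.randn(d_e),
                                  torch.randn(d_e, k * d_e), torch.randn(d_r, k * d_e), k)
```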
- Eleni Metheniti and Günter Neumann (2020) Wikinflection Corpus: A (Better) Multilingual, Morpheme-Annotated Inflectional Corpus. In LREC - International Conference on Language Resources and Evaluation (LREC-2020) May 1-4 LREC 5/2020.
Multilingual, inflectional corpora are a scarce resource in the NLP community, especially corpora with annotated morpheme boundaries. We evaluate a generated, multilingual inflectional corpus with morpheme boundaries, generated from the English Wiktionary (Metheniti and Neumann, 2018), against the largest, multilingual, high-quality inflectional corpus of the UniMorph project (Kirov et al., 2018). We confirm that the generated Wikinflection corpus is not of the same quality as UniMorph, but we were able to extract a significant number of words from the intersection of the two corpora. Our Wikinflection corpus benefits from the morpheme segmentations of Wiktionary/Wikinflection and from the manually-evaluated morphological feature tags of the UniMorph project, and has 216K lemmas and 5.4M word forms, in a total of 68 languages.
- Katherine Dunfield and Günter Neumann (2020) Automatic Quantitative Prediction of Severity in Fluent Aphasia Using Sentence Representation Similarity . In Proceedings of RaPID-2020 at LREC-2020.
Aphasia is a neurological language disorder that can severely impair a person’s language production or comprehension abilities. Due to the nature of impaired comprehension, as well as the lack of substantial annotated data of aphasic speech, quantitative measures of comprehension ability in aphasic individuals are not easily obtained directly from speech. Thus, the severity of some fluent aphasia types has remained difficult to automatically assess. We investigate six proposed features to capture symptoms of fluent aphasia — three of which are focused on aspects of impaired comprehension ability, and evaluate them on their ability to model aphasia severity. To combat the issue of data sparsity, we exploit the dissimilarity between aphasic and healthy speech by leveraging word and sentence representations from a large corpus of non-aphasic speech, with the hypothesis that conversational dialogue contains implicit signifiers of comprehension. We compare results obtained using different regression models, and present proposed feature sets which correlate (best Pearson p = 0.619) with Western Aphasia Battery-Revised Aphasia Quotient (WAB-R AQ). Our experiments further demonstrate that we can achieve an improvement over a baseline through the addition of the proposed features for both WAB-R AQ prediction and Auditory-Verbal Comprehension WAB sub-test score prediction.
- Anna Vechkaeva and Günter Neumann (2020) Latent Feature Generation with Adversarial Learning for Aphasia Classification . In Proceedings of RaPID-2020 at LREC-2020.
Aphasia is a language disorder resulting from brain damage, and can be categorised into types according to the symptoms. Automatic aphasia classification would allow for quick preliminary assessment of the patients' language disorder. A supervised approach to automatic aphasia classification would require a substantial amount of training data; however, aphasia data is sparse. In this work, we attempt to use data generation, namely Generative Adversarial Networks (GANs), to deal with data sparsity. The latent feature generation approach is used to deal with the non-differentiability problem of text generation, which is an issue for GANs. The approach of using artificially generated data to augment the training set was tested. We conclude through running a series of experiments that it has the potential to improve aphasia classification in the context of low-resource data, provided that the available data is enough for the generative model to properly learn the distribution.
- Saadullah Amin, Katherine Dunfield, Anna Vechkaeva and Günter Neumann (2020) A Data-driven Approach for Noise Reduction in Distantly Supervised Biomedical Relation Extraction . In Proceedings of BioNLP-2020 at ACL-2020.
Fact triples are a common form of structured knowledge used within the biomedical domain. As the amount of unstructured scientific texts continues to grow, manual annotation of these texts for the task of relation extraction becomes increasingly expensive. Distant supervision offers a viable approach to combat this by quickly producing large amounts of labeled, but considerably noisy, data. We aim to reduce such noise by extending an entity-enriched relation classification BERT model to the problem of multiple instance learning, and defining a simple data encoding scheme that significantly reduces noise, reaching state-of-the-art performance for distantly-supervised biomedical relation extraction. Our approach further encodes knowledge about the direction of relation triples, allowing for increased focus on relation learning by reducing noise and alleviating the need for joint learning with knowledge graph completion.
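A minimal sketch of the multiple-instance step, assuming all sentences mentioning the same entity pair form a bag and the bag encoding is the mean of the instance encodings; the paper's actual encoding scheme and aggregation may differ.

```python
import torch

def bag_logits(instance_representations, classifier_weight, classifier_bias):
    """Multiple-instance aggregation for distant supervision: all sentences
    (instances) mentioning the same entity pair form one bag, and the bag
    representation here is the average of the instance encodings before
    relation classification (one common MIL reduction, used as an assumption)."""
    bag_repr = instance_representations.mean(dim=0)           # (hidden,)
    return classifier_weight @ bag_repr + classifier_bias     # (num_relations,)

hidden, num_relations = 16, 5
instances = torch.randn(4, hidden)                            # 4 sentences in the bag
logits = bag_logits(instances, torch.randn(num_relations, hidden), torch.randn(num_relations))
```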
- Dominik Stammbach and Günter Neumann (2019) Team DOMLIN: Exploiting Evidence Enhancement for the FEVER Shared Task. In Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER), EMNLP workshop, 2019.
This paper contains our system description for the second Fact Extraction and VERification (FEVER) challenge. We propose a two-staged sentence selection strategy to account for examples in the dataset where evidence is not only conditioned on the claim, but also on previously retrieved evidence. We use a publicly available document retrieval module and have fine-tuned BERT checkpoints for sentence selection and as the entailment classifier. We report a FEVER score of 68.46% on the blind testset.
- Eleni Metheniti, Pomi Baram Park, and Günter Neumann (2019) Identifying grammar rules for language education with dependency parsing in German. In Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019), 2019.
We propose a method of determining the syntactic difficulty of a sentence, using syntactic patterns that identify grammatical rules on dependency parses. We have constructed a novel query language based on constraint-based dependency grammars and a grammar of German rules (relevant to primary school education) with patterns in our language. We annotated these rules with a difficulty score and grammatical prerequisites and built a matching algorithm that matches the dependency parse of a sentence in CoNLL-U format with its relevant syntactic patterns. We achieved 96% precision and 95% recall on a manually annotated set of sentences, and our best results using parses from four parsers are 88% and 84%, respectively.
- Saadullah Amin, Günter Neumann, Katherine Dunfield, Anna Vechkaeva, Kathryn Annette Chapman, and Morgan Kelly Wixted (2019) MLT-DFKI at CLEF eHealth 2019: Multi-label Classification of ICD-10 Codes with BERT. In working notes of CLEF eHealth, 2019.
With the adoption of electronic health record (EHR) systems, hospitals and clinical institutes have access to large amounts of heterogeneous patient data. Such data consists of structured (insurance details, billing data, lab results etc.) and unstructured (doctor notes, admission and discharge details, medication steps etc.) documents, of which the latter are of great significance for applying natural language processing (NLP) techniques. In parallel, recent advancements in transfer learning for NLP have pushed the state-of-the-art to new limits on many language understanding tasks. Therefore, in this paper, we present team DFKI-MLT's participation at CLEF eHealth 2019 Task 1 of automatically assigning ICD-10 codes to non-technical summaries (NTSs) of animal experiments, where we use various architectures in a multi-label classification setting and demonstrate the effectiveness of transfer learning with the pre-trained language representation model BERT (Bidirectional Encoder Representations from Transformers) and its recent variant BioBERT. We first translate task documents from German to English using an automatic translation system and then use BioBERT, which achieves an F1-micro of 73.02% on the submitted run as evaluated by the challenge.
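A small sketch of a multi-label classification head of the kind used for ICD-10 coding on top of a pooled document encoding, with one sigmoid per code and binary cross-entropy; the layer sizes and number of codes are placeholders, not the task's actual values.

```python
import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    """Multi-label ICD-code classifier on top of a pooled document encoding:
    one independent sigmoid per code, trained with binary cross-entropy."""
    def __init__(self, hidden_size, num_codes):
        super().__init__()
        self.linear = nn.Linear(hidden_size, num_codes)

    def forward(self, pooled_document, target_codes=None):
        logits = self.linear(pooled_document)
        loss = None
        if target_codes is not None:
            loss = nn.functional.binary_cross_entropy_with_logits(logits, target_codes)
        return logits, loss

# placeholder sizes: 768-dim pooled encoding, 100 candidate codes
head = MultiLabelHead(hidden_size=768, num_codes=100)
logits, loss = head(torch.randn(2, 768), torch.randint(0, 2, (2, 100)).float())
```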
- Dominik Stammbach, Stalin Varanasi, and Günter Neumann (2019) DOMLIN at SemEval-2019 Task 8: Automated Fact Checking exploiting Ratings in Community Question Answering Forums. In Proceedings of the International Workshop on Semantic Evaluation, 2019.
In the following, we describe our system developed for the SemEval-2019 Task 8. We fine-tuned a BERT checkpoint on the Qatar Living forum dump and used this checkpoint to train a number of models. Our hand-in for subtask A consists of a fine-tuned classifier from this BERT checkpoint. For subtask B, we first have a classifier deciding whether a comment is factual or non-factual. If it is factual, we retrieve intra-forum evidence and, using this evidence, have a classifier deciding the comment's veracity. We trained this classifier on ratings which we crawled from qatarliving.com.
- Alejandro Figueroa, Carlos Gómez-Pantoja, and Günter Neumann (2019) Integrating heterogeneous sources for predicting question temporal anchors across Yahoo! Answers. In Information Fusion, Volume 50, October 2019, Pages 112-125.
Modern Community Question Answering (CQA) web forums provide the possibility to browse their archives using question-like search queries as in Information Retrieval (IR) systems. Although these traditional IR methods have become very successful at fetching semantically related questions, they typically leave unconsidered their temporal relations. That is to say, a group of questions may be asked more often during specific recurring time lines despite being semantically unrelated. In fact, predicting temporal aspects would not only assist these platforms in widening the semantic diversity of their search results, but also in re-stating questions that need to refresh their answers and in producing more dynamic, especially temporally-anchored, displays.
In this paper, we devised a new set of time-frame specific categories for CQA questions, obtained by fusing two distinct earlier taxonomies (i.e., [29] and [50]). These new categories are then utilized in a large crowd-sourcing based human annotation effort. Accordingly, we present a systematic analysis of its results in terms of complexity and degree of difficulty as they relate to the different question topics.
Incidentally, through a large number of experiments, we investigate the effectiveness of a wider variety of linguistic features compared to what has been done in previous works. We additionally mix evidence/features distilled directly and indirectly from questions by capitalizing on their related web search results. We finally investigate the impact and effectiveness of multi-view learning to boost a large variety of multi-class supervised learners by optimizing a latent layer built on top of two views: one composed of features harvested from questions, and the other from CQA meta data and evidence extracted from web resources (i.e., snippets and Internet archives).
- Eleni Metheniti and Günter Neumann (2018) Wikinflection: Massive semi-supervised generation of multilingual inflectional corpus from Wiktionary. Proceedings of the 17th International Workshop on Treebanks and Linguistic Theories (TLT 2018), Linköping Electronic Conference Proceedings, Oslo, Norway, Linköping University Electronic Press, 12/2018.
Wiktionary is an open- and crowd-sourced dictionary which has been an important resource for natural language processing/understanding/generation tasks, but a big portion of the available information, such as inflection, is hard to retrieve and has not been widely utilized. In this paper, we describe our efforts to generate inflectional paradigms for lemmata of the English Wiktionary, by using both the dynamic links of the XML dump file and the static information of the web version. Our system can generate inflectional paradigms for 225K lemmata, with almost 8.5M forms from 1,708 inflectional templates, for over 150 languages, and after evaluating the generation, 216K lemmata and around 6M forms are of high quality. In addition, we retrieve morphological features, affixes and stem allomorphs for each paradigm and form. The system can produce a structured inflectional corpus from any version of the English Wiktionary XML dump file, and could also be adapted for other language versions. The first version of the source code is currently available online.
- Ekaterina Loginova, Stalin Varanasi, and Günter Neumann (2018) Towards Multilingual Neural Question Answering. 1st International Workshop on Artificial Intelligence for Question Answering (AIQA 2018), Communications in Computer and Information Science, Springer, Budapest, Hungary, 2018.
Cross-lingual and multilingual question answering is a critical part of a successful and accessible natural language interface. However, many current solutions are unsatisfactory. We believe that recent developments in deep learning approaches are likely to be efficient for question answering tasks spanning several languages. This work aims to discuss current achievements and remaining challenges. We outline requirements and suggestions for practical parallel data collection and describe existing methods and datasets. We also demonstrate that a simple translation of texts can be inadequate in case of Arabic, English and German languages (on InsuranceQA and SemEval datasets), and thus more sophisticated models are required. We hope that our findings will ignite interest in neural approaches to multilingual question answering.
- Ekaterina Loginova and Günter Neumann (2018) An Interactive Web-Interface for Visualizing the Inner Workings of the Question Answering LSTM. In proceedings of the Conference on Empirical Methods in Natural Language Processing - EMNLP-2018, October 31 – November 4, Brussels, Belgium, 2018.
Deep learning models for NLP are potent but not readily interpretable. This prevents researchers from improving a model's performance efficiently and users from applying it to tasks which require a high level of trust in the system. We present a visualisation tool which aims to illuminate the inner workings of a specific LSTM model for question answering. It plots heatmaps of neurons' firings and allows a user to check the dependency between neurons and manual features. The system possesses an interactive web-interface and can be adapted to other models and domains.
- Khyathi Raghavi Chandu, Ekaterina Loginova, Vishal Gupta, Josef van Genabith, Günter Neumann, Manoj Kumar Chinnakotla, Eric Nyberg, Alan W. Black (2018) Code-Mixed Question Answering Challenge: Crowd-sourcing Data and Techniques. In proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, ACL 2018, Melbourne, Australia, July 19, 2018.
Code-Mixing (CM) is the phenomenon of alternating between two or more languages which is prevalent in bi- and multi-lingual communities. Most NLP applications today are still designed with the assumption of a single interaction language and are most likely to break given a CM utterance with multiple languages mixed at a morphological, phrase or sentence level. For example, popular commercial search engines do not yet fully understand the intents expressed in CM queries. As a first step towards fostering research which supports CM in NLP applications, we systematically crowd-sourced and curated an evaluation dataset for factoid question answering in three CM languages - Hinglish (Hindi+English), Tenglish (Telugu+English) and Tamlish (Tamil+English) which belong to two language families. We share the details of our data collection process, techniques which were used to avoid inducing lexical bias amongst the crowd workers and other CM specific linguistic properties of the dataset. Our final dataset, which is available freely for research purposes, has 1,694 Hinglish, 2,848 Tamlish and 1,391 Tenglish factoid questions and their answers. We discuss the techniques used by the participants for the first edition of this ongoing challenge.
- Tyler Renslow and Günter Neumann (2018) LightRel at SemEval-2018 Task 7: Lightweight, Fast and Robust Relation Classification. In proceedings of SemEval-2018 - International Workshop on Semantic Evaluation, June, 2018.
We present LightRel, a lightweight, fast and robust relation classifier. Our goal is to develop a high baseline for different relation extraction tasks. By defining only very few data-internal, word-level features and external knowledge sources in the form of word clusters and word embeddings, we train a fast and simple linear classifier.
- Georg Heigold, Stalin Varanasi, Günter Neumann and Josef van Genabith (2018) How Robust Are Character-Based Word Embeddings in Tagging and MT Against Wrod Scramlbing or Randdm Nouse? In proceedings of AMTA 2018.
This paper investigates the robustness of NLP against perturbed word forms. While neural approaches can achieve (almost) human-like accuracy for certain tasks and conditions, they are often sensitive to small changes in the input such as non-canonical input (e.g., typos). Yet both stability and robustness are desired properties in applications involving user-generated content, all the more so as humans easily cope with such noisy or adversarial conditions. In this paper, we study the impact of noisy input. We consider different noise distributions (one type of noise, combination of noise types) and mismatched noise distributions for training and testing. Moreover, we empirically evaluate the robustness of different models (convolutional neural networks, recurrent neural networks, non-neural models), different basic units (characters, byte pair encoding units), and different NLP tasks (morphological tagging, machine translation).
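An illustrative helper for injecting character-level noise of the kinds alluded to above (adjacent swaps, scrambling of inner characters, random flips); the noise types and their distributions here are assumptions for demonstration only, not the paper's exact setup.

```python
import random

def perturb(word, noise_type, rng=None):
    """Inject character-level noise into a word (hypothetical helper)."""
    rng = rng or random.Random(0)
    if len(word) < 4:
        return word
    if noise_type == "swap":                      # swap two adjacent inner characters
        i = rng.randrange(1, len(word) - 2)
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if noise_type == "scramble":                  # shuffle all inner characters
        inner = list(word[1:-1])
        rng.shuffle(inner)
        return word[0] + "".join(inner) + word[-1]
    if noise_type == "flip":                      # replace one character at random
        i = rng.randrange(len(word))
        return word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i + 1:]
    return word

print([perturb(w, "scramble") for w in "word scrambling or random noise".split()])
```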
- Philip Hake, Peter Fettke, Günter Neumann, Peter Loos (2017) Extracting Business Objects and Activities from Labels of German Process Models. In proceedings of Designing the Digital Transformation - 12th International Conference, DESRIST 2017, Karlsruhe, Germany, May 30 - June 1, 2017.
To automatically analyze and compare elements of process models, investigating the natural language contained in the labels of the process models is inevitable. Therefore, the adaptation of well-established techniques from the field of natural language processing to Business Process Management has recently grown. Our work contributes to the field of natural language processing in business process models by providing a word dependency-based technique for the extraction of business objects and activities from German labeled process models. Furthermore, we evaluate our approach by implementing it in the RefMod-Miner toolset and measuring the quality of the information extraction in business process models. In three different evaluation scenarios, we show the strengths of the dependency-based approach and give an outlook on how further research could benefit from the approach.
Appears in: Alexander Maedche, Jan vom Brocke, Alan Hevner (eds.): Designing the Digital Transformation, volume 10243, Lecture Notes in Computer Science, Pages 21-38, Karlsruhe, Germany, Springer International Publishing, 2017.
- Georg Heigold, Günter Neumann and Josef van Genabith (2017) An Extensive Empirical Evaluation of Character-Based Morphological Tagging for 14 Languages. In proceedings of EACL-2017.
This paper investigates neural character-based morphological tagging for languages with complex morphology and large tag sets. Character-based approaches are attractive as they can handle rarely- and unseen words gracefully. We evaluate on 14 languages and observe consistent gains over a state-of-the-art morphological tagger across all languages except for English and French, where we match the state-of-the-art. We compare two architectures for computing character-based word vectors using recurrent (RNN) and convolutional (CNN) nets. We show that the CNN based approach performs slightly worse and less consistently than the RNN based approach. Small but systematic gains are observed when combining the two architectures by ensembling.
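A minimal sketch of a CNN-based character-level word encoder of the kind compared above: embed characters, convolve, and max-pool over positions to obtain a word vector; the layer sizes and details are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    """Character-based word vector: embed characters, run a 1-D convolution,
    and max-pool over positions (the CNN variant of the comparison above)."""
    def __init__(self, num_chars=128, char_dim=32, word_dim=64, kernel_size=3):
        super().__init__()
        self.char_embed = nn.Embedding(num_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, word_dim, kernel_size, padding=1)

    def forward(self, char_ids):                            # (batch_of_words, max_word_len)
        x = self.char_embed(char_ids).transpose(1, 2)       # (batch, char_dim, len)
        return torch.relu(self.conv(x)).max(dim=2).values   # (batch, word_dim)

encoder = CharCNNWordEncoder()
word_vectors = encoder(torch.randint(0, 128, (5, 12)))      # 5 words of up to 12 characters
```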
- Georg Heigold, Günter Neumann and Josef van Genabith (2016) Scaling Character-Based Morphological Tagging to Fourteen Languages. Proceedings of the IEEE International Conference on Big Data, Washington, DC, USA, IEEE, 12/2016.
This paper investigates neural character-based morphological tagging for languages with complex morphology and large tag sets. Character-based approaches are attractive as they can handle rarely- and unseen words gracefully. More specifically, besides a rich morphology, non-canonical language, change of language or other linguistic variability can heavily degrade the accuracy of natural language processing of web and CMC data. We evaluate on 14 languages and observe consistent gains over a state-of-the-art morphological tagger across all languages except for English and French, where we match the state-of-the-art. The gains are clearly correlated with the amount of training data. We present supplementary experiments to explore whether and to what extent unsupervised data through pre-trained word vectors can compensate for limited amounts of supervised data. Moreover, we show preliminary results to study the effect of noisy input data by flipping characters at random.
- Stalin Varanasi and Günter Neumann (2016) Question/Answer Matching for Yahoo! Answers Using a Corpus-Based Extracted Ngram-based Mapping. The Twenty-Fourth Text REtrieval Conference (TREC 2015) Proceedings, NIST Special Publication: SP 500-319, Gaithersburg, MD, USA, NIST, 2/2016.
This report describes the work done by the QA group of the Multilingual Technologies Lab at DFKI for the 2015 edition of the TREC LiveQA track. We describe the system, issues faced and the approaches followed considering the time lines of the track.
- Alejandro Figueroa and Günter Neumann (2016) Context-Aware Semantic Classification of Search Queries for Browsing Community Question-Answering Archives. In Knowledge-Based Systems, Volume 96, pp. 1-13 (15 March 2016).
Community question answering (cQA) platforms, like Yahoo! Answers, provide standard search APIs to browse past questions and answers. Since the largest cQA services maintain massive sets of resolved questions, effective methods to revitalize the information contained in their archives are becoming more and more important for serving the needs of their members as promptly and as reliably as possible.
In this paper, we present a novel strategy for effectively browsing cQA archives. The core idea is to induce the semantic classes of question-like search queries (e.g., “rib pain after ovulation” and “iron oxide household”) by means of the contextual information set up or represented by inferred views of their respective search sessions, namely views modelling previous queries entered by the same user.
When searching cQA archives, members do not associate semantic classes with their queries, so we consider the cQA service as a knowledge base that defines a taxonomy of semantic classes for its posted questions, which provides an explicit mapping between search queries and these questions. Following a supervised learning approach, we investigate and analyse the most salient features that are necessary to automatically exploit this relevant cQA mapping.
We carried out a large number of experiments using a rich set of attributes extracted from an automatically acquired big dataset from Yahoo! Answers. Our results confirm what was often only intuitively assumed, namely, that larger contexts actually help to detect semantic relations implicitly expressed in sequences of queries submitted by users during their search sessions. In particular, we discover that Explicit Semantic Analysis is extremely helpful for inferring discriminative semantic cues that reduce, and thus determine, the semantic range of question-like search queries. Conversely, constructing traditional bag-of-words models on top of prior queries in the session was detrimental.
Keywords: Community question answering; semantic classification of question-like search queries; large-scale feature selection for search session analysis; Explicit Semantic Analysis.
- Bernardo Magnini, Ido Dagan, Günter Neumann, and Sebastian Pado (2014) Entailment Graphs for Text Analytics in the Excitement Project. Proceedings of the 17th International Conference on Text, Speech and Dialogue (TSD-2014), Brno, Czech Republic, September 8–12, 2014.
In recent years, a relevant research line in Natural Language Processing has focused on detecting semantic relations among portions of text, including entailment, similarity, temporal relations and, to a lesser degree, causality. The attention on such semantic relations has raised the demand to move towards more informative meaning representations, which express properties of concepts and relations among them. This demand triggered research on "statement entailment graphs", where nodes are natural language statements (propositions), comprising predicates with their arguments and modifiers, while edges represent entailment relations between nodes.
We report initial research that defines the properties of entailment graphs and their potential applications. Particularly, we show how entailment graphs are profitably used in the context of the European project EXCITEMENT, where they are applied for the analysis of customer interactions across multiple channels, including speech, email, chat and social media, and multiple languages (English, German, Italian).
- Günter Neumann, Gerhard Paaß, and David van den Akker (2014) Linguistics to Structure Unstructured Information. Towards the Internet of Services: The THESEUS Program, in Wolfgang Wahlster; Hans-Joachim Grallert; Stefan Wess; Hermann Friedrich; Thomas Widenka (eds), Springer International Publishing Switzerland, ISBN 978-3-319-06755-1, pp. 383-392, 2014.
The extraction of semantics from unstructured documents requires the recognition and classification of textual patterns, their variability and their inter-relationships, i.e. the analysis of the linguistic structure of documents. Being an integral part of a larger real-life application, this linguistic analysis process must be robust, fast and adaptable. This creates a big challenge for the development of the necessary linguistic base components. In this drill-down we present several dimensions of this challenge and show how they have been successfully tackled in ORDO.
- Kathrin Eichler, Aleksandra Gabryszak, Günter Neumann (2014) An analysis of textual inference in German customer emails. Proceedings of the Third Joint Conference on Lexical and Computational Semantics, Dublin, 23/24 August 2014.
Human language allows us to express the same meaning in various ways. Recognizing that the meaning of one text can be inferred from the meaning of another can be of help in many natural language processing applications. One such application is the categorization of emails. In this paper, we describe the analysis of a real-world dataset of manually categorized customer emails written in the German language. We investigate the nature of textual inference in this data, laying the ground for developing an inference-based email categorization system. This is the first analysis of this kind on German data. We compare our results to previous analyses on English data and present major differences.
- Bernardo Magnini, Roberto Zanoli, Ido Dagan, Kathrin Eichler, Günter Neumann, Tae-Gil Noh, Sebastian Pado, Asher Stern, and Omer Levy (2014) The Excitement Open Platform for Textual Inferences. The 52nd Annual Meeting of the Association for Computational Linguistics (ACL-2014), demo paper, Baltimore, USA, 2014.
This paper presents the Excitement Open Platform (EOP), a generic architecture and a comprehensive implementation for textual inference in multiple languages. The platform includes state-of-art algorithms, a large number of knowledge resources, and facilities for experimenting and testing innovative approaches. The EOP is distributed as an open source software.
- Alejandro Figueroa and Günter Neumann (2014) Category-specific models for ranking effective paraphrases in community Question Answering. Expert Systems With Applications, Volume 41, Issue 10, August 2014, Pages 4730–4742.
Highlights
- Subjective and objective nature of cQA questions affect selection of past answers.
- Information learned from category-specific paraphrases improves ranking for cQA.
- Experiments with big data from Yahoo! Answers and Yahoo! Search logs.
- We conduct experiments on fine-grained question categories from Yahoo! Answers.
Platforms for community-based Question Answering (cQA) are playing an increasing role in the synergy of information-seeking and social networks. Being able to categorize user questions is very important, since these categories are good predictors for the underlying question goal, viz. informational or subjective. Furthermore, an effective cQA platform should be capable of detecting similar past questions and relevant answers, because it is known that a high number of best answers are reusable. Therefore, question paraphrasing is not only a useful but also an essential ingredient for effective search in cQA. However, the generated paraphrases do not necessarily lead to the same answer set, and might differ in their expected quality of retrieval, for example, in their power of identifying and ranking best answers higher.
We propose a novel category-specific learning to rank approach for effectively ranking paraphrases for cQA. We describe a number of different large-scale experiments using logs from Yahoo! Search and Yahoo! Answers, and demonstrate that the subjective and objective nature of cQA questions dramatically affect the recall and ranking of past answers, when fine-grained category information is put into its place. Then, category-specific models are able to adapt well to the different degree of objectivity and subjectivity of each category, and the more specific the models are, the better the results, especially when benefiting from effective semantic and syntactic features.
- Günter Neumann and Sven Schmeier (2013) Strategies for Guided Exploratory Search on the Mobile Web. Knowledge Discovery, Knowledge Engineering and Knowledge Management. 4th International Joint Conference, IC3K 2012, Barcelona, Spain, October 4-7, 2012. Revised Selected Papers Series: Communications in Computer and Information Science, Vol. 415, 2013.
We propose and develop new innovative methods for guided exploratory search on the mobile web. The approach has been fully implemented in a system called MobEx on a tablet, i.e. an Apple iPad, and on a mobile device/phone, i.e. Apple iPhone or iPod. Starting from a user's search query, a set of web snippets is collected by a standard search engine in a first step. After that, the snippets are collected into one document from which the topic graph is computed. This topic graph is presented to the user in different touchable and interactive graphical representations depending on the screen size of the mobile device. However, due to possible semantic ambiguities in the search queries, the snippets may cover different thematic areas, and so the topic graph may contain associated topics for different semantic entities of the original query. This may lead the user in wrong directions while exploring the solution space. Hence we present our approach for an interactive disambiguation of the search query, and so we provide assistance for the users towards a guided exploratory search.
- Alejandro Figueroa and Günter Neumann (2013) Exploiting User Search Sessions for the Semantic Categorization of Question-like Informational Search Queries. 6th International Joint Conference on Natural Language Processing (IJCNLP 2013), Nagoya, Japan, October, 2013.
This work proposes to semantically classify question-like search queries (e.g., “oil based heel creams”) based on the context yielded by preceding search queries in the same user session. Our novel approach is promising as our initial results show that the classification accuracy improved in congruence with the number of previous queries used to model the question context.
- Alejandro Figueroa and Günter Neumann (2013) Learning to Rank Effective Paraphrases from Query Logs for Community Question Answering. Twenty Seventh Association for Advancement of Artificial Intelligence Conference (AAAI 2013), Bellevue, Washington, July, 2013.
We present a novel method for ranking query paraphrases for effective search in community question answering (cQA). The method uses query logs from Yahoo! Search and Yahoo! Answers for automatically extracting a corpus of paraphrases of queries and questions using the query-question click history. Elements of this corpus are automatically ranked according to recall and mean reciprocal rank, and then used for learning two independent learning to rank models (SVMRank), whereby a set of new query paraphrases can be scored according to recall and MRR. We perform several automatic evaluation procedures using cross-validation for analyzing the behavior of various aspects of our learned ranking functions, which show that our method is useful and effective for search in cQA.
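A small sketch of mean reciprocal rank, one of the two target measures used above to score paraphrases by the quality of the answers they retrieve; retrieval recall can be computed analogously over the same ranked lists.

```python
def mean_reciprocal_rank(ranked_answer_lists, relevant_sets):
    """MRR over retrieved answer lists: 1/rank of the first relevant answer,
    averaged over all queries (paraphrases)."""
    total = 0.0
    for ranked, relevant in zip(ranked_answer_lists, relevant_sets):
        rr = 0.0
        for rank, answer_id in enumerate(ranked, start=1):
            if answer_id in relevant:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_answer_lists)

# two paraphrases of the same question, scored against the known best answers
print(mean_reciprocal_rank([["a3", "a1", "a7"], ["a9", "a2"]],
                           [{"a1"}, {"a5"}]))   # (1/2 + 0) / 2 = 0.25
```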
- Günter Neumann and Sven Schmeier (2013) MobEx - a System for Exploratory Search on the Mobile Web. The Agents and Artificial Intelligence, Revised Selected Papers, Series: Communications in Computers and Information Science, volume 358 Pages 116-130, Springer, 3/2013.
We present MobEx, a mobile touchable application for exploratory search on the mobile web. The system has been implemented for operation on a tablet computer, i.e. an Apple iPad, and on a mobile device, i.e. Apple iPhone or iPod touch. Starting from a topic issued by the user the system collects web snippets that have been determined by a standard search engine in a first step and extracts associated topics to the initial query in an unsupervised way on-demand and highly performant. This process is recursive in principle as it furthermore determines other topics associated to the newly found ones and so forth. As a result MobEx creates a dense web of associated topics that is presented to the user as an interactive topic graph. We consider the extraction of topics as a specific empirical collocation extraction task where collocations are extracted between chunks combined with the cluster descriptions of an online clustering algorithm. Our measure of association strength is based on the pointwise mutual information between chunk pairs which explicitly takes their distance into account. These syntactically-oriented chunk pairs are then semantically ranked and filtered using the cluster descriptions created by a Singular Value Decomposition (SVD) approach. An initial user evaluation shows that this system is especially helpful for finding new interesting information on topics about which the user has only a vague idea or even no idea at all.
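A toy sketch of a distance-aware pointwise mutual information score between chunk pairs, as one plausible reading of the association measure described above; the actual weighting, windowing, and filtering used in MobEx may differ.

```python
import math
from collections import Counter

def distance_weighted_pmi(snippets):
    """Association strength between chunk pairs: pointwise mutual information
    where each co-occurrence is down-weighted by the distance (in positions)
    between the two chunks. Illustrative assumption, not the system's formula."""
    unigram, pair = Counter(), Counter()
    for chunks in snippets:                           # each snippet = ordered list of chunks
        for i, a in enumerate(chunks):
            unigram[a] += 1
            for j in range(i + 1, len(chunks)):
                pair[(a, chunks[j])] += 1.0 / (j - i)
    total = sum(unigram.values())
    scores = {}
    for (a, b), weight in pair.items():
        p_ab = weight / total
        p_a, p_b = unigram[a] / total, unigram[b] / total
        scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return scores

snippets = [["barack obama", "president", "united states"],
            ["barack obama", "united states"]]
print(distance_weighted_pmi(snippets)[("barack obama", "united states")])
```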
- Alexander Volokh and Günter Neumann (2012) Parsing Hindi with MDParser, Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages (MTPIL-2012), pages 149–154, COLING 2012, Mumbai, December 2012.
We describe our participation in the MTPIL Hindi Parsing Shared Task-2012. Our system achieved the following results: 82.44% LAS/90.91% UAS (auto) and 85.31% LAS/92.88% UAS (gold). Our parser is based on linear classification, which is suboptimal as far as accuracy is concerned. The strong point of our approach is its speed. For parsing the development data the system requires 0.935 seconds, which corresponds to a parsing speed of 1318 sentences per second. The Hindi Treebank contains far fewer distinct part-of-speech tags than many other treebanks, and therefore it was absolutely necessary to use the additional morphosyntactic features available in the treebank. We were able to build classifiers predicting those, using only the standard word form and part-of-speech features, with high accuracy.
- Amir H. Moin and Günter Neumann (2012) Assisting Bug Triage in Large Open Source Projects Using Approximate String Matching. The Seventh International Conference on Software Engineering Advances (ICSEA 2012), Lisbon, Portugal, November, 2012.
In this paper, we propose a novel approach for assisting human bug triagers in large open source software projects by semi-automating the bug assignment process. Our approach employs a simple and efficient n-gram-based algorithm for approximate string matching at the character level. We propose and implement a recommender prototype which collects the natural language textual information available in the summary and description fields of previously resolved bug reports and classifies that information into a number of separate inverted lists with respect to the resolver of each issue. These inverted lists are considered as vocabulary-based expertise and interest models of the developers. Given a new bug report, the recommender creates all possible n-grams of the strings and evaluates their similarities to the available expertise models using a number of well-known string similarity measures, namely the Cosine, Dice, Jaccard and Overlap coefficients. Finally, the top three developers are recommended as proper candidates for resolving this new issue. Experimental results on 5200 bug reports of the Eclipse JDT project show a weighted average precision of 90.1% and a weighted average recall of 45.5%.
The paper received a best paper award.
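A compact sketch of the character n-gram similarity coefficients named in the abstract above, computed between a new bug report and a developer's (here simplified, single-string) expertise model; the real system works over inverted lists rather than raw strings, so this is an illustrative assumption.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def similarities(bug_report, expertise_model, n=3):
    """Cosine, Dice, Jaccard and Overlap coefficients over character n-gram sets."""
    a, b = char_ngrams(bug_report, n), char_ngrams(expertise_model, n)
    inter = len(a & b)
    return {
        "cosine": inter / ((len(a) * len(b)) ** 0.5),
        "dice": 2 * inter / (len(a) + len(b)),
        "jaccard": inter / len(a | b),
        "overlap": inter / min(len(a), len(b)),
    }

print(similarities("NullPointerException in JDT core refactoring",
                   "refactoring crashes with NullPointerException in core"))
```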
- Carolin Shihadeh and Günter Neumann (2012) ARNE - A Tool for Named Entity Recognition from Arabic Text. The Fourth Workshop on Computational Approaches to Arabic Script-based Languages (AMTA 2012), San Diego, CA, USA, 2012.
In this paper, we study the problem of finding named entities in Arabic text. For this task we present the development of our pipeline software for Arabic named entity recognition (ARNE), which includes tokenization, morphological analysis, Buckwalter transliteration, part of speech tagging and named entity recognition of person, location and organisation named entities. In our first attempt to recognize named entities, we have used a simple, fast and language independent gazetteer lookup approach. In our second attempt, we have used the morphological analysis provided by our pipeline to remove affixes and hence observed an improvement in our performance. The pipeline presented in this paper can be used in the future as a basis for a named entity recognition system that recognizes named entities not only using gazetteers, but also making use of morphological information and part of speech tagging.
- Günter Neumann and Sven Schmeier (2012) Guided Exploratory Search on the Mobile Web. 4th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (KDIR-2012), Barcelona, Spain, October, 2012.
We present a mobile touchable application for guided exploration of web content and online topic graph extraction that has been successfully implemented on a tablet, i.e. an Apple iPad, and on a mobile device/phone, i.e. Apple iPhone or iPod. Starting from a user's search query, a set of web snippets is collected by a standard search engine in a first step. After that, the snippets are collected into one document from which the topic graph is computed. This topic graph is presented to the user in different touchable and interactive graphical representations depending on the screen size of the mobile device. However, due to possible semantic ambiguities in the search queries, the snippets may cover different thematic areas, and so the topic graph may contain associated topics for different semantic entities of the original query. This may lead the user in wrong directions while exploring the solution space. Hence we present our approach for an interactive disambiguation of the search query, and so we provide assistance for the users towards a guided exploratory search.
The paper was shortlisted for the best paper award.
- Günter Neumann and Sven Schmeier (2012) Interactive Topic Graph Extraction and Exploration of Web Content. Poibeau, T.; Saggion, H.; Piskorski, J.; Yangarber, R. (Eds.) Multi-source, Multilingual Information Extraction and Summarization, Springer, Series: Theory and Applications of Natural Language Processing, June, 2012.
In the following, we present an approach using interactive topic graph extraction for the exploration of web content. The initial information request, in the form of a query topic description, is issued online by a user to the system. The topic graph is then constructed from N web snippets that are produced by a standard search engine. We consider the extraction of a topic graph to be a specific empirical collocation extraction task, where collocations are extracted between chunks. Our measure of association strength is based on the pointwise mutual information between chunk pairs which explicitly takes their distance into account. This topic graph can then be further analyzed by users so that they can request additional background information with the help of interesting nodes and pairs of nodes in the topic graph, e.g., explicit relationships extracted from Wikipedia or those automatically extracted from additional Web content as well as conceptual information of the topic in the form of semantically oriented clusters of descriptive phrases. This information is presented to the users, who can investigate the identified information nuggets to refine their information search. An initial user evaluation shows that our approach is especially helpful for finding new interesting information on topics about which the user has only a vague idea or no idea at all.
- Günter Neumann and Sven Schmeier (2012) Exploratory Search on the Mobile Web. 4th International Conference on Agents and Artificial Intelligence (ICAART 2012), Vilamoura, Algarve, Portugal.
We present a mobile touchable application for online topic graph extraction and exploration of web content. The system has been implemented for operation on a tablet computer, i.e. an Apple iPad, and on a mobile device, i.e. Apple iPhone or iPod touch. The topics are extracted from web snippets which are determined by a standard search engine. We consider the extraction of topics as a specific empirical collocation extraction task where collocations are extracted between chunks combined with the cluster descriptions of an online clustering algorithm. Our measure of association strength is based on the pointwise mutual information between chunk pairs which explicitly takes their distance into account. These syntactically-oriented chunk pairs are then semantically ranked and filtered using the cluster descriptions. An initial user evaluation shows that this system is especially helpful for finding new interesting information on topics about which the user has only a vague idea or even no idea at all.
- Alexander Volokh and Günter Neumann (2012) Extending Dependency Treebanks with Good Sentences, 11th Conference on Natural Language Processing (Konvens-2012), Vienna, Austria, September, 2012.
For many resource-poor languages additional annotated data would be beneficial. However, the annotation process is tedious and expensive. We propose a metric for selecting the most promising sentences for annotation. Annotating only good sentences saves time and would allow better results to be achieved even with a smaller amount of annotated data. We demonstrate how our method works on the example of parsing a Finnish dependency treebank with MaltParser.
- Alexander Volokh and Günter Neumann (2012) Transition-based Dependency Parsing with Efficient Feature Extraction, 35th German Conference on Artificial Intelligence (KI-2012), Saarbrücken, Germany, September, 2012.
The fastest parsers currently reported in the literature can parse an average sentence in up to 2.5 ms, a considerable improvement, since most of the older accuracy-oriented parsers parse only a few sentences per second. It is generally accepted that the complexity of a parsing algorithm is decisive for the performance of a parser. However, we show that the most time-consuming part of processing is feature extraction, and therefore an algorithm which allows efficient feature extraction can outperform a less complex algorithm which does not. Our system, based on Covington's quadratic parsing strategy with efficient feature extraction, is able to parse an average English sentence in only 0.8 ms without any parallelisation.
- Alexander Volokh and Günter Neumann (2012) Task-oriented Dependency Parsing Evaluation Methodology, IEEE 13th International Conference on Information Reuse and Integration, IEEE Systems, Man, and Cybernetics Society (SMC), 2012.
Traditional parser evaluation with attachment scores is not helpful for researchers who want to find the most suitable parser for their application: first, because it is done for a domain which is almost always different from the domain of the application, and second, because many of the tested dependencies are irrelevant for the application. The alternative, extrinsic evaluation, is problematic as well, since it is difficult to find a suitable data set and it is not straightforward to measure the quality of the parser in the context of a broader application. We propose a method which combines the strengths of attachment scores and extrinsic evaluation and avoids their weaknesses. We apply our approach to RTE-7 data in order to demonstrate how it works.
- Abdelhadi Soudi, Ali Farghaly, Günter Neumann and Rabih Zbib (Editors) (2012) Challenges for Arabic Machine Translation, Book series: Natural Language Processing, Vol. 9, 157 p., John Benjamins Publishing Company.
This book is the first volume that focuses on the specific challenges of machine translation with Arabic either as source or target language. It nicely fills a gap in the literature by covering approaches that belong to the three major paradigms of machine translation: Example-based, statistical and knowledge-based. It provides broad but rigorous coverage of the methods for incorporating linguistic knowledge into empirical MT. The book brings together original and extended contributions from a group of distinguished researchers from both academia and industry. It is a welcome and much-needed repository of important aspects in Arabic Machine Translation such as morphological analysis and syntactic reordering, both central to reducing the distance between Arabic and other languages. Most of the proposed techniques are also applicable to machine translation of Semitic languages other than Arabic, as well as translation of other languages with a complex morphology.
- Bogdan Sacaleanu and Günter Neumann (2012) An Adaptive Framework for Named Entity Combination. 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey.
We have developed a new OSGi-based platform for Named Entity Recognition (NER) which uses a voting strategy to combine the results produced by several existing NER systems (currently OpenNLP, LingPipe and Stanford). The different NER systems have been systematically decomposed and modularized into the same pipeline of preprocessing components in order to support a flexible selection and ordering of the NER processing flow. This highly modular and component-based design supports the possibility to set up different constellations of chained processing steps, including alternative voting strategies for combining the results of parallel running components.
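The platform itself is an OSGi pipeline; the voting step can be illustrated in isolation with a hedged sketch of majority voting over token-level labels. The tie-breaking by a trusted-system order is an assumption for illustration, not necessarily the strategy used in the paper.

```python
# Hedged sketch: combine per-token NE labels from several systems by voting.
from collections import Counter

def combine_by_voting(system_outputs, tie_breaker_order=None):
    """system_outputs: dict name -> list of BIO labels (same tokenisation).
    tie_breaker_order: optional list of system names, most trusted first."""
    names = list(system_outputs)
    length = len(next(iter(system_outputs.values())))
    assert all(len(v) == length for v in system_outputs.values())
    combined = []
    for i in range(length):
        votes = Counter(system_outputs[n][i] for n in names)
        top = max(votes.values())
        winners = [lab for lab, c in votes.items() if c == top]
        if len(winners) == 1 or tie_breaker_order is None:
            combined.append(winners[0])
        else:
            # break ties by preferring the most trusted system's label
            for n in tie_breaker_order:
                if system_outputs[n][i] in winners:
                    combined.append(system_outputs[n][i])
                    break
            else:
                combined.append(winners[0])
    return combined
```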
- Alexander Volokh and Günter Neumann (2011) Using MT-Based Metrics for RTE. The Fourth Text Analysis Conference (TAC 2011), Gaithersburg, MD, USA.
We analyse the complexity of the RTE task data and divide the T/H pairs into three different classes, depending on the type of knowledge required to solve the problem. We then propose an approach which is suitable for the easier two classes, which account for two thirds of all pairs. Our assumption is that T and H are translations of the same source sentence. We then use a metric for MT evaluation (Meteor) in order to judge the similarity of both translations. It is clear that in most cases when T entails H, T and H do not have exactly the same meaning. However, we can observe that the similarity is still much higher for positive T/H pairs than for negative pairs. We achieve a result of 46.34 macro-average F1-score for the task. On the one hand, this shows that our approach has its weaknesses, especially because our assumption that T and H contain the same meaning does not always hold, in particular if T and H have very different lengths. On the other hand, considering the fact that RTE-7 is a difficult class-imbalanced problem (<5% YES, >95% NO), this robust approach achieves a decent result for a large amount of data. It is above the median of this year's results and is comparable with the top results from the previous year.
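The decision rule reduces to thresholding an MT-style similarity score between T and H. As a hedged sketch, Meteor is not reimplemented here; a simple token-overlap F-score stands in for it, and the threshold is a hypothetical placeholder that would be tuned on development data.

```python
# Illustrative sketch of the thresholding idea behind MT-metric-based RTE.
def overlap_f1(text, hypothesis):
    t, h = set(text.lower().split()), set(hypothesis.lower().split())
    common = len(t & h)
    if not t or not h or common == 0:
        return 0.0
    precision, recall = common / len(h), common / len(t)
    return 2 * precision * recall / (precision + recall)

def judge_entailment(text, hypothesis, threshold=0.5):
    # "YES" if the surrogate similarity exceeds the (tunable) threshold
    return "YES" if overlap_f1(text, hypothesis) >= threshold else "NO"
```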
- Óscar Ferrández, Christian Spurk, Milen Kouylekov, Iustin Dornescu, Sergio Ferrández, Matteo Negri, Rubén Izquierdo, David Tomás, Constantin Orasan, Günter Neumann, Bernardo Magnini and Jose Luis Vicedo (2011) The QALL-ME Framework: A specifiable-domain multilingual Question Answering architecture. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, online version, doi:10.1016/j.websem.2011.01.002.
This paper presents the QALL-ME Framework, a reusable architecture for building multi- and cross-lingual Question Answering (QA) systems working on structured data modeled by an ontology. It is released as free open source software with a set of demo components and extensive documentation, which makes it easy to use and adapt. The main characteristics of the QALL-ME Framework are: (i) its domain portability, achieved by an ontology modeling the target domain; (ii) the context awareness regarding space and time of the question; (iii) the use of textual entailment engines as the core of the question interpretation; and (iv) an architecture based on Service Oriented Architecture (SOA), which is realized using interchangeable web services for the framework components. Furthermore, we present a running example to clarify how the framework processes questions as well as a case study that shows a QA application built as an instantiation of the QALL-ME Framework for cinema/movie events in the tourism domain.
Keywords: Question Answering; Textual entailment; Natural Language Interfaces; Multilingual environments
- Günter Neumann and Sven Schmeier (2011) A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content. The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL).
We present a mobile touchable application for online topic graph extraction and exploration of web content. The system has been implemented for operation on an iPad. The topic graph is constructed from N web snippets which are determined by a standard search engine. We consider the extraction of a topic graph as a specific empirical collocation extraction task where collocations are extracted between chunks. Our measure of association strength is based on the pointwise mutual information between chunk pairs which explicitly takes their distance into account. An initial user evaluation shows that this system is especially helpful for finding new interesting information on topics about which the user has only a vague idea or even no idea at all.
- Alexander Volokh and Günter Neumann (2011) Automatic Detection and Correction of Errors in Dependency Treebanks. The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL).
Annotated corpora are essential for almost all NLP applications. Although they are expected to be of very high quality because of their importance for follow-up developments, they still contain a considerable amount of errors. With this work we want to draw attention to this fact. Additionally, we try to estimate the amount of errors and propose a method for their automatic correction. Whereas our approach is able to find only a portion of the errors that we suppose are contained in almost any annotated corpus due to the nature of the process of its creation, it has a very high precision, and thus is in any case beneficial for the quality of the corpus it is applied to. Finally, we compare it to a different method for error detection in treebanks and find that the errors we are able to detect are mostly different and that the two approaches are complementary.
- Alexander Volokh, Günter Neumann and Bogdan Sacaleanu (2010) Combining Deterministic Dependency Parsing and Linear Classification for Robust RTE. Third Text Analysis Conference, Gaithersburg, MD, USA.
We present a robust RTE approach which is built as one module incorporating all possible knowledge sources in the form of different features. This way we can easily include or remove knowledge sources which are involved in the process of judging the entailment relation. We perform numerous tests in which we analyze the contribution of different types of features based on word forms, structural information, lexical semantics and named entity recognition to this process. The core of our system is our own deterministic dependency parser MDParser, which is based on a fast linear classification approach. We use the RTE-6 challenge as an opportunity to evaluate its performance in a real-world application against another state-of-the-art parser, MaltParser. In our official submissions we achieve an f-score of 39.81 with MaltParser and 38.26 with MDParser. However, the parsing speed with MDParser is 26 times higher.
- Kathrin Eichler and Günter Neumann (2010) Bootstrapping Noun Groups Using Closed-Class Elements Only. KDML 2010: Knowledge Discovery, Data Mining, and Machine Learning, Kassel, Germany.
The identification of noun groups in text is a well researched task and serves as a pre-step for other natural language processing tasks, such as the extraction of key phrases or technical terms. We present a first version of a noun group chunker that, given an unannotated text corpus, adapts itself to the domain at hand in an unsupervised way. Our approach is inspired by findings from cognitive linguistics, in particular the division of language into open-class elements and closed-class elements. Our system extracts noun groups using lists of closed-class elements and one linguistically inspired seed extraction rule for each open class. Supplied with raw text, the system creates an initial validation set for each open class based on the seed rules and applies a bootstrapping procedure to mutually expand the set of extraction rules and the validation sets. Possibly domain-dependent information about open-class elements, as for example provided by a part-of-speech lexicon, is not used by the system in order to ensure the domain independence of the approach. Instead, the system adapts itself automatically to the domain of the input text by bootstrapping domain-specific validation lists. An evaluation of our system on the Wall Street Journal training corpus used for the CONLL 2000 shared task on chunking shows that our bootstrapping approach can be successfully applied to the task of noun group chunking.
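The mutual expansion of extraction rules and validation sets follows a standard bootstrapping loop, which can be sketched as follows. The seed rules and the `induce_rules` callback (which derives new extraction rules from the contexts of validated items) are hypothetical stand-ins for the paper's linguistically motivated components.

```python
# Hedged sketch of the mutual bootstrapping loop over open-class validation sets.
def bootstrap(corpus_sentences, seed_rules, induce_rules, max_iterations=10):
    """corpus_sentences: list of token lists.
    seed_rules: dict open_class -> [rule], where a rule is a function
    (tokens, i) -> bool saying whether token i belongs to that class."""
    rules = {oc: list(rs) for oc, rs in seed_rules.items()}
    validation = {oc: set() for oc in rules}
    for _ in range(max_iterations):
        grown = False
        # apply the current rules to collect newly validated items
        for tokens in corpus_sentences:
            for i, tok in enumerate(tokens):
                for oc, oc_rules in rules.items():
                    if any(r(tokens, i) for r in oc_rules) and tok.lower() not in validation[oc]:
                        validation[oc].add(tok.lower())
                        grown = True
        if not grown:          # fixed point: no new items, stop
            break
        # expand the rule sets from the contexts of validated items
        for oc in rules:
            rules[oc].extend(induce_rules(corpus_sentences, oc, validation[oc]))
    return validation, rules
```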
- Kathrin Eichler and Günter Neumann (2010) DFKI KeyWE: Ranking key phrases extracted from scientific articles. K. Erk, C. Strapparava (eds.): Association for Computational Linguistics, SigLex event SemEval-2 Evaluation Exercises on Semantic Evaluation, Uppsala, Sweden.
A central issue for making the content of a scientific document quickly accessible to a potential reader is the extraction of key phrases, which capture the main topic of the document. Key phrases can be extracted automatically by generating a list of key phrase candidates, ranking these candidates, and selecting the top-ranked candidates as key phrases. We present the KeyWE system, which uses an adapted nominal group chunker for candidate extraction and a supervised ranking algorithm based on support vector machines for ranking the extracted candidates. The system was evaluated on data provided for the SemEval 2010 Shared Task on Key phrase Extraction.
- Alexander Volokh and Günter Neumann (2010) Comparing the Benefit of Different Dependency Parsers for Textual Entailment Using Syntactic Constraints Only. K. Erk, C. Strapparava (eds.): Association for Computational Linguistics, SigLex event SemEval-2 Evaluation Exercises on Semantic Evaluation, Uppsala, Sweden.
We compare several state-of-the-art dependency parsers with our own parser based on a linear classification technique. Our primary goal is to use syntactic information only, in order to keep the comparison of the parsers as fair as possible. We demonstrate that, despite inferior results on standard evaluation metrics for parsers such as UAS or LAS on standard test data, our system achieves comparable results when used in an application, such as the SemEval-2 Task #12 evaluation exercise PETE. Our submission achieved the 4th position out of 19 participating systems. However, since it only uses a linear classifier, it works 17-20 times faster than other state-of-the-art parsers, such as MaltParser or the Stanford Parser.
- Kathrin Eichler, Holmer Hemsen, Günter Neumann, Norbert Reithinger, Sven Schmeier, Kinga Schumacher and Inessa Seifert (2010) DiLiA - The Digital Library Assistant. M. Lalmas J. Jose, A. Rauber, F. Sebastiani, I. Frommholz (eds.): Proceedings of the European Conference on Research and Advanced Technology for Digital Libraries (ECDL) 2010, Glasgow, United Kingdom, Springer.
In this paper we present the digital library assistant (DiLiA). The system aims at augmenting the search in digital libraries in several dimensions. In the project, advanced information visualization methods are developed for user-controlled interactive search. The interaction model has been designed in a way that is transparent to the user and easy to use. In addition, information extraction (IE) methods have been developed in DiLiA to make the content more easily accessible; this includes the identification and extraction of technical terms (TTs) – single- and multi-word terms – as well as the extraction of binary relations based on the extracted terms. In DiLiA we follow a hybrid information extraction approach – a combination of metadata and document processing.
- Günter Neumann and Berthold Crysmann (2010) Extracting Supertags from HPSG-based Tree Banks. Srinivas Bangalore and Aravind Joshi (eds): Supertagging - Using Complex Lexical Descriptions in Natural Language Processing, Pages 313-335, MIT Press, Cambridge, Massachusetts.
We describe a method for the automatic extraction of a Stochastic Lexicalized Tree Insertion Grammar from a linguistically rich HPSG Treebank. The extraction method is strongly guided by HPSG–based head and argument decomposition rules. The tree anchors correspond to lexical labels encoding fine–grained information. The approach has been tested with a German corpus achieving a labeled recall of 77.33% and labeled precision of 78.27%, which is competitive to recent results reported for German parsing using the Negra Treebank.
- Rui Wang, Yi Zhang and Günter Neumann (2009) A Joint Syntactic-Semantic Representation for Recognizing Textual Relatedness. Second Text Analysis Conference, Gaithersburg, MD, USA.
This paper describes our participation in the Recognizing Textual Entailment challenge (RTE-5) in the Text Analysis Conference (TAC 2009). Following the two-stage binary classification strategy, our focus this year is to recognize related Text-Hypothesis pairs instead of entailment pairs. In particular, we propose a joint syntactic-semantic representation to better capture the key information shared by the pair, and also apply a co-reference resolver to group cross-sentential mentions of the same entities together. For the evaluation, we achieve 63.7% of accuracy on the three-way test, 68.5% on the entailment vs. non-entailment test, and 74.3% on the relatedness recognition. Based on the error analysis, we will work on differentiating entailment and contradiction in the future.
Our results ranked second best in the TAC 2009 competition.
- Günter Neumann (2009) Text-basiertes Informationsmanagement. Carstensen, K.-U.; C. Ebert; C. Endriss; S. Jekat; R. Klabunde & H. Langer (eds.) (2009) Computerlinguistik und Sprachtechnologie. Eine Einführung. 3rd, revised and extended edition, Heidelberg: Spektrum Akademischer Verlag, Pages 576-615.
In the field of language technology, a number of concepts and technologies have been explored and developed in recent years for the targeted, content-oriented analysis of very large amounts of text, in particular in the areas of text mining, information extraction and integration, semantic search and question answering, and information presentation. Currently, a strong convergence between these areas can be observed, insofar as many partial solutions share a common origin, such as linguistic feature extraction or the identification of relevant entities and of semantic relations between them. The subject of this article is to demonstrate and describe this convergence under the label "text-based information management" (TIM).
- Alejandro Figueroa, Günter Neumann and John Atkinson (2009) Searching for Definitional Answers on the Web using Surface Patterns. Journal IEEE Computer volume 42 number 4, Pages 68-76, IEEE, 4/2009.
Most definitional question answering systems integrate statistical methods along with external resources to retrieve relevant sentences for further processing. However, apart from some interesting approaches to query/answer rephrasing, these methods do not take advantage of the different surface forms of a query. This paper proposes a novel approach to searching for definitional answers on the web which employs query rewriting techniques in order to increase the probability of extracting the nuggets from various web snippets by matching surface patterns. The quality of the obtained snippets is assessed by sense disambiguation and definition ranker components which, together with clustering methods, allow the candidate snippets to be ranked. This approach substantially boosts the extraction of nuggets directly from web snippets and hence avoids the costly fetching and processing of huge amounts of documents. Different experiments on the Web, including comparisons with other competitive approaches, show the promise of the approach for answering definition questions from the Web.
- Inessa Seifert, Kathrin Eichler, Holmer Hemsen, Sven Schmeier, Michael Kruppa, Norbert Reithinger and Günter Neumann (2009) DILIA - A DIGITAL LIBRARY ASSISTANT: A new approach to information discovery through information extraction and visualization. IC3K - The International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, Madeira, Portugal.
This paper presents preliminary results of our current research project DiLiA (Digital Library Assistant). The goals of the project are twofold. One goal is the development of domain-independent information extraction methods. The other is the development of information visualization methods that interactively support researchers in time-consuming information discovery tasks. We first describe issues that contribute to high cognitive load during exploration of unfamiliar research domains. Then we present a domain-independent approach to technical term extraction from paper abstracts, describe the architecture of DiLiA, and illustrate an example co-author network visualization.
- Kathrin Eichler, Holmer Hemsen and Günter Neumann (2009) Unsupervised and domain-independent extraction of technical terms from scientific articles in digital libraries. Workshop Information Retrieval 2009 organized as part of LWA, Darmstadt, 2009.
A central issue for making the contents of documents in a digital library accessible to the user is the identification and extraction of technical terms. We propose a method to solve this task in an unsupervised, domain-independent way: We use a nominal group chunker to extract term candidates and select the technical terms from these candidates based on string frequencies retrieved using the MSN search engine.
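The abstract leaves the exact selection criterion open, so the following is only a hedged sketch of one plausible instantiation: candidates that are frequent in the local corpus but comparatively rare on the web are kept as technical terms. `web_frequency` is a hypothetical lookup (the paper used the MSN search engine), and the ratio threshold is likewise illustrative.

```python
# Minimal sketch of the web-frequency-based selection step for term candidates.
def select_technical_terms(candidates, web_frequency, threshold=1e-6):
    """candidates: dict phrase -> local corpus frequency.
    web_frequency: callable phrase -> estimated web hit count."""
    terms = []
    for phrase, local_freq in candidates.items():
        web_freq = web_frequency(phrase)        # hits for the full phrase
        if web_freq == 0:
            continue
        # locally frequent but globally rare phrases are treated as
        # domain-specific technical terms
        if local_freq / web_freq >= threshold:
            terms.append(phrase)
    return terms
```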
- Rui Wang and Günter Neumann (2008) Adapting QA Components to Mine Answers in Speech Transcripts. Advances in Multilingual and Multimodal Information Retrieval Lecture Notes in Computer Science Volume 5152, 2008, pp 410-413.
The paper describes QAst-v1, a robust question answering system for answering factoid questions in manual and automatic transcriptions of speech. The system is an adaptation of our text-based cross-lingual open-domain QA system that we used for the CLEF main tasks.
- Rui Wang and Günter Neumann (2008) An Accuracy-Oriented Divide-and-Conquer Strategy for Recognizing Textual Entailment. First Text Analysis Conference, Gaithersburg, MD, USA.
This paper describes our participation in the Recognizing Textual Entailment challenge this year. Based on our promising results in the RTE-3 challenge last year (66.9% accuracy) using a precision-oriented, puristic syntactic approach (puristic in the sense that we only performed dependency parsing), we explored further extensions of this perspective. In particular, we developed more specialized RTE modules to tackle more cases (i.e., entailment pairs) while trying to keep accuracy high.
Our results ranked third best in the TAC 2008 competition.
- Benjamin Adrian, Günter Neumann, Alexander Troussov and Borislav Popov (Editors) (2008) OBIES 2008 - Ontology-based Information Extraction Systems. Proceedings of the KI-08 Workshop on Ontology-based Information Extraction Systems (OBIES 2008), Kaiserslautern, Germany, September 23, 2008.
More and more information extraction (IE) systems use ontologies for extraction tasks. These systems use knowledge representation techniques for extracting information from unstructured or semi-structured domains more efficiently. The main advantages of these procedures are an increase in the quality of IE templates, reusability, and maintainability. Ontologies in IE may provide new techniques for supporting open tasks of semantic analysis, regarding for instance temporal analysis, resolution of contradictions, or context awareness. There are several open research topics in ontology-based information extraction, for instance a proven architecture, evaluation guidelines regarding the use of ontologies, or ontologies vs. templates.
This volume contains the papers presented at OBIES 2008: 1st Workshop on Ontology-based Information Extraction Systems, held at the 31st edition of the Annual German Conference on Artificial Intelligence (KI 2008) in Kaiserslautern.
There were 5 submissions. Each submission was reviewed by at least 3, and on the average 3.2, program committee members. The committee decided to accept 4 papers.
- Günter Neumann (2008) A Computational Linguistics Perspective on the Anticipatory Drive. Commentary on the target article by Martin V. Butz. Journal Constructivist Foundations, November 2008, Vol. 4., No. 1, 26-28.
In this commentary to Martin V. Butz's target article I am especially concerned with his remarks about language (§33, §§71-79, §91) and modularity (§32, §41, §48, §81, §§94-98). In that context, I would like to bring into the discussion my own work on computational models of self-monitoring (cf. Neumann 1998, 2004). In this work I explore the idea of an anticipatory drive as a substantial control device for modeling high-level complex language processes such as self-monitoring and adaptive language use. My work is grounded in computational linguistics and, as such, uses a mathematical and computational methodology. Nevertheless, it might provide some interesting aspects and perspectives for constructivism in general, and for the model proposed in Butz's article.
- Cabrio, E., Kouylekov, M., Magnini, B., Negri, M., Hasler, L., Orăsan, C., Tomás, D., Vicedo, J. L., Neumann, G. and Weber, C. (2008) The QALL-ME Benchmark: a Multilingual Resource of Annotated Spoken Requests for Question Answering. LREC 2008, Marrakesh, Morocco.
This paper presents the QALL-ME benchmark, a multilingual resource of annotated spoken requests in the tourism domain, freely available for research purposes. The languages currently involved in the project are Italian, English, Spanish and German. It introduces a semantic annotation scheme for spoken information access requests, specifically derived from Question Answering (QA) research. In addition to pragmatic and semantic annotations, we propose three QA-based annotation levels: the Expected Answer Type, the Expected Answer Quantifier and the Question Topical Target of a request, to fully capture the content of a request and extract the sought-after information. The QALL-ME benchmark is developed under the EU-FP6 QALL-ME project which aims at the realization of a shared and distributed infrastructure for Question Answering (QA) systems on mobile devices (e.g. mobile phones). Questions are formulated by the users in free natural language input, and the system returns the actual sequence of words which constitutes the answer from a collection of information sources (e.g. documents, databases). Within this framework, the benchmark has the twofold purpose of training machine learning based applications for QA, and testing their actual performance with a rapid turnaround in controlled laboratory setting.
- Alejandro Figueroa and Günter Neumann (2008) Genetic Algorithms for data-driven Web Question Answering. (Draft version) Journal Evolutionary Computation, Spring 2008, Vol. 16, No. 1: 127–147.
We present an evolutionary approach for the computation of exact answers to Natural Language (NL) questions. Answers are extracted directly from the N-best snippets, which have been identified by a standard web search engine using NL questions. The core idea of our evolutionary approach to web question answering is to search for those substrings in the snippets whose contexts are most similar to contexts of already known answers. This context model, together with the words mentioned in the NL question, is used to evaluate the fitness of answer candidates, which are initially randomly selected substrings from randomly selected sentences of the snippets. New answer candidates are then created by applying specialized operators for crossover and mutation, which either stretch or shrink the substring of an answer candidate or transpose the span to new sentences. Since we have no predefined notion of patterns, our context alignment methods are very dynamic and strictly data-driven. We assessed our system with seven different data sets of question/answer pairs. The results show that this approach is promising, especially when it deals with specific questions.
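As a hedged sketch of the evolutionary loop: candidates are spans over snippet sentences, fitness compares a candidate's context with contexts of known answers plus question-word overlap (passed in as a black-box function here), and mutation stretches, shrinks, or transposes the span. The crossover operator is omitted for brevity, and all parameters are illustrative stand-ins rather than the paper's exact settings.

```python
# Illustrative GA-style search for answer substrings in snippet sentences.
import random

def evolve_answers(sentences, fitness, generations=50, pop_size=40):
    """sentences: non-empty token lists from the snippets;
    fitness: function (tokens, start, end) -> float (higher is better)."""
    def random_candidate():
        toks = random.choice(sentences)
        start = random.randrange(len(toks))
        end = random.randint(start + 1, len(toks))
        return (toks, start, end)

    def mutate(cand):
        toks, start, end = cand
        if random.random() < 0.3:                  # transpose to a new sentence
            return random_candidate()
        start = max(0, min(len(toks) - 1, start + random.choice((-1, 0, 1))))
        end = max(start + 1, min(len(toks), end + random.choice((-1, 0, 1))))
        return (toks, start, end)

    population = [random_candidate() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda c: fitness(*c), reverse=True)
        survivors = population[: pop_size // 2]    # keep the fitter half
        offspring = [mutate(random.choice(survivors))
                     for _ in range(pop_size - len(survivors))]
        population = survivors + offspring
    toks, start, end = max(population, key=lambda c: fitness(*c))
    return " ".join(toks[start:end])
```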
- Alejandro Figueroa and Günter Neumann (2008) Finding Distinct Answers in Web Snippets. Fourth International Conference on Web Information Systems and Technologies, Pages 26-33, Funchal, Madeira, Portugal, INSTICC Press.
This paper presents ListWebQA, a question answering system aimed specifically at discovering answers to list questions in web snippets. ListWebQA retrieves snippets likely to contain answers by means of a query rewriting strategy, and extracts answers according to their syntactic and semantic similarities afterwards. These similarities are determined by means of a set of surface syntactic patterns and a Latent Semantic Kernel. Results show that our strategy is effective in strengthening current web question answering techniques.
- Kathrin Eichler, Holmer Hemsen and Günter Neumann (2008) Unsupervised Relation Extraction from Web Documents. LREC 2008, Marrakesh, Morocco.
The IDEX system is a prototype of an interactive dynamic Information Extraction (IE) system. A user of the system expresses an information request in the form of a topic description, which is used for an initial search in order to retrieve a relevant set of documents. On the basis of this set of documents, unsupervised relation extraction and clustering are performed by the system. The results of these operations can then be interactively inspected by the user. In this paper we describe the relation extraction and clustering components of the IDEX system. Preliminary evaluation results of these components are presented, and an overview is given of possible enhancements to improve the relation extraction and clustering components.
- Kathrin Eichler, Holmer Hemsen, Markus Löckelt, Günter Neumann and Norbert Reithinger (2008) Interactive Dynamic Information Extraction. Annual German Conference on Artificial Intelligence, Kaiserslautern, Germany.
The IDEX system is a prototype of an interactive dynamic Information Extraction (IE) system. A user of the system expresses an information request in the form of a topic description, which is used for an initial search in order to retrieve a relevant set of documents. On the basis of this set of documents, unsupervised relation extraction and clustering are performed by the system. In contrast to most current IE systems, the IDEX system is domain-independent and tightly integrates a GUI for interactive exploration of the extraction space.
- Günter Neumann (2008) Strategien zur Webbasierten Multilingualen Fragebeantwortung - Wie Suchmaschinen zu Antwortmaschinen werden. Journal Computer Science - Research and Development, Volume 22, Number 2 / Februar 2008, pages 71-84.
We present a series of innovative methods for a web-based multilingual question answering in open domains. In particular, we present novel strategies for the determination of optimal answer contexts and for the extraction of exact answers on the basis of language-independent, data-driven Machine Learning algorithms. Two alternative methods for the cross-lingual question analysis are presented that are used for finding answers in documents of one natural language using a query formulated in another natural language. All methods are evaluated in detail and demonstrate a promising performance.
- Alexander Volokh and Günter Neumann (2008) A Puristic Approach for Joint Dependency Parsing and Semantic Role Labeling. CoNLL-2008 shared task, Manchester, UK.
We present a puristic approach for combining dependency parsing and semantic role labeling. In a first step, a data-driven, strictly incremental, deterministic parser is used to compute a single syntactic dependency structure using a MEM trained on the syntactic part of the CoNLL 2008 training corpus. In a second step, a cascade of MEMs is used to identify predicates and, for each found predicate, to identify its arguments and their types. All the MEMs used here are trained only with labeled data from the CoNLL 2008 corpus. We participated in the closed challenge and obtained a labeled macro F1 for WSJ+Brown of 19.93 (20.13 on WSJ only, 18.14 on Brown). For the syntactic dependencies we got similarly poor results (WSJ+Brown=16.25, WSJ=16.22, Brown=16.47), as well as for the semantic dependencies (WSJ+Brown=22.36, WSJ=22.86, Brown=17.94). The current results of the experiments suggest that our risky puristic approach of following a strict incremental parsing strategy together with the closed data-driven perspective of a joint syntactic and semantic labeling was too optimistic and eventually too puristic.
- Rui Wang and Günter Neumann (2008) Relation Validation via Textual Entailment. ONTOLOGY-BASED INFORMATION EXTRACTION SYSTEMS (OBIES 2008), KI-2008, Sept. 2008.
This paper addresses a subtask of relation extraction, namely Relation Validation. Relation validation can be described as follows: given an instance of a relation and a relevant text fragment, the system is asked to decide whether this instance is true or not. Instead of following the common approaches of using statistical or context features directly, we propose a method based on textual entailment (called ReVaS). We set up two different experiments to test our system: one is based on an annotated data set; the other is based on real web data via the integration of ReVaS with an existing IE system. For the latter case, we examine in detail the two aspects of the validation process, i.e. directionality and strictness. The results suggest that textual entailment is a feasible way for the relation validation task.
- Bogdan Sacaleanu, Günter Neumann and Christian Spurk (2008) DFKI at QA@CLEF 2008. Working Notes for the CLEF 2008 Workshop, 17-19 September, Aarhus, Denmark.
This Working Note briefly presents QUANTICO, a cross-language open domain question answering system for German and English document collections. The main features of the system are: use of preemptive off-line document annotation with information like Named Entities, sentence boundaries and pronominal anaphora resolution; online extraction of abbreviation-extension pairs and appositional constructions for the answer extraction; use of online translation services for the cross-language scenarios and of English as interlingua for language combinations not supported directly; use of redundancy as an indicator of good answer candidates; selection of the best answers based on distance metrics defined over graph representations. Based on the question type two different strategies of answer extraction are triggered: for factoid questions answers are extracted from best IR-matched passages and selected by their redundancy and distance to the question keywords; for definition questions answers are considered to be either the first sentence of description paragraphs in Wikipedia documents or the most redundant normalized linguistic structures with explanatory role (i.e., appositions, abbreviation’s extensions). The results of evaluating the system’s performance at QA@CLEF 2008 were as follows: for the German-German run we achieved a best overall accuracy (ACC) of 37%; for the English-German run 14.5% (ACC); and for the German-English run 14% (ACC).
We achieved best results for monolingual German and second best for (cross-lingual) English as target.
- Rui Wang and Günter Neumann (2008) Information Synthesis for Answer Validation. Working Notes for the CLEF 2008 Workshop, 17-19 September, Aarhus, Denmark.
This report is about our participation in the Answer Validation Exercise (AVE 2008). Our system casts the AVE task into a Recognizing Textual Entailment (RTE) problem and uses an existing RTE system to validate answers. Additional information from a named-entity (NE) recognizer, a question analysis component, and other sources is also used to assist in making the final decision. In all, we have submitted two runs, one for English and the other for German, achieving f-measures of 0.64 and 0.61 respectively. Compared with our system last year, which purely depended on the output of the RTE system, the extra information does show its effectiveness.
We achieved best results for English and German.
- Rui Wang and Günter Neumann (2008) Ontology-based Query Construction for GeoCLEF. Working Notes for the CLEF 2008 Workshop, 17-19 September, Aarhus, Denmark.
This paper describes our participation in GeoCLEF. In contrast to traditional information retrieval, we focus more on query expansion than on document ranking. We parse each topic into an event part and a geographic part and use different ontologies to expand both parts respectively. The results show clear advantages of our strategy for this task.
We achieved best results for English.
- Alejandro Figueroa and Günter Neumann (2007) Identifying Protein-Protein interactions in Biomedical publications. Second BioCreAtIvE Challenge Evaluation Workshop - Critical Assessment of Information Extraction in Molecular Biology, Fundacion CNIO Carlos III, 2007, ISBN 84-933255-6-2, p. 217-225.
The paper describes the approaches and the results of our participation in the protein-protein interaction (PPI) extraction task (sub-tasks 1 to 3) of the BioCreative II challenge. The core of our approach is to analyze the logical forms of those sentences which contain mentions of relevant protein names, and to rank the sentences from which the relations were extracted using the class descriptors computed in sub-task 1 and interaction sentences from the Christine Brun corpus.
- Alejandro Figueroa and Günter Neumann (2007) A Multilingual Framework for Searching Definitions on Web Snippets. Advances in Artificial Intelligence, LNCS, Volume 4667/2007, ISBN978-3-540-74564-8, p. 144-159.
This work presents Mdef-WQA, a system that searches for answers to definition questions in several languages on web snippets. For this purpose, Mdef-WQA biases the search engine in favor of some syntactic structures that often convey definitions. Once descriptive sentences are identified, Mdef-WQA clusters them by potential senses and presents the most relevant phrases of each potential sense to the user. The approach was assessed with TREC and CLEF data. As a result, Mdef-WQA was able to extract descriptive information for all definition questions in the TREC 2001 and 2003 data-sets.
- Andrea Heyl and Günter Neumann (2007) An Information Extraction based Approach to People Disambiguation. Fourth International Workshop on Semantic Evaluations (SemEval-2007), Co-Located with (ACL-2007), June 23-24, Prague, Czech Republic, Pages 137-140.
We propose an IE based approach to people disambiguation. We assume the mentioning of NEs and the relational context of a person in the text to be important discriminating features in order to distinguish different people sharing a name.
- Rui Wang and Günter Neumann (2007) Recognizing Textual Entailment Using a Subsequence Kernel Method. Association for the Advancement of Artificial Intelligence (AAAI), Vancouver, Canada.
We present a novel approach to recognizing Textual Entailment. Structural features are constructed from abstract tree descriptions, which are automatically extracted from syntactic dependency trees. These features are then applied in a subsequence-kernel-based classifier to learn whether an entailment relation holds between two texts. Our method makes use of machine learning techniques using a limited data set, no external knowledge bases (e.g. WordNet), and no handcrafted inference rules. We achieve an accuracy of 74.5% for text pairs in the Information Extraction and Question Answering task, 63.6% for the RTE-2 test data, and 66.9% for the RTE-3 test data.
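The classifier relies on a subsequence kernel over feature sequences. The paper's exact feature encoding is not reproduced here; as a hedged sketch, the standard gap-weighted subsequence kernel of Lodhi et al. (2002) over generic item sequences illustrates the kind of kernel involved.

```python
# Gap-weighted subsequence kernel, summed over subsequence lengths 1..n.
def subsequence_kernel(s, t, n, lam=0.5):
    """s, t: sequences of hashable items; lam in (0, 1] penalises gaps."""
    len_s, len_t = len(s), len(t)
    # k_prime holds K'_{length-1}; K'_0 is 1 everywhere
    k_prime = [[1.0] * (len_t + 1) for _ in range(len_s + 1)]
    kernel = 0.0
    for _length in range(1, n + 1):
        k_prime_new = [[0.0] * (len_t + 1) for _ in range(len_s + 1)]
        k_dprime = [[0.0] * (len_t + 1) for _ in range(len_s + 1)]
        for i in range(1, len_s + 1):
            for j in range(1, len_t + 1):
                if s[i - 1] == t[j - 1]:
                    k_dprime[i][j] = lam * (k_dprime[i][j - 1]
                                            + lam * k_prime[i - 1][j - 1])
                    kernel += lam * lam * k_prime[i - 1][j - 1]
                else:
                    k_dprime[i][j] = lam * k_dprime[i][j - 1]
                k_prime_new[i][j] = lam * k_prime_new[i - 1][j] + k_dprime[i][j]
        k_prime = k_prime_new
    return kernel
```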
- Rui Wang and Günter Neumann (2007) Recognizing Textual Entailment Using Sentence Similarity based on Dependency Tree Skeletons. ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, Prague, Czech Republic.
We present a novel approach to RTE that exploits a structure-oriented sentence representation followed by a similarity function. The structural features are automatically acquired from tree skeletons that are extracted and generalized from dependency trees. Our method makes use of a limited size of training data without any external knowledge bases (e.g. WordNet) or handcrafted inference rules. We have achieved an accuracy of 71.1% on the RTE-3 development set performing a 10-fold cross validation and 66.9% on the RTE-3 test data.
- Bogdan Sacaleanu, Günter Neumann and Christian Spurk (2007) DFKI-LT at QA@CLEF 2007. CLEF 2007 Working Papers, Budapest.
This Working Note briefly presents QUANTICO, a cross-language open domain question answering system for German and English document collections. The main features of the system are: use of preemptive off-line document annotation with information like Named Entities, sentence boundaries and pronominal anaphora resolution; online extraction of abbreviation-extension pairs and appositional constructions for the answer extraction; use of online translation services for the cross-language scenarios and of English as interlingua for language combinations not supported directly; use of redundancy as an indicator of good answer candidates; selection of the best answers based on distance metrics defined over graph representations. Based on the question type two different strategies of answer extraction are triggered: for factoid questions answers are extracted from best IR-matched passages and selected by their redundancy and distance to the question keywords; for definition questions answers are considered to be the most redundant normalized linguistic structures with explanatory role (i.e., appositions, abbreviation’s extensions). The results of evaluating the system’s performance at QA@CLEF 2007 were as follows: for the German-German run we achieved an overall accuracy (ACC) of 30%; for the English-German run 18.5% (ACC); for the German-English run 7% (ACC), for the Spanish-English run 10% (ACC) and for the Portuguese-German run 7% (ACC).
- Rui Wang and Günter Neumann (2007) DFKI–LT at AVE 2007: Using Recognizing Textual Entailment for Answer Validation. CLEF 2007 Working Papers, Budapest.
This report is about our participation in the Answer Validation Exercise (AVE) 2007. Our system utilizes a Recognizing Textual Entailment (RTE) system as a component to validate answers. We first change the question and the answer into Hypothesis (H) and view the document as Text (T), in order to cast the AVE task into a RTE problem. Then, we use our RTE system to tell us whether the entailment relation holds between the documents (i.e. Ts) and question-answer pairs (i.e. Hs). Finally, we adapt the results for the AVE task. In all, we have submitted two runs and achieved f-measures of 0.46 and 0.55 respectively, which both outperform last year’s best result for English. After detailed error analysis, we have found that both the recall and the precision of our system could be improved in the future.
We achieved the best result for English.
- Rui Wang and Günter Neumann (2007) DFKI–LT at QAST 2007: Adapting QA Components to Mine Answers in Speech Transcripts. CLEF 2007 Working Papers, Budapest.
The paper describes QAst-v1, a robust question answering system for answering factoid questions in manual and automatic transcriptions of speech. Our system is an adaptation of our text-based cross-lingual open-domain QA system that we used for the CLEF main tasks. In particular, we assume that good answer candidates to factoid questions are named entities which are type-compatible with the expected answer type of the question. The main features of QAst-v1 are: use of preemptive off-line annotation of speech transcripts with sentence boundaries, chunk structures and named entities (NEs); construction of a full-text search index using words and all found NEs; use of a robust Wh-analysis component to determine shallow dependency structures, recognition of NEs, and the expected answer type (EAT); use of EAT-driven retrieval of sentences and answer candidates; use of redundancy as an indicator of good answer candidates. The main focus of our effort was on the technical realization of a first QAST research prototype making use of as many of our existing QA components as possible. The results of evaluating the system’s performance at QAST 2007 were as follows: for subtask T1 (Question Answering in manual transcriptions of lectures) we achieved an overall accuracy (ACC) of 15% and a mean reciprocal rank (MRR) of 0.17; for subtask T2 (Question Answering in automatic transcriptions of lectures) we obtained 9% (ACC) and 0.09 (MRR).
- Abdelhadi Soudi, Antal Van den Bosch and Günter Neumann (2007) Arabic Computational Morphology: Knowledge-based and Empirical Methods - Overview and background. A. Soudi, A. van den Bosch, and G. Neumann (eds.), Arabic Computational Morphology: Knowledge-based and EmpiricalMethods, Springer, 2007, pages 3-14.
The morphology of Arabic poses special challenges to computational natural language processing systems. The exceptional degree of ambiguity in the writing system, the rich morphology, and the highly complex word formation process of roots and patterns all contribute to making computational approaches to Arabic very challenging. Indeed many computational linguists across the world have taken up this challenge over time, and we have been able to commit many of the researchers with a track record in this research area to contribute to this book. The book’s subtitle aims to reflect that widely different computational approaches to the Arabic morphological system have been proposed. These accounts fall into two main paradigms: the knowledge-based and the empirical. Since morphological knowledge plays an essential role in any higher-level understanding and processing of Arabic text, the book also features a part on the integration of Arabic morphology in larger applications, namely Information Retrieval (IR) and Machine Translation (MT).
- Abdelhadi Soudi, Antal Van den Bosch and Günter Neumann (2007) Arabic Computational Morphology: Knowledge-based and Empirical Methods. Book series: Text, Speech and Language Technology, Vol. 38, 2007, VIII, 308 p., Hardcover ISBN: 978-1-4020-6045-8.
The morphology of Arabic poses special challenges to computational natural language processing systems. The exceptional degree of ambiguity in the writing system, the rich morphology, and the highly complex word formation process of roots and patterns all contribute to making computational approaches to Arabic very challenging. Indeed many computational linguists across the world have taken up this challenge over time, and many of the researchers with a track record in this research area have contributed to this book.
The book’s subtitle aims to reflect that widely different computational approaches to the Arabic morphological system have been proposed. These accounts fall into two main paradigms: the knowledge-based and the empirical. Since morphological knowledge plays an essential role in any higher-level understanding and processing of Arabic text, the book also features a part on the role of Arabic morphology in larger applications, i.e. Information Retrieval (IR) and Machine Translation (MT).
- Bogdan Sacaleanu and Günter Neumann (2007) A Cross-Lingual German-English Framework for Open-Domain Question Answering. Evaluation of Multilingual and Multi-modal Information Retrieval, Revised Selected Papers, CLEF-2006, Springer-Verlag, LNCS (4730), Alicante, Spain, 20-22.
The paper describes QUANTICO, a cross-language open domain question answering system for German and English. The main features of the system are: use of preemptive off-line document annotation with syntactic information like chunk structures, apposition constructions and abbreviation-extension pairs for the passage retrieval; use of online translation services, language models and alignment methods for the cross-language scenarios; use of redundancy as an indicator of good answer candidates; selection of the best answers based on distance metrics defined over graph representations. Based on the question type two different strategies of answer extraction are triggered: for factoid questions answers are extracted from best IR-matched passages and selected by their redundancy and distance to the question keywords; for definition questions answers are considered to be the most redundant normalized linguistic structures with explanatory role (i.e., appositions, abbreviation’s extensions). The results of evaluating the system’s performance by CLEF were as follows: for the best German-German run we achieved an overall accuracy (ACC) of 42.33% and a mean reciprocal rank (MRR) of 0.45; for the best English-German run 32.98% (ACC) and 0.35 (MRR); for the German-English run 17.89% (ACC) and 0.17 (MRR).
We achieved best results for the monolingual German task and the English-German bilingual task.
- Alejandro Figueroa and Günter Neumann (2007) Mining Web Snippets to Answer List Questions. Second International Workshop on Integrating AI and Data Mining (AIDM 2007).
This paper presents ListWebQA, a question answering system that is aimed specifically at extracting answers to list questions exclusively from web snippets. Answers are identified in web snippets by means of their semantic and syntactic similarities. Initial results show that they are a promising source of answers to list questions.
- Anupriya Ankolekar, Paul Buitelaar, Philipp Cimiano, Pascal Hitzler, Malte Kiesel, Markus Krötzsch, Holger Lewen, Günter Neumann, Michael Sintek, Tuvshintur Tserendorj and Rudi Studer (2006) SmartWeb: Mobile Access to the Semantic Web. International Semantic Web Conference, Athens, GA, USA.
We present the SmartWeb Demonstrator for multimodal and mobile querying of semantic resources and the open WWW. The end-user interface consists of a Pocket Data Assistant which accepts written or spoken questions as input and delivers answers based on a multitude of resources including a semantic knowledge base, semantically annotated online web services, and semi-automatically created knowledge from text-based web pages. If answers cannot be found using these structured resources, then the system returns answers based on linguistic query-answering techniques on the open WWW.
- Alejandro Figueroa and Günter Neumann (2006) Language Independent Answer Prediction from the Web. Fifth International Conference on Natural Language Processing (FinTal), August 23-25 in Turku, Finland.
This work presents a strategy that aims to extract and rank predicted answers from the web based on the eigenvalues of a specially designed matrix. This matrix models the strength of the syntactic relations between words by means of the frequency of their relative positions in sentences extracted from web snippets. We assess the rank of predicted answers by extracting answer candidates for three different kinds of questions. Due to the low dependence upon a particular language, we also apply our strategy to questions from four different languages: English, German, Spanish, and Portuguese.
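The abstract only states that the matrix models the strength of syntactic relations via the frequency of relative word positions and that predicted answers are ranked through its eigenvalues. The numpy sketch below is purely illustrative: the window size, the inverse-distance weight, and the use of the principal eigenvector to score words are assumptions, not the paper's concrete matrix design.

```python
# Hedged sketch: rank words by the principal eigenvector of a
# position-weighted co-occurrence matrix built from snippet sentences.
import numpy as np

def rank_words_by_eigenvector(sentences, window=3):
    """sentences: token lists extracted from web snippets."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    m = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if i != j:
                    # closer relative positions contribute more strongly
                    m[index[w], index[s[j]]] += 1.0 / abs(i - j)
    # m is symmetric by construction, so its eigenvalues are real
    eigvals, eigvecs = np.linalg.eigh(m)
    principal = np.abs(eigvecs[:, np.argmax(eigvals)])
    return sorted(zip(vocab, principal), key=lambda x: -x[1])
```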
- Günter Neumann (2006) A Hybrid Machine Learning Approach for Information Extraction from Free Texts. From Data and Information Analysis to Knowledge Engineering, Springer series: Studies in Classification, Data Analysis, and Knowledge Organization, pages 390-397, Springer-Verlag Berlin, Heidelberg, New-York.
We present a hybrid machine learning approach for information extraction from unstructured documents by integrating a learned classifier based on the Maximum Entropy Modeling (MEM), and a classifier based on our work on Data–Oriented Parsing (DOP). The hybrid behavior is achieved through a voting mechanism applied by an iterative tag–insertion algorithm. We have tested the method on a corpus of German newspaper articles about company turnover, and achieved 85.2% F-measure using the hybrid approach, compared to 79.3% for MEM and 51.9% for DOP when running them in isolation.
- Günter Neumann and Berthold Crysmann (2006) Exploring HPSG-based Treebanks for Probabilistic Parsing. Fifth International Conference on Language Resources and Evaluation (LREC), Genoa, Italy.
We describe a method for the automatic extraction of a Stochastic Lexicalized Tree Insertion Grammar from a linguistically rich HPSG Treebank. The extraction method is strongly guided by HPSG–based head and argument decomposition rules. The tree anchors correspond to lexical labels encoding fine–grained information. The approach has been tested with a German corpus achieving a labeled recall of 77.33% and labeled precision of 78.27%, which is competitive to recent results reported for German parsing using the Negra Treebank.
- Günter Neumann and Bogdan Sacaleanu (2006) Experiments on Cross-Linguality and Question-type driven Strategy Selection for Open-Domain Question Answering. Multilingual Information Repositories, Revised Selected Papers, CLEF-2005, Springer-Verlag, LNCS (4022), Vienna, Austria, pp. 429-438.
We describe the extensions made to our 2004 QA@CLEF German/English QA-system, toward a fully German-English/English-German cross-language system with answer validation through web usage. Details concerning the processing of factoid, definition and temporal questions are given and the results obtained in the monolingual German, bilingual English-German and German-English tasks are briefly presented and discussed. We achieved best results for English and German as target languages.
- Bogdan Sacaleanu and Günter Neumann (2006) Cross-Cutting Aspects of Cross-Language Question Answering Systems. EACL workshop on Multilingual Question Answering - MLQA'06, Trento, Italy.
We describe on-going work in the development of a cross-language question answering framework for the open domain. An overview of the framework is provided, some details on the important concepts of a flexible framework are presented, and two cross-cutting aspects of question-answering systems (cross-linguality and answer credibility) are put up for discussion.
- Günter Neumann (2005) A Hybrid Machine Learning Approach to Information Extraction. Booklet of the 29th Annual Conference of the German Classification Society (GfKl 2005) - Special Track on Text Mining-, Magdeburg, March 2005.
- Günter Neumann and Bogdan Sacaleanu (2005) Experiments on Robust NL Question Interpretation and Multi-layered Document Annotation for a Cross-Language Question/Answering System. Fifth Workshop of the Cross-Language Evaluation Forum, CLEF 2004, Revised Selected Papers Series, Springer-Verlag, LNCS (3491), Bath, UK, pp. 411-422.
This report describes the work done by the QA group of the Language Technology Lab at DFKI, for the 2004 edition of the Cross-Language Evaluation Forum (CLEF). Based on the experience we obtained through our participation at QA@Clef-2003 with our initial cross-lingual QA prototype system BiQue, the focus of the system extension for this year’s task was a) on robust NL question interpretation using advanced linguistic-based components, b) flexible interface strategies to IR-search engines, and c) on strategies for off-line annotation of the data collection, which support query-specific indexing and answer selection. The overall architecture of the extended system, as well as the results obtained in the CLEF–2004 Monolingual German and Bilingual German/English QA tracks will be presented and discussed throughout the paper.
We achieved best results for English as target language.
- Günter Neumann (2004) Domänenadaptive Kerntechnologien für die Analyse und Synthese von Natürlicher Sprache. Habilitationsschrift in Computational Linguistics, University of the Saarland, July, 2004.
On 14th July of 2004, I obtained the Venia Legendi (Habilitation - see Wikipedia entry) in “Computational Linguistics” with the topic “Domain-adaptive Core Technology for Natural Language Analysis and Generation” at the University of the Saarland, Germany. Committee: Prof. Dr. Pinkal, Prof. Dr. Uszkoreit, Prof. Dr. Wahlster (all: Univ. of the Saarland), Prof. Dr. Nerbonne (Univ. of Groningen). This paper gives a short summary in German about my major publications that I submitted as part of this academic procedure.
- Günter Neumann and Feiyu Xu (2004) Mining Natural Language Answers from the Web. Journal Web Intelligence and Agent Systems, Volume 2, Number 2, 2004, S. 123-135.
We present a novel method for mining textual answers in Web pages using semi-structured NL questions and Google for initial document retrieval. We exploit the redundancy on the Web by weighting all identified named entities (NEs) found in the relevant document set based on their occurrences and distributions. The ranked NEs are used as our primary anchors for document indexing, paragraph selection, and answer identification. The latter is dependent on two factors: the overlap of terms at different levels (e.g., tokens and named entities) between queries and sentences, and the relevance of identified NEs corresponding to the expected answer type. The set of answer candidates is further subdivided into ranked equivalent classes from which the final answer is selected. The system has been evaluated using question-answer pairs extracted from a popular German quiz book.
- Günter Neumann and Bogdan Sacaleanu (2004) A Cross-Language Question/Answering-System for German and English. Fourth Workshop of the Cross-Language Evaluation Forum, Revised papers, Springer-Verlag LNCS (3237), Trondheim, Norway, pp. 559-571.
This report describes the work done by the QA group of the Language Technology Lab at DFKI for the 2003 edition of the Cross-Language Evaluation Forum (CLEF). We have participated in the new track “Multiple Language Question Answering (QA@CLEF)” that offers tasks to test monolingual and cross-language QA systems. In particular we developed an open-domain bilingual QA system for German source language queries and English target document collections. Since it was our very first participation in such a competition, the focus was on system implementation rather than system tuning.
- Günter Neumann (2004) Informationsextraktion. Carstensen et al (eds): Computerlinguistik und Sprachtechnologie - Eine Einführung, Heidelberg: Spektrum-Verlag, ISBN 3827414075, second edition.
This is the second edition of the short eight page overview article about information extraction from Neumann (2001) adding only some more recent developments in the field of IE.
- Günter Neumann (2003) A Data-Driven Approach to Head-Driven Phrase Structure Grammar. Rens Bod, Remko Scha and Khalil Sima'an (eds): Data-Oriented Parsing, CSLI publications, Stanford, Ca, 2003, pp. 233-251.
We present HPSG–DOP, a method for automatically extracting a Stochastic Lexicalized Tree Grammar (SLTG) from an HPSG source grammar and a given corpus. Processing of a SLTG is performed by a specialized fast parser. The approach has been tested on a large English grammar and has been shown to achieve an additional performance increase compared to parsing with a highly tuned HPSG parser. Our approach is simple and transparent. The extracted grammars are declaratively represented and have a high degree of practical applicability.
- Günter Neumann (2003) A Uniform Method for Automatically Extracting Stochastic Lexicalized Tree Grammars from Treebanks and HPSG. Anne Abeillé (ed) Treebanks - Building and using Parsed Corpora, Series: Text, Speech and Language Technology, VOL 20, 2003, 440 p., Softcover, ISBN: 978-1-4020-1335-5, pp. 351-365.
We present a uniform method for the extraction of stochastic lexicalized tree grammars (SLTG) of different complexities from existing treebanks as well as from competence-based grammars, which allows us to analyze the relationship of a grammar automatically induced from a treebank with respect to its size, its complexity, and its predictive power on unseen data. Processing of different SLTG is performed by a stochastic version of the two-step Earley-based parsing strategy introduced in Schabes and Joshi, 1991.
- Günter Neumann, Feiyu Xu and Bogdan Sacaleanu (2003) Strategies for Web-based Cross-Language Question Answering. Second CoLogNET-ElsNET Symposium on Questions and Answers: Theoretical and Applied Perspectives, Amsterdam, 18 December, 2003, pp. 84-95.
In this paper we present our current state of the art in the development of a hybrid QA architecture. In particular we present a strategy for open-domain web-based QA, and a strategy for open-domain cross-language QA. In both cases the focus is on processing fact-based questions and exact answer strategies using the Web as the primary document source. Both strategies are realized using the same core technology, have been implemented for German and English queries and documents, and have been tested for German Web pages.
- Günter Neumann and Feiyu Xu (2003) Mining Answers in German Web Pages. The International Conference on Web Intelligence (WI 2003), Halifax, Canada.
We present a novel method for mining textual answers in German Web pages using semi-structured NL questions and Google for initial document retrieval. We exploit the redundancy on the Web by weighting all identified named entities (NEs) found in the relevant document set based on their occurrences and distributions. The ranked NEs are used as our primary anchors for document indexing, paragraph selection, and answer identification. The latter is dependent on two factors: the overlap of terms at different levels (e.g., tokens and named entities) between queries and sentences, and the relevance of identified NEs corresponding to the expected answer type. The set of answer candidates is further subdivided into ranked equivalence classes from which the final answer is selected. The system has been evaluated using question-answer pairs extracted from a popular German quiz book.
- Berthold Crysmann, Anette Frank, Bernd Kiefer, Hans-Ulrich Krieger, Stephan Müller, Günter Neumann, Jakub Piskorski, Ulrich Schäfer, Melanie Siegel, Hans Uszkoreit and Feiyu Xu (2002) An Integrated Architecture for Shallow and Deep Processing. Association for Computational Linguistics 40th Anniversary Meeting (ACL), University of Pennsylvania, Philadelphia, USA.
We present a flexible architecture for the integration of shallow and deep NLP components which is aimed at flexible combination of different language technologies for a range of practical current and future applications. In particular, we describe the integration of a high-level HPSG parsing system with different high-performance shallow components, ranging from named entity recognition to chunk parsing and shallow clause recognition. The NLP components enrich a representation of natural language text with layers of new XML meta-information using a single shared data structure, called the text chart. We describe details of the integration methods, and show how information extraction and language checking applications for real-world German text benefit from a deep grammatical analysis.
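The shared data structure mentioned above can be pictured as a set of stand-off annotation layers over one text. The following minimal Python sketch illustrates that general idea; the class names, layer names, and offsets are invented for illustration, and the actual text chart is a much richer XML-based structure.

```python
# Minimal sketch of a "text chart"-style shared data structure: stand-off
# annotation layers over one text. Illustrative only, not the WHITEBOARD
# implementation.
from dataclasses import dataclass, field

@dataclass
class Annotation:
    start: int            # character offset into the text
    end: int              # end offset (exclusive)
    label: str            # e.g. "NE:organization", "chunk:NP", "main-clause"
    attrs: dict = field(default_factory=dict)

@dataclass
class TextChart:
    text: str
    layers: dict = field(default_factory=dict)   # layer name -> [Annotation]

    def add(self, layer, ann):
        self.layers.setdefault(layer, []).append(ann)

    def spanning(self, layer, start, end):
        """Annotations of a layer covered by a span, e.g. the shallow NEs
        inside a clause found by the deep parser."""
        return [a for a in self.layers.get(layer, [])
                if a.start >= start and a.end <= end]

chart = TextChart("Siemens kauft eine Firma in Berlin.")
chart.add("ne", Annotation(0, 7, "NE:organization"))
chart.add("ne", Annotation(28, 34, "NE:location"))
chart.add("clause", Annotation(0, 35, "main-clause"))
print(chart.spanning("ne", 0, 35))
```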
- Alexander Mädche, Günter Neumann and Steffen Staab (2002) Bootstrapping an Ontology-based Information Extraction System. Szczepaniak, Piotr S.; Segovia, Javier; Kacprzyk, Janusz; Zadeh, Lotfi A. (eds), Intelligent Exploration of the Web, Springer, ISBN 3-7908-1529-2, 2002, pages 345-360.
Automatic intelligent web exploration will benefit from shallow information extraction techniques if the latter can be brought to work within many different domains. The major bottleneck for this, however, lies in the so far difficult and expensive modeling of lexical knowledge, extraction rules, and an ontology that together define the information extraction system. In this paper we present a bootstrapping approach that allows for the fast creation of an ontology-based information extraction system relying on several basic components, viz. a core information extraction system, an ontology engineering environment and an inference engine. We make extensive use of machine learning techniques to support the semi-automatic, incremental bootstrapping of the domain-specific target information extraction system.
- Günter Neumann (2002) Programming Languages in Artificial Intelligence. Bidgoli (ed) Encyclopedia of Information Systems, Academic Press, San Diego, CA, ISBN 0-12-227240-4, 2002, pages 31-45.
This is a short introduction to the programming languages mainly used in Artificial Intelligence, i.e., Lisp, Prolog, and extensions of them.
- Günter Neumann and Dan Flickinger (2002) HPSG-DOP: data-oriented parsing with HPSG. Ninth International Conference on HPSG (HPSG-2002), Kyung Hee University in Seoul, South Korea.
In this paper we apply the idea of data–oriented approaches for achieving domain adaptation as an alternative approach to improve efficient processing of HPSG grammars. The basic idea of HPSG–DOP is to parse all sentences of a representative training corpus using an HPSG grammar and parser in order to automatically acquire from the parsing results a stochastic lexicalized tree grammar (SLTG) such that each resulting parse tree is recursively decomposed into a set of subtrees. The decomposition operation is guided by the head feature principle of HPSG. Each extracted tree is automatically lexically anchored and each node label of the extracted tree compactly represents a set of relevant features by means of a simple symbol.
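To make the decomposition idea more concrete, here is a hedged Python sketch: each parse tree is cut into lexically anchored elementary trees along its head chain, with non-head daughters turned into substitution sites. The tree encoding, the cutting criterion, and the global normalization are simplifications for illustration, not the actual HPSG-DOP implementation.

```python
# Rough sketch of head-guided decomposition into lexically anchored
# elementary trees; data and encoding are toy examples.
from collections import Counter

class Node:
    def __init__(self, label, children=(), head=None, word=None):
        self.label = label
        self.children = list(children)
        self.head = head      # index of the head daughter (None for leaves)
        self.word = word      # lexical anchor for preterminal leaves

def elementary(node, bag):
    """Return the elementary tree rooted at `node`: follow the head chain
    down to the lexical anchor, cut off non-head daughters as substitution
    sites (label + '^') and decompose them as new roots."""
    if not node.children:
        return f"{node.label}[{node.word}]"
    parts = []
    for i, child in enumerate(node.children):
        if i == node.head:
            parts.append(elementary(child, bag))     # stay on the head spine
        else:
            parts.append(child.label + "^")          # substitution site
            bag[elementary(child, bag)] += 1         # new elementary tree
    return f"({node.label} {' '.join(parts)})"

def extract_sltg(trees):
    bag = Counter()
    for t in trees:
        bag[elementary(t, bag)] += 1
    total = sum(bag.values())
    # a real SLTG would normalize per root category; global relative
    # frequencies are used here only to keep the sketch short
    return {tree: n / total for tree, n in bag.items()}

# toy example: "Kim sleeps"
vp = Node("VP", [Node("V", word="sleeps")], head=0)
s  = Node("S", [Node("NP", [Node("N", word="Kim")], head=0), vp], head=1)
print(extract_sltg([s]))
```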
- Günter Neumann and Michael Kappes (2002) A simple base-line text-categorizer for evaluating the effect of feature extraction in text mining applications. Abstract booklet of the 26th Annual Conference of the German Classification Society (GfKl 2002), July 22-24, 2002, University of Mannheim, Germany.
More details of the underlying approach, implementation, and experiments on standard English corpora like the Reuters corpus, but also experiments on additional German corpora from press releases, as well as detailed examples of system runs, can be found in this substantially extended technical report.
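As an illustration of what such a baseline categorizer might look like (the report's actual system and feature extractors are not reproduced here), the sketch below implements a plain bag-of-words multinomial Naive Bayes classifier; swapping the tokenizer for a stemmer or an NE extractor is the kind of feature-extraction variation the evaluation is about.

```python
# A minimal bag-of-words baseline categorizer, written out for transparency;
# illustrative only, not the system described in the technical report.
import math
from collections import Counter, defaultdict

class NaiveBayesBaseline:
    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for features, label in zip(docs, labels):
            self.word_counts[label].update(features)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, features):
        best, best_score = None, float("-inf")
        total_docs = sum(self.class_counts.values())
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for w in features:
                score += math.log((counts[w] + 1) / denom)  # add-one smoothing
            if score > best_score:
                best, best_score = label, score
        return best

# swapping this tokenizer for a stemmer or an NE extractor is the
# "feature extraction" dimension under evaluation
tokenize = lambda text: text.lower().split()
clf = NaiveBayesBaseline().fit(
    [tokenize("stock markets fall"), tokenize("team wins match")],
    ["finance", "sports"])
print(clf.predict(tokenize("markets fall again")))   # -> "finance"
```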
- Günter Neumann and Jakub Piskorski (2002) A Shallow Text Processing Core Engine. Journal Computational Intelligence, Volume 18, Number 3, 2002, pages 451-476.
In this paper we present SPPC, a high-performance system for intelligent extraction of structured data from free text documents. SPPC consists of a set of domain-adaptive shallow core components that are realized by means of cascaded weighted finite state machines and generic dynamic tries. The system has been fully implemented for German; it includes morphological and on-line compound analysis, efficient POS-filtering, high-performance named entity recognition and chunk parsing based on a novel divide-and-conquer strategy. The whole approach proved to be very useful for processing free word order languages like German. SPPC has a good performance (more than 6000 words per second on standard PC environments) and achieves high linguistic coverage, especially for the divide-and-conquer parsing strategy, where we obtained an f-measure of 87.14% on unseen data.
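Two of the ingredients mentioned above, dynamic tries and cascaded processing stages, can be illustrated with the following small Python sketch; it is a toy stand-in, not SPPC's weighted finite-state implementation, and the lexicon entries and stage names are invented.

```python
# Illustrative sketch of (a) a dynamic trie for lexicon lookup and (b) a
# cascade of stages, each consuming the previous stage's output.
class Trie:
    def __init__(self):
        self.children, self.value = {}, None
    def insert(self, word, value):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.value = value
    def lookup(self, word):
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.value

lexicon = Trie()
for w, tag in [("Siemens", "NE-ORG"), ("kauft", "V"), ("Firma", "N")]:
    lexicon.insert(w, tag)

def tokenize(text):
    return [{"form": t} for t in text.replace(".", " .").split()]

def tag(tokens):
    for t in tokens:
        t["tag"] = lexicon.lookup(t["form"]) or "UNK"
    return tokens

def mark_named_entities(tokens):
    for t in tokens:
        if t["tag"].startswith("NE-"):
            t["ne"] = t["tag"][3:]
    return tokens

cascade = [tokenize, tag, mark_named_entities]
result = "Siemens kauft eine Firma."
for stage in cascade:          # each stage enriches the token stream
    result = stage(result)
print(result)
```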
- Günter Neumann and Ulrich Schäfer (2002) WHITEBOARD - Eine XML-basierte Architektur für die Analyse natürlichsprachlicher Texte. European Congress Fair for Technical Communication, Düsseldorf, Germany, 2002, pages 635.01-635.12.
The goal of the WHITEBOARD project is the development, implementation, and evaluation of a novel system architecture that allows language technologies to be combined for practical applications. Language technologies offer various possibilities for a partial analysis of texts, which can be used for information retrieval, information extraction, language checking, and many other applications. The processing methods and tools differ in many dimensions, for example with respect to the levels of linguistic description, the depth of analysis, or the way in which knowledge is derived (linguistically or statistically). The functionality of the methods often overlaps, yet they differ in their strengths and weaknesses. One of the hardest tasks in language processing is the search for optimal combinations of heterogeneous techniques and processing components - the challenge for the WHITEBOARD project.
- Günter Neumann and Sven Schmeier (2002) Shallow Natural Language Technology and Text Mining. Journal Künstliche Intelligenz (KI), the German Journal on Artificial Intelligence, Special Issue on Text Mining, T. Joachims, E. Leopold (Eds.), ISSN: 0933-1875, 23-27.
At the Language Technology Lab of DFKI we are developing advanced robust and efficient methods and components for free NL text processing which are suitable for data-intensive applications like Text Mining, Information Extraction, or intelligent search engines. In this paper we will present a short overview of some of the core components, and how they have been used together with well-known Machine Learning tools as part of two application projects in the area of Text Mining, especially Text Classification.
- Melanie Siegel, Feiyu Xu and Günter Neumann (2001) Customizing GermaNet for the Use in Deep Linguistic Processing. Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations. 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL '01), June 2-7, Pages 165-167, Pittsburgh, PA, USA
In this paper we show an approach to the customization of GermaNet to the German HPSG grammar lexicon developed in the Verbmobil project. GermaNet has a broad coverage of the German base vocabulary and a fine-grained semantic classification, while the HPSG grammar lexicon is comparatively small and has a coarse-grained semantic classification. In our approach, we have developed a mapping algorithm to relate the synsets in GermaNet with the semantic sorts in HPSG. The evaluation result shows that this approach is useful for the lexical extension of our deep grammar development to cope with real-world text understanding.
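A hedged sketch of the general mapping idea (the actual algorithm, hierarchy, and sort inventory are not reproduced here): walk up a synset's hypernym chain until a synset with a manually assigned HPSG semantic sort is reached.

```python
# The hierarchy, the seed table, and the sort names below are invented for
# illustration; they are not GermaNet data or the grammar's sort inventory.
HYPERNYM = {                    # child synset -> parent synset
    "dackel": "hund", "hund": "tier", "tier": "lebewesen",
    "limousine": "auto", "auto": "fahrzeug",
}
SEED_SORTS = {                  # manually mapped anchor synsets -> HPSG sort
    "lebewesen": "animate", "fahrzeug": "vehicle",
}

def sort_for(synset):
    """Climb the hypernym chain until a manually mapped synset is found."""
    while synset is not None:
        if synset in SEED_SORTS:
            return SEED_SORTS[synset]
        synset = HYPERNYM.get(synset)
    return "entity"             # fallback sort

print(sort_for("dackel"))       # -> animate
print(sort_for("limousine"))    # -> vehicle
```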
- Thierry Declerck and Günter Neumann (2001) A Cascaded Shallow Approach to Reference Resolution. EuroConference on Recent Advances in NLP, RANLP-2001, Tzigov Chark, Bulgaria.
In this paper we present the design of a reference resolution module (RM) within an Information Extraction (IE) system for German free text. We show in some detail how such a module can be distributed over sub-components of a shallow Natural Language (NL) processing chain, defining a modular approach to reference resolution.
- Meike Klettke, Mathias Bietz, Ilvio Bruder, Andreas Heuer, Denny Priebe, Günter Neumann, Markus Becker, Jochen Bedersdorfer, Hans Uszkoreit, Alexander Mädche, Steffen Staab and Rudi Studer (2001) GETESS - Ontologien, objektrelationale Datenbanken und Textanalyse als Bausteine einer semantischen Suchmaschine. Journal Datenbank-Spektrum, Vol.1, No.1, 14-24.
This article shows how methods from knowledge representation, computational linguistics, information retrieval, and databases can be used to provide new functionality for search engines and document servers.
- Günter Neumann (2001) Informationsextraktion. Carstensen et al (eds): Computerlinguistik und Sprachtechnologie - Eine Einführung, Spektrum Akademischer Verlag, Heidelberg, 2001, ISBN 3-8274-1027-4.
This is a short eight page overview article about information extraction for the CL textbook. It is written in German.
- Günter Neumann and Thierry Declerck (2001) Domain Adaptive Information Extraction. ILT & CIP Workshop 2001, Shanghai.
We present in this paper the methodology developed within the PARADIME (Parameterizable Domain-Adaptive Information and Message Extraction) project for designing an Information Extraction (IE) system that is easily adaptable to new domains of application. To this end we opted for a strict separation of the (shallow) linguistic processing modules on the one hand and the domain-modeling modules on the other hand, thus aiming for the maximal degree of reusability of common linguistic resources shared by all domains of application. The tools used for the domain modeling allow a declarative description of the domain under consideration and a simple (abstract) mapping to the output of the Natural Language (NL) analysis, thus requiring only little and very general linguistic knowledge for the adaptation of the IE system to new applications. We describe a real-scale experiment on a fast adaptation cycle of the system to a new domain – the soccer domain – and present the first results obtained.
- Ilvio Bruder, Antje Düsterhöft, Markus Becker, Jochen Bedersdorfer and Günter Neumann (2001) GETESS: Constructing a Linguistic Search Index for an Internet Search Engine. International Conference on Applications of Natural Language to Information Systems (NLDB), Versailles, France, Revised Papers, Series: LNCS, 1959, Springer, Berlin, Heidelberg, 227-238.
In this paper, we illustrate how Internet documents can be automatically analyzed in order to capture the content of a document in a more detailed manner than usual. The result of the document analysis is called an abstract, and it is used as a linguistic search index for the Internet search engine GETESS. We show how the linguistic analysis system SMES can be used with a Harvest-based search engine for constructing a linguistic search index. Further, we show how the linguistic index can be exploited for answering user search inquiries.
- Thierry Declerck and Günter Neumann (2000) Using a parameterisable and domain-adaptive information extraction system for annotating large-scale corpora? LREC'2000 Workshop Information Extraction meets Corpus Linguistics, Athens.
In this paper we describe a parameterizable and domain-adaptive Information Extraction (IE) system (for German texts) and present some ideas on how this kind of system could effectively support Corpus Linguistics (CL) tasks. We also tentatively address the complementary question and examine in which sense corpus linguistics can be beneficial to IE, especially in the case of automatic learning of templates of interest for IE tasks, a topic which is crucial for the further development of highly flexible IE systems. We briefly describe some steps taken for the adaptation of the IE system to a new domain in order to illustrate the points where, in our opinion, IE and CL could enter into closer cooperation.
- Jakub Piskorski and Günter Neumann (2000) An Intelligent Text Extraction and Navigation System. Sixth International Conference on Computer-Assisted Information Retrieval (RIAO-2000), Paris.
We present SPPC, a high-performance system for intelligent text extraction and navigation from German free text documents. SPPC consists of a set of domain-independent shallow core components which are realized by means of cascaded weighted finite state machines and generic dynamic tries. All extracted information is represented uniformly in one data structure (called the text chart) in a highly compact and linked form in order to support indexing and navigation through the set of solutions. German text processing includes (among others) compound processing, high performance named entity recognition and chunk parsing based on a divide-and-conquer strategy. SPPC has a good performance (4380 words per second on standard PC environments) and high linguistic coverage.
- Günter Neumann, Christian Braun and Jakub Piskorski (2000) A Divide-and-Conquer Strategy for Shallow Parsing of German Free Texts. Sixth Applied Natural Language Processing Conference (ANLP), Seattle, Washington, 239-246.
We present a divide-and-conquer strategy based on finite state technology for shallow parsing of real-world German texts. In a first phase, only the topological structure of a sentence (i.e., verb groups, subclauses) is determined. In a second phase, the phrasal grammars are applied to the contents of the different fields of the main and sub-clauses. Shallow parsing is supported by suitably configured preprocessing, including morphological and on-line compound analysis, efficient POS-filtering, and named entity recognition. The whole approach proved to be very useful for processing free word order languages like German. Especially for the divide-and-conquer parsing strategy we obtained an f-measure of 87.14% on unseen data.
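The two-phase idea can be illustrated with a toy Python sketch; the field names, the finite-verb list, and the NP pattern are invented stand-ins for the original finite-state grammars.

```python
# Toy illustration of the two-phase strategy: phase 1 finds a coarse
# topological structure around the finite verb, phase 2 applies a phrasal
# (here: NP) pattern inside each field. Not the original grammars.
import re

FINITE_VERBS = {"kauft", "sieht"}   # toy list of finite verb forms

def phase1_topology(tokens):
    """Split the token list into fields around the finite verb."""
    for i, tok in enumerate(tokens):
        if tok.lower() in FINITE_VERBS:
            return {"vorfeld": tokens[:i],
                    "verb": [tokens[i]],
                    "mittelfeld": tokens[i + 1:]}
    return {"vorfeld": tokens, "verb": [], "mittelfeld": []}

# naive determiner + noun pattern standing in for a phrasal grammar
NP_PATTERN = re.compile(r"\b(?:der|die|das|eine?n?)\b\s+\w+", re.IGNORECASE)

def phase2_chunks(field_tokens):
    """Apply the phrasal pattern to the text of one field."""
    return NP_PATTERN.findall(" ".join(field_tokens))

sentence = "Der Konzern kauft eine kleine Firma".split()
fields = phase1_topology(sentence)
chunks = {name: phase2_chunks(toks) for name, toks in fields.items()}
print(fields["verb"], chunks)
```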
- Alexander Mädche, Günter Neumann and Steffen Staab (1999) A Generic Architectural Framework for Text Knowledge Acquisition. DFKI Technical Report, Saarbrücken, Germany.
As knowledge engineering matures and the amount of textual resources grows, the pressure for (semi-)automatic acquisition of knowledge structures from texts increases drastically. With automatic text understanding techniques still being far from perfect, the best solution one may offer is a framework that allows for flexible integration of a large variety of techniques into one text knowledge acquisition environment. Thus, we here present an architecture that integrates seminal, but rather isolated accounts for text knowledge acquisition into the MIKE knowledge engineering framework. Building on a survey of methods for text knowledge acquisition, we devise a scheme that lends itself to the acquisition problem, in general, and to the construction of natural-language applications, in particular.
- Günter Neumann and Dan Flickinger (1999) Learning Stochastic Lexicalized Tree Grammars from HPSG. DFKI Technical Report, Saarbrücken, Germany.
We present a method for automatically extracting a Stochastic Lexicalized Tree Grammar (SLTG) from an HPSG source grammar and a given corpus. Processing of a SLTG is performed by a specialized fast parser. The approach has been tested on a large English grammar and has been shown to achieve a speed-up by a factor of better than 10 compared to parsing with a highly tuned HPSG parser. Our approach is simple and transparent, and comes with no magic tuning strategies. The extracted grammars are declaratively represented and have a high degree of practical applicability.
- Günter Neumann and Giampollo Mazzini (1999) Domain adaptive information extraction. DFKI Technical Report, Saarbrücken, Germany.
We present methods and technologies which support the development of information extraction (IE) systems that can rapidly be configured for new application domains and tasks, i.e., IE systems which support a fast application development cycle. The core idea is to model the linkage of domain-independent and domain knowledge on a much higher and more abstract level than is usually done in current IE systems, without critical compromises with respect to robustness and efficiency.
- Günter Neumann and Sven Schmeier (1999) Combining Shallow Text Processing and Machine Learning in Real World Applications. IJCAI-99 workshop on Machine Learning for Information Filtering, Stockholm, Sweden.
We present the first results and experiences we obtained by combining shallow text processing methods with machine learning tools. In two research projects in which DFKI and industrial partners are involved, German real-world texts have to be classified into several predefined categories. We point out that decisions concerning questions such as how deeply the texts have to be analyzed linguistically and how ML tools must be parameterized are highly domain and data dependent. On the other hand, there are some constants or heuristics which may point in the right direction for future applications.
- Steffen Staab, Christian Braun, Ilvio Bruder, Antje Düsterhöft, Andreas Heuer, Meike Klettke, Günter Neumann, Bernd Prager, Jan Pretzel, Hans-Peter Schnurr, Rudi Studer, Hans Uszkoreit and Burkhard Wrengers (1999) Getess - searching the Web exploiting German texts. Third workshop on Cooperative Information Agents, Series: LNCS 1652, Springer, Berlin, Heidelberg.
We present an intelligent information agent that uses semantic methods and natural language processing capabilities in order to gather tourist information from the WWW and present it to the human user in an intuitive, user-friendly way. Thereby, the information agent is designed such that as background knowledge and linguistic coverage increase, its benefits improve, while it guarantees state-of-the-art information and database retrieval capabilities as its bottom line.
- Steffen Staab, Christian Braun, Ilvio Bruder, Antje Düsterhöft, Andreas Heuer, Meike Klettke, Günter Neumann, Bernd Prager, Jan Pretzel, Hans-Peter Schnurr, Rudi Studer, Hans Uszkoreit and Burkhard Wrengers (1999) A System for Facilitating and Enhancing Web Search. International Working Conference on Artificial and Natural Neural Networks, Alicante, ES, Series: LNCS 1607, Springer, Berlin, Heidelberg.
We present a system that uses semantic methods and natural language processing capabilities in order to provide comprehensive and easy-to-use access to tourist information in the WWW. Thereby, the system is designed such that as background knowledge and linguistic coverage increase, the benefits of the system improve, while it guarantees state-of-the-art information and database retrieval capabilities as its bottom line.
- Thierry Declerck, Judith Klein and Günter Neumann (1998) Evaluation of the NLP Components of an Information Extraction System for German. First International Conference on Language Resources and Evaluation (LREC), Granada, 293-297.
This paper describes ongoing work on the evaluation of the NLP components of the core engine of SMES (Saarbrücker Message Extraction System), which consists of a tokenizer, an efficient and robust German morphology, a part-of-speech (POS) tagger, a parsing module, a linguistic knowledge base and an output construction component. Currently the morphology, the tagger and parsing module (NP grammar) are under evaluation, at distinct degrees of progress. We present the methodology used and the results obtained so far.
- Judith Klein, Thierry Declerck and Günter Neumann (1998) Evaluation of the Syntactic Analysis Component of an Information Extraction System for German. Workshop on the Evaluation of Parsing Systems (LREC 1998), Granada, Spain.
This paper describes two distinct evaluation initiatives on the NLP performance of SMES (Saarbrücker Message Extraction System). The first study aimed at the improvement of SMES used as the analysis component of a German dialogue system for appointment scheduling. While the first scenario was mainly designed to test the overall analysis result, the second is concerned with a systematic evaluation of the NLP components of the core engine of the SMES system. The morphological analysis, the disambiguation ability of the POS tagger, as well as one of the various parsing modules are currently under evaluation. Suitable reference material and supporting evaluation resources are being developed in parallel.
- Günter Neumann (1998) Interleaving Natural Language Parsing and Generation Through Uniform Processing. Journal Artificial Intelligence 99, 121-163.
We present a new model of natural language processing in which natural language parsing and generation are strongly interleaved tasks. Interleaving of parsing and generation is important if we assume that natural language understanding and production are not only performed in isolation but also work together to obtain subsentential interactions in text revision or dialog systems.
The core of the model is a new uniform agenda-driven tabular algorithm, called UTA. Although uniformly defined, UTA is able to configure itself dynamically for either parsing or generation, because it is fully driven by the structure of the actual input—a string for parsing and a semantic expression for generation.
Efficient interleaving of parsing and generation is obtained through item sharing between parsing and generation. This novel processing strategy facilitates the automatic exchange of items (i.e., partial results) computed in one direction to the other direction as well.
The advantage of UTA in combination with the item sharing method is that we are able to extend the use of memoization techniques to the case of an interleaved approach. In order to demonstrate UTA's utility for developing high-level performance methods, we present a new algorithm for incremental self-monitoring during natural language production.
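A very small Python sketch of the item-sharing idea, under the assumption that completed items are stored in one table under a task-dependent index (a string span for parsing, a semantic expression for generation); the keys and item format below are illustrative only, not UTA itself.

```python
# Illustrative sketch: one table of completed items, indexed by a
# task-dependent key, so items built in one direction (parsing) can be
# reused by the other (generation).
class ItemTable:
    def __init__(self):
        self.items = {}      # key -> list of (category, string, semantics)

    def add(self, key, item):
        self.items.setdefault(key, []).append(item)

    def get(self, key):
        return self.items.get(key, [])

table = ItemTable()

# parsing direction: index by the covered span of the input string
np_item = ("NP", "the book", "def(book)")
table.add(("span", 0, 2), np_item)

# the same item is also indexed by its semantic expression, since
# generation is driven by semantics; this is the shared part
table.add(("sem", "def(book)"), np_item)

# later, the generator asks for a phrase realizing def(book) and finds
# the item the parser has already built
print(table.get(("sem", "def(book)")))
```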
- Günter Neumann (1998) Automatic Extraction of Stochastic Lexicalized Tree Grammars from Treebanks. Fourth workshop on tree-adjoining grammars and related frameworks, Philadelphia, PA, USA.
We present a method for the extraction of stochastic lexicalized tree grammars (S-LTG) of different complexities from existing treebanks, which allows us to analyze the relationship of a grammar automatically induced from a treebank wrt. its size, its complexity, and its predictive power on unseen data. Processing of different S-LTG is performed by a stochastic version of the two-step Earley-based parsing strategy introduced in [Schabes and Joshi, 1998].
- Gregor Erbach, Günter Neumann and Hans Uszkoreit (1997) MULINEX - Multilingual Indexing, Editing and Navigation Extensions for the World Wide Web. Cross-Language Text and Speech Retrieval - Papers from the 1997 AAAI Spring Symposium, Pages 22-28, Menlo Park, AAAI Press
This paper gives an overview of the project MULINEX, which is a "leading-edge application project" funded in the Telematics Application Program (Language Engineering Sector) of the European Union. The goal of the project is the development of a set of tools to allow cross-language text retrieval for the WWW, concept-based indexing, navigation tools and web-site management facilities for multilingual WWW sites. The project takes a user-centered approach in which the user needs drive the development activities and set the research agenda.
- Günter Neumann (1997) Applying Explanation-based Learning to Control and Speeding-up Natural Language Generation. 35th Annual Meeting of the Association for Computational Linguistics (ACL), Madrid, Spain.
This paper presents a method for the automatic extraction of sub-grammars for controlling and speeding up natural language generation (NLG). The method is based on explanation-based learning (EBL). The main advantage of the proposed method for NLG is that the complexity of the grammatical decision-making process during NLG can be vastly reduced, because the EBL method supports the adaptation of an NLG system to a particular use of a language.
- Günter Neumann, Rolf Backofen, Judith Baur, Markus Becker and Christian Braun (1997) An Information Extraction Core System for Real World German Text Processing. Fifth Applied Natural Language Processing (ANLP), Washington, USA.
This paper describes SMES, an information extraction core system for real-world German text processing. The basic design criterion of the system is to provide a set of powerful, robust, and efficient natural language components and generic linguistic knowledge sources which can easily be customized for processing different tasks in a flexible manner.
About the SMES software package
The complete system SMES including all components can be downloaded from SMES' download page.
- Günter Neumann (1997) Methoden zur intelligenten Informationsextraktion im Internet. European Congress Fair for Technical Communication, ONLINE '97, Hamburg.
In this article we present the IE system SMES, an IE core system developed at DFKI GmbH for processing electronic texts, especially in German. The central development goal is the easy portability and adaptation of the core system to IE applications of different domains and complexity. In this article we concentrate on describing the central technologies and their implementation, which were developed to achieve the intended portability. We only briefly discuss some application systems realized on the basis of the core system (cf. Section 7).
- Günter Neumann (1997) An On-line Learning Method to Speed-up Natural Language Processing. Technical report, DFKI GmbH.
We present an on-line method for automatic sub-grammar extraction based on explanation-based learning, in which the source grammar processor and the sub-grammar processor are strictly interleaved, allowing sub-sentential interactions. Informally, the new method can be considered an intelligent storage unit for example-based generalized parts of the grammatical search space determined through training by the source grammar processor. Processing of similar new inputs is then reduced to simple but efficient lookup and matching operations, which circumvent re-computation of this already known search space. The method is uniform in the sense that it can be used for both parsing and generation. It has been fully implemented and tested with broad-coverage grammars for English, German, and Japanese.
- Gertjan van Noord and Günter Neumann (1997) Syntactic Generation. Survey of the State of the Art in Human Language Technology, Cambridge University Press, Cambridge, UK.
In a natural language generation module, we often distinguish two components. On the one hand, it needs to be decided what should be said. This task is delegated to a planning component. Such a component might produce an expression representing the content of the proposed utterance. On the basis of this representation, the syntactic generation component produces the actual output sentence(s). Although the distinction between planning and syntactic generation is not uncontroversial, we will nonetheless assume such an architecture here, in order to explain some of the issues that arise in syntactic generation.
A (natural language) grammar is a formal device that defines a relation between (natural language) utterances and their corresponding meanings. In practice this usually means that a grammar defines a relation between strings and logical forms. During natural language understanding, the task is to arrive at a logical form that corresponds to the input string. Syntactic generation can be described as the problem of finding the corresponding string for an input logical form. We are thus making a distinction between the grammar which defines this relation, and the procedure that computes the relation on the basis of such a grammar. In the current state of the art, unification-based (or more generally: constraint-based) formalisms are used to express such grammars, e.g., Lexical Functional Grammar (LFG) [Bre82], Head-Driven Phrase Structure Grammar (HPSG) [PS87] and constraint-based categorial frameworks (cf. [Usz86] and [ZKC87]).
Almost all modern linguistic theories assume that a natural language grammar not only describes the correct sentences of a language, but also describes the corresponding semantic structures of the grammatical sentences. Given that a grammar specifies the relation between phonology and semantics, it seems obvious that the generator is supposed to use this specification. For example, Generalized Phrase Structure Grammars (GPSG) [GKPS85] provide a detailed description of the semantic interpretation of the sentences licensed by the grammar. Thus one might assume that a generator based on GPSG constructs a sentence for a given semantic structure, according to the semantic interpretation rules of GPSG. Alternatively, [Bus90] presents a generator, based on GPSG, which does not take a logical form as its input, but rather some kind of control expression which merely instructs the grammatical component which rules of the grammar to apply. Similarly, in the conception of [GP90], a generator is provided with some kind of deep structure which can be interpreted as a control expression instructing the grammar which rules to apply. These approaches to the generation problem clearly solve some of the problems encountered in generation - simply by pushing the problem into the conceptual component (i.e., the planning component). In this overview we restrict our attention to the more ambitious approach sketched above.
The success of the currently developed constraint-based theories is due to the fact that they are purely declarative. Hence, it is an interesting objective - theoretically and practically - to use one and the same grammar for natural language understanding and generation. In fact, the potential for reversibility was a primary motivation for the introduction of Martin Kay's Functional Unification Grammar (FUG). In recent years, interest in such a reversible architecture has led to a number of publications.
- Stephan Busemann, Stephan Oepen, Elisabeth Hinkelman, Günter Neumann and Hans Uszkoreit (1994) COSMA - Multi-Participant NL Interaction for Appointment Scheduling. Research report DFKI GmbH, RR-94-34, Germany.
We describe the COSMA (Cooperative Schedule Management Agent) system, a secretarial assistant for appointment scheduling. A central part of COSMA is the reusable NL core system DISCO, which serves, in the application, as an NL interface between an appointment planning system and the human user. COSMA is fully implemented in Common Lisp and runs on Unix workstations. Our experience with COSMA shows that it is a plausible and useful application for NL systems. However, the appointment planner was not designed for NL communication and thus makes strong assumptions about the sequencing of domain actions and about the error-freeness of the communication. We suggest that further improvements of the overall COSMA functionality, especially with regard to flexibility and robustness, be based on a modified architecture.
- Robert Dale, Wolfgang Finkler, Richard Kittredge, Niels Lenke, Günter Neumann, Carola Peters and Manfred Stede (1994) Report from Working Group 2: Lexicalization and Architecture. Dagstuhl Seminar on Principles of Natural Language Generation, Dagstuhl Seminar Report 93, July 25-29.
This report summarises the results of the discussions held in Working Group 2. The group discussions focused on three reasonably independent topics, and we have organized the report to reflect this.
- Günter Neumann (1994) Application of Explanation-based Learning for Efficient Processing of Constraint-based Grammars. Tenth Conference on Artificial Intelligence for Applications, San Antonio, Texas.
The paper describes the application of Explanation-based Learning for efficient processing of constraint-based grammars. The idea is to generalize the derivations of training instances created by normal parsing automatically and to use these generalized derivations (called templates) during the run-time mode of the system. In case a template can be instantiated for a new input, no further grammatical analysis is necessary. The approach is not restricted to sentential level but can be applied on arbitrary phrases. Therefore, the EBL method can be interleaved straightforwardly with normal processing to get back flexibility that otherwise would be lost.
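The template mechanism can be illustrated with a minimal sketch, assuming that derivations are generalized to part-of-speech sequences and stored as string skeletons; the categories and toy sentences are invented, and a real system would fall back to full constraint-based parsing when no template matches.

```python
# Sketch of the template idea behind the EBL method (greatly simplified):
# a training derivation is generalized to its POS sequence and stored; a
# new input with the same sequence is handled by instantiating the stored
# template instead of re-parsing.
templates = {}

def train(pos_seq, derivation_skeleton):
    """Store the generalized derivation under its POS sequence."""
    templates[tuple(pos_seq)] = derivation_skeleton

def apply(words, pos_seq):
    """Instantiate a stored template for a new input, if one matches."""
    skeleton = templates.get(tuple(pos_seq))
    if skeleton is None:
        return None                      # fall back to full parsing
    return skeleton.format(*words)       # plug the new words into the slots

# training instance: "Kim sees Sandy"
train(["NP", "V", "NP"], "(S (NP {0}) (VP (V {1}) (NP {2})))")

# run time: a structurally similar input is handled by lookup + matching
print(apply(["Lee", "likes", "Kim"], ["NP", "V", "NP"]))
```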
- Günter Neumann (1994) A Uniform Computational Model for Natural Language Parsing and Generation. PhD thesis, Saarland University, Saarbrücken.
A novel uniform tabular algorithm is presented that can be used for efficient parsing and generation of constraint-based grammars without the need for compilation. The most important properties of the algorithm are:
- Earley deduction. The control logic of the new algorithm is based on Earley deduction, i.e., it realizes a mixed top-down/bottom-up behavior. The new algorithm is the first one able to apply this strategy to parsing and generation in a truly symmetric and balanced way, and consequently it terminates on a larger class of reversible grammars.
- Dynamic selection function. The uniform algorithm uses a dynamic selection function to determine the element to process next on the basis of the current portion of the input - a string for parsing and a semantic expression for generation. This enables us, for example, to obtain a left-to-right control regime in the case of parsing and a semantic functor driven regime in the case of generation when processing the same grammar by means of the same underlying algorithm.
- Uniform indexing technique. The same basic mechanism is used for parsing and generation, but parameterized with respect to the information used for indexing partial results. The kind of index determines the state sets in which completed information is placed. Using this mechanism we can benefit from a table-driven view of generation, similar to that of parsing. For example, by using a semantics-oriented indexing mechanism during generation, massive redundancies are avoided, because once a phrase is generated, we are able to use it in any position within a sentence.
- Item sharing. We present a new method of grammatical processing which we term item sharing. The basic idea is that items computed in one direction are automatically made available for the other direction as well. The item sharing approach is based on the uniform indexing technique mentioned above and is realized as a straightforward extension of the uniform tabular algorithm. The relevance of this novel method is demonstrated when the new performance model is presented.
Since the only relevant parameter of our uniform tabular algorithm with respect to parsing and generation is the input structure, the basic difference between parsing and generation reduces to the different input structures. This may seem trivial; however, our approach is the first uniform algorithm that is able to adapt dynamically to the data, achieving a maximal degree of uniformity for parsing and generation under a task-oriented view.
- Günter Neumann and Gertjan van Noord (1994) Reversibility and Self-Monitoring in Natural Language Generation. Reversible Grammars in Natural Language Processing, Kluwer, 59-95.
This paper shows how the use of reversible grammars may lead to efficient and flexible natural language parsing and generation systems. In particular a mechanism is described which ensures that only non-ambiguous utterances are produced. This mechanism uses the parsing component to monitor the generation component. The relevant communication between the two components is performed using derivation trees. For this reason the proposed mechanism only makes sense for systems in which a single grammar is used for both parsing and generation. Furthermore we define a variant of the monitoring strategy which can be used to paraphrase a given input sentence (for interactive disambiguation). In this case, the generation component is used to guide the parsing system. Again the proposed technique is possible only in the case of a single, reversible grammar.
- Hans Uszkoreit, Rolf Backofen, Stephan Busemann, Kader Diagne, Elisabeth Hinkelman, Walter Kasper, Bernd Kiefer, Hans-Ulrich Krieger, Klaus Netter, Günter Neumann, Stephan Oepen and Steven Spackman (1994) DISCO - An HPSG-based NLP System and its Application for Appointment Scheduling (Project Note). Fifteenth International Conference on Computational Linguistics (COLING-94), Kyoto, Japan, 463-440.
The natural language system DISCO is described. It combines a powerful and flexible grammar development system; linguistic competence for German including morphology, syntax and semantics; new methods for linguistic performance modeling on the basis of high-level competence grammars; new methods for modelling multi-agent dialogue competence; an interesting sample application for appointment scheduling and calendar management.
- Günter Neumann (1993) Natural Language Generation: Grammar Formalisms and Their Processing. German Journal on Artificial Intelligence, Special Issue on Natural Language Generation, W. Hoeppner (Ed.) FBO, Germany.
This essay provides an overview of the grammar formalisms currently in use and how they are processed in language production systems. The material in question includes constraint-based formalisms, systemic grammars and tree-adjoining grammars. With the aid of selected studies, we demonstrate how these formalisms are used in generation. In the final part, the problem of grammar reversibility is discussed, that is, the use of one and the same grammar for parsing and generation.
- Günter Neumann (1993) Design Principles of the Disco System. Twente Workshop on Language Technology (TWLT 5). Natural Language Interfaces, June 3-4, University of Twente, Enschede, The Netherlands.
In this paper we introduce the basic design principles of the DISCO system, a natural language analysis and generation system. In particular, we describe the DISCO DEVELOPMENT SHELL, the basic tool for the integration of natural language components in the DISCO system, and its application in the COSMA (COoperative Schedule Management Agent) system.
Following an object-oriented architecture model, we introduce a two-step approach, where in the first phase the architecture is developed independently of specific components to be used and of a particular flow of control. In the second phase, the frame system is instantiated by integrating existing components as well as by defining the particular flow of control between these components. Because of the object-oriented paradigm it is easy to augment the frame system, which increases the flexibility of the whole system with respect to new applications. The development of the COSMA system serves as an example of this claim.
- Günter Neumann (1993) The DISCO development shell and its application in the cosma system. Workshop on Natural Language Systems: Modularity and Re-usability, DFKI GmbH, Germany, D-93-03.
This paper describes the DISCO DEVELOPMENT SHELL, which serves as a basic tool for the integration of natural language components in the DISCO project, and its application in the COSMA system, a Cooperative Schedule Management Agent. Following an object oriented architectural model we introduce a two-step approach, where in the first phase the architecture is developed independently of specific components to be used and of a particular flow of control. In the second phase the "frame system" is instantiated by the integration of existing components as well as by defining the particular flow of control between these components. Because of the object-oriented paradigm it is easy to augment the frame system, which increases the flexibility of the whole system with respect to new applications. The development of the COSMA system will serve as an example of this claim.
- Günter Neumann and Gertjan van Noord (1992) Self-Monitoring with Reversible Grammars. Fourteenth International Conference on Computational Linguistics (COLING-92), Nantes, France.
We describe a method and its implementation for self-monitoring during natural language generation. In situations of communication where the generation of ambiguous utterances should be avoided our method is able to compute an unambiguous utterance for a given semantic input. The proposed method is based on a very strict integration of parsing and generation. During the monitored generation step, a previously generated (possibly) ambiguous utterance is parsed and the obtained alternative derivation trees are used as a 'guide' for re-generating the utterance. To achieve such an integrated approach the underlying grammar must be reversible.
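The monitoring loop can be sketched as follows; the generate and parse functions are stubs with invented data, and the derivation-tree-guided re-generation described in the paper is only approximated here by excluding previously rejected utterances.

```python
# Schematic sketch of monitored generation: generate a candidate, parse it
# with the same (reversible) grammar, and re-generate if it is ambiguous.
# The grammar functions below are toy stubs, not the actual system.
def generate(semantics, avoid=()):
    """Return candidate utterances for a semantic input, skipping any
    utterance we were told to avoid."""
    candidates = {"see(duck)": ["I saw her duck", "I saw the duck she owns"]}
    return [u for u in candidates.get(semantics, []) if u not in avoid]

def parse(utterance):
    """Return all readings (derivations) of an utterance."""
    readings = {"I saw her duck": ["possessive-NP", "small-clause"],
                "I saw the duck she owns": ["possessive-NP"]}
    return readings.get(utterance, [])

def monitored_generation(semantics):
    avoid = set()
    while True:
        candidates = generate(semantics, avoid)
        if not candidates:
            return None                     # no unambiguous realization
        utterance = candidates[0]
        if len(parse(utterance)) == 1:      # unambiguous -> done
            return utterance
        avoid.add(utterance)                # ambiguous -> try another

print(monitored_generation("see(duck)"))    # -> "I saw the duck she owns"
```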
- Günter Neumann (1991) Reversibility and Modularity in Natural Language Generation. ACL Workshop on Reversible Grammar in Natural Language Processing, Berkeley, CA, 31-39.
A consistent use of reversible grammars within natural language generation systems has strong implications for the separation into strategic and tactical components. A central goal of this paper is to make plausible that a uniform architecture for grammatical processing can serve as a basis for achieving more flexible and efficient generation systems.
- Günter Neumann (1991) A Bidirectional Model for Natural Language Processing. Fifth Conference of the European Chapter of the Association for Computational Linguistics (EACL), Berlin, Germany, 1991, 245-250.
In this paper I will argue for a model of grammatical processing that is based on uniform processing and knowledge sources. The main feature of this model is to view parsing and generation as two strongly interleaved tasks performed by a single parametrized deduction process. It will be shown that this view supports flexible and efficient natural language processing.
- Günter Neumann and Wolfgang Finkler (1990) A Head-Driven Approach to Incremental and Parallel Generation of Syntactic Structures. Thirteenth International Conference on Computational Linguistics (COLING-90), Helsinki, Finland, Europe, 288-293.
This paper describes the construction of syntactic structures within an incremental multi-level and parallel generation system. Incremental and parallel generation imposes special requirements upon syntactic description and processing. A head-driven grammar represented in a unification-based formalism is introduced which satisfies these demands. Furthermore the basic mechanisms for the parallel processing of syntactic segments are presented.
- Wolfgang Finkler and Günter Neumann (1989) POPEL-HOW: A Distributed Parallel Model for Incremental Natural Language Production with Feedback. 11th International Joint Conference on Artificial Intelligence (IJCAI), Detroit, MI, USA, Volume 2, 1518-1523.
This paper presents a new model for the production of natural language. The novel idea is to combine incremental and bidirectional generation with parallelism. The operational basis of our model is a distributed parallel system at every level of representation. The starting points of production are segments of the conceptual level. These segments are related to active objects which have to map themselves across the linguistic levels (i.e., functional-semantic, syntactic and morphological level) as fast and as independently as possible. Therefore the incremental behavior caused by a successive input is propagated on all levels. Linguistic requirements detected by objects which are related to already produced segments at any level influence and restrict the decision of what to say next.
- Günter Neumann (1989) POPEL-HOW: Parallele, inkrementelle Generierung natürlichsprachlicher Sätze aus konzeptuellen Einheiten. Diploma Thesis, Computer Science, Saarland University, Saarbrücken, Germany.
This thesis was written in the context of the Collaborative Research Center (Sonderforschungsbereich) 314, subproject N1: XTRA (eXpert TRAnslator), a natural language access system to expert systems. An important task within this project is the design of the generation component POPEL (Production Of {Perhaps, Possibly, ...} Eloquent Language), which is responsible for the dialogue-specific production of linguistic expressions within XTRA. Its two central tasks are: 1) deciding what should be said and 2) how it should be said.
In POPEL, both tasks are handled by separate but interacting components - POPEL-WHAT and POPEL-HOW. The flow of communication between these components is bidirectional and in particular allows an incremental, interaction-based generation process between content determination and content realization. From this, and from the embedding of POPEL in a dialogue system, the following special requirements arise for the architecture and the operation of the content realization component POPEL-HOW:
- As many knowledge sources of the overall system as possible should be used.
- Dialogue-specific utterances must be produced.
- The input structures are partial structures or segments of the conceptual description level of the overall system.
- The incremental generation process requires that the segments be transformed across all representation levels of the system (conceptual, semantic, syntactic, and morphological level) as quickly as possible, so that they can be uttered immediately.
- Since the generated partial structures may be underspecified - which would endanger the rapid mapping - it must be possible to request the missing information from the superordinate level.
- It follows that the generation process in POPEL-HOW is incremental and bidirectional at every representation level.
- In particular, feedback to the content determination component POPEL-WHAT is possible, i.e., POPEL-HOW must be able to communicate with POPEL-WHAT in a targeted way.
The particular nature of the generation process - incremental and bidirectional - suggests using a distributed parallel model as the operational basis at every level. This makes it possible to assign the individual segments to cooperating processes; each process is then responsible for carrying out the verbalization of its segment. However, this presupposes that the knowledge sources used can be meaningfully segmented, distributed, and related to each other.
POPEL-HOW is the first content realization system that carries out an incremental and bidirectional verbalization process across several levels and uses a distributed parallel model as its operational basis. Previous incremental systems, such as [Kempen:88], do not model a bidirectional flow of communication, or do not explicitly address incremental verbalization for the content realization component (cf. [Appelt:85], [Hovy:87]).
- Wolfgang Finkler and Günter Neumann (1988) MORPHIX. A Fast Realization of a Classification-Based Approach to Morphology. Fourth Österreichische Artificial-Intelligence-Tagung. Wiener Workshop - Wissensbasierte Sprachverarbeitung. Proceedings. Berlin etc.: Springer, 11-19.
This paper presents an alternative approach to the use of finite state automata for morphology in inflectional languages. The essential feature is the use of the morphological regularities of these languages to define a fine-grained, word-class-specific sub-classification. Morphological analysis and generation can be performed at the level of this classification by means of simple operations on n-ary trees. This approach has been implemented in the package MORPHIX, which handles all inflectional phenomena of the German language.
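A simplified sketch of the classification-based idea, assuming a tiny invented fragment of inflection classes and endings (not MORPHIX's actual classification): analysis and generation reduce to lookups in the ending table of a stem's class.

```python
# Illustrative fragment only: stems are assigned to fine-grained inflection
# classes, and analysis/generation become simple table lookups.
CLASSES = {
    "weak-verb": {"e": {"person": 1, "number": "sg"},
                  "st": {"person": 2, "number": "sg"},
                  "t": {"person": 3, "number": "sg"}},
}
LEXICON = {"kauf": "weak-verb"}     # stem -> inflection class

def analyze(wordform):
    """Find (stem, features) pairs whose class licenses the ending."""
    results = []
    for stem, cls in LEXICON.items():
        if wordform.startswith(stem):
            ending = wordform[len(stem):]
            feats = CLASSES[cls].get(ending)
            if feats is not None:
                results.append((stem, feats))
    return results

def generate(stem, person, number):
    """Produce the word form for a stem and the requested features."""
    cls = CLASSES[LEXICON[stem]]
    for ending, feats in cls.items():
        if feats == {"person": person, "number": number}:
            return stem + ending

print(analyze("kaufst"))            # -> [('kauf', {'person': 2, 'number': 'sg'})]
print(generate("kauf", 3, "sg"))    # -> "kauft"
```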
About the software package
Morphix is a very fast and robust morphological component for German (more than 10,000 words per second on standard PC hardware, based on a stem lexicon of 120,000 entries). Besides inflectional analysis, it analyses compounds online and is also able to generate word forms from a given stem entry and some further (optional) morpho-syntactic information. Morphix is implemented in Allegro Common Lisp and should run under all operating systems which support ANSI CL.
Morphix is still in active use and has been disseminated to a large number of outside universities and institutes (among others CMU, CSLI Stanford, DEC, Siemens, Eurospider, and the universities of Karlsruhe, Osnabrück, and Zürich). It was also used for morphological analysis in the Verbmobil project, and it is part of KBML, a development tool for NL generation systems (University of Bremen), and of the LKB development system for HPSG-like grammars (Stanford University and University of Cambridge, UK).
The software package is available from the Morphix web page
© Günter Neumann