DFKI-LT - Dissertation Series
Brigitte Krenn: The Usual Suspects: Data-Oriented Models for Identification and Representation of Lexical Collocations
price: € 13
The work presented here contributes to the solution of two major problems in handling lexical collocations: the insufficiency of purely frequency-based or statistics-based approaches to collocation identification, and the inadequacy of competence-grammatical analyses and descriptions of collocations. By lexical collocations we mean lexically determined word combinations within particular syntactic structures.
In the thesis, computational linguistics methods and tools are provided for
- identifying collocations from arbitrary text,
- representing collocations in a relational database that combines competence-based descriptions with data on language usage.
Existing approaches to modeling the lexical association between words are discussed, and new strategies for identifying collocations in large text corpora are presented. Linguistic information specific to the collocation type is used in two ways. On the one hand, the numeric notion of span is replaced by a syntactic one. On the other hand, phrase entropy is employed to model the grammatical rigidity of collocation phrases, and potential collocates are used as lexical keys.
Experiments are set up to test the feasibility of the individual lexical association models and of combinations of these models. The models are tested on the identification of German preposition-noun-verb collocations. The results show that how well the models identify collocations depends on factors such as the lexical variation in the corpus from which the collocations are identified, the co-occurrence frequency of the candidate data, and the class of collocations to be identified.
With respect to the representation of collocations, a relational model is defined in which competence-based descriptions of collocations at the lexical, morphological and syntactic levels are combined on a large scale with actual occurrences of collocations in text corpora. Strategies and tools for a semi-automatic implementation of the database are described.
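The use of phrase entropy as a measure of grammatical rigidity, mentioned above, can be illustrated with a minimal sketch. The abstract does not give the exact definition used in the thesis, so the code below assumes plain Shannon entropy over the observed surface realizations of a phrase; the function name `phrase_entropy` and the example data are hypothetical and serve illustration only.

```python
import math
from collections import Counter


def phrase_entropy(realizations):
    """Shannon entropy (in bits) over the observed surface forms of a phrase.

    Assumption: low entropy is read as high grammatical rigidity,
    i.e. the phrase almost always occurs in the same form.
    """
    counts = Counter(realizations)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical observations: a phrase that always appears in one form ...
rigid = ["zur Verfügung"] * 8
# ... versus a phrase whose determiner and case vary freely.
flexible = [
    "auf dem Tisch",
    "auf den Tisch",
    "auf einem Tisch",
    "auf diesem Tisch",
] * 2

rigid_score = phrase_entropy(rigid)        # a single form: no variation
flexible_score = phrase_entropy(flexible)  # four equally likely forms
```

Under this assumption, a candidate whose prepositional phrase shows little variation would score near zero and could be ranked as a more likely collocation than a freely varying one.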