Efficient Integrated Tagging of Word Constructs

Andrew Bredenkamp, Thierry Declerck, Frederik Fouvry, Bradley Music
Proceedings of the 16th International Conference on Computational Linguistics, pages 1028-1031, Copenhagen, Denmark, Morgan Kaufmann Publishers, 1996
We describe a robust text-handling component that can deal with free text in a wide range of formats and can successfully identify a wide range of phenomena, including chemical formulae, dates, numbers and proper nouns. The set of regular expressions used to capture German numbers in written form ("sechsundzwanzig") is given as an example. Proper noun "candidates" are identified by means of regular expressions and are then accepted or rejected on the basis of runtime interaction with the user. This tagging component is integrated in a large-scale grammar development environment and provides direct input to the grammatical analysis component of the system by means of "lift" rules, which convert tagged text into partial linguistic structures.
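
To make the idea of capturing written-out German numbers with regular expressions concrete, the following is a minimal sketch in Python. It is not the set of expressions used in the system described here; the names `UNITS`, `TEENS`, `TENS`, `NUMBER_RE` and `parse_german_number` are hypothetical, and the coverage is deliberately limited to the numbers 1 to 99 in lower-case standard spelling.

```python
import re

# Illustrative word lists (an assumption, not the system's actual lexicon).
UNITS = {"ein": 1, "zwei": 2, "drei": 3, "vier": 4, "fünf": 5,
         "sechs": 6, "sieben": 7, "acht": 8, "neun": 9}
TEENS = {"zehn": 10, "elf": 11, "zwölf": 12, "dreizehn": 13, "vierzehn": 14,
         "fünfzehn": 15, "sechzehn": 16, "siebzehn": 17, "achtzehn": 18,
         "neunzehn": 19}
TENS = {"zwanzig": 20, "dreißig": 30, "vierzig": 40, "fünfzig": 50,
        "sechzig": 60, "siebzig": 70, "achtzig": 80, "neunzig": 90}


def alt(words):
    """Join words into a regex alternation, longest alternatives first."""
    return "|".join(sorted(words, key=len, reverse=True))


# unit + "und" + tens (e.g. "sechsundzwanzig"), or a teen, a bare tens word,
# or a single unit ("eins" is the free-standing form of "ein").
NUMBER_RE = re.compile(
    rf"(?P<compound>(?P<unit>{alt(UNITS)})und(?P<tens>{alt(TENS)}))"
    rf"|(?P<teen>{alt(TEENS)})"
    rf"|(?P<plain_tens>{alt(TENS)})"
    rf"|(?P<single>eins|{alt(UNITS)})"
)


def parse_german_number(word):
    """Return the integer value of a written German number (1-99), or None."""
    m = NUMBER_RE.fullmatch(word.lower())
    if m is None:
        return None
    if m.group("compound"):
        return UNITS[m.group("unit")] + TENS[m.group("tens")]
    if m.group("teen"):
        return TEENS[m.group("teen")]
    if m.group("plain_tens"):
        return TENS[m.group("plain_tens")]
    single = m.group("single")
    return 1 if single == "eins" else UNITS[single]


print(parse_german_number("sechsundzwanzig"))
```

The named groups play a role loosely comparable to a tag: they record which sub-pattern matched, so that a later step can turn the match into a structured value instead of re-analysing the string.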
