DFKI-LT - Finite-State Machines - Foundations and Applications to Text Processing and Pattern Recognition

Wojciech Skut, Jakub Piskorski, Jan Daciuk
Finite-State Machines - Foundations and Applications to Text Processing and Pattern Recognition
2 ESSLI 2005, Heriot-Watt University, Edinburgh, Scotland, 8/2005
Following decades of intense research, finite-state machines have established themselves as a formal framework for dealing with a broad range of linguistic phenomena as well as text and speech processing tasks. To name only a few: tokenization, morphology and lexica are very often implemented as finite-state transducers. Regular expressions, familiar from Perl and Unix commands such as grep or sed and equivalent to finite automata, are a convenient pattern matching tool. Finite-state machines have also been applied to parsing, spelling correction and many other tasks. Being one of the most thoroughly investigated areas of automata theory, finite-state machines are handled in depth in many brilliant computer science textbooks. So the question arises: why write yet another book on this topic? The answer lies in the specifics of the application. Natural Language Processing (NLP) and Speech Processing often require a perspective different from that of compiler design or Unix tools.

Still, there exist a number of excellent books and tutorials addressing the NLP applications of finite-state machines. However, most of them are closely tied to a specific formalism, such as the XFST toolkit (Karttunen and Beesley 2003) or the INTEX system (Silberztein 1999a). A very good synopsis of finite-state algorithms is provided by Roche and Schabes (1997), but it is limited to the 65 pages of the introductory section. As a result, the relevant knowledge is spread across a vast spectrum of textbooks, journal articles and conference papers: often excellent, but using different notational conventions and requiring different levels of mathematical or computer science background from the reader. All this makes us conclude that there is a need for a textbook explaining the basic finite-state concepts from an NLP-oriented point of view. The primary aim of this book is to convey the basic intuitions behind finite-state machines and show the readers a number of generic use cases for finite-state methods. Starting from a simple pattern recognition task at the beginning of chapter 1, we introduce different variants of finite-state machines and algorithms for manipulating them. Chapter 2 introduces regular expressions as a convenient formal notation equivalent to finite-state machines. A number of generic formalization techniques are explained in chapter 3. Finally, chapter 4 discusses implementation issues. We have attempted to keep the mathematical side of this book as simple as possible. Everyone with an undergraduate-level knowledge of set theory and function notation should be able to understand the mathematical underpinning of the concepts and algorithms presented in our book. No previous knowledge of automata theory is required, although some acquaintance with it will certainly be helpful, as will familiarity with basic NLP and Unix/Perl regular expressions. Of course, it is difficult to cover all finite-state NLP applications in a single book. Conscious of this restriction, we have decided to limit the scope of the present edition, published as a reader for a course at the 2005 European Summer School on Logic, Language and Information (ESSLLI), to finite-state automata and their applications, thus leaving out such important concepts as weighted automata and transducers. We envisage extending our book to these topics at a later time. This book could not have been possible without the support of several people and institutions. We would like to thank Matthew Aylett and Paul Taylor for their valuable comments. Also, we acknowledge the support we received from Rhetorical Systems Ltd in Edinburgh, the German Research Centre for Artificial Intelligence in Saarbr¨ucken and the Gda´nsk University of Technology. The organizing committee of the ESSLLI held in August 2005 at the Heriot-Watt University in Edinburgh provided for a publication of our book as a course reader. Special thanks go to Sue Fitt, who kindly agreed to do the proof-reading. All remaining mistakes are, of course, ours.
Files: BibTeX, fsm.pdf