Morphix - A Fast and Portable Morphological Component for Inflectional Languages
Günter Neumann, DFKI, Saarbrücken
Morphix is a very fast and robust morphological
component for German. Besides inflectional analysis, it analyses compounds
and is also able to generate word forms from a given stem entry and some
further (optional) morpho-syntactic information. Morphix
is implemented in Allegro Common Lisp and should run under all operating
systems which support ANSI CL. Morphix
comes as a gzipped tar
file. See the text below for how to install and run it.
History of Morphix
The development of Morphix has a rich history.
A first version of Morphix was already
developed in 1986 as part of a programming course by Wolfgang Finkler
(he moved to the Deutsche Börse) and Günter
Neumann (that's me). In 1988, we improved Morphix in order to
perform morphological generation from stems.
The following are the two main publications from this era (unfortunately, no
online versions are available, and the latter is in German):
- W. Finkler, G. Neumann:
MORPHIX. A Fast Realization of a Classification-Based Approach to Morphology.
In: Trost, H. (ed.): 4. Österreichische Artificial-Intelligence-Tagung.
Wiener Workshop - Wissensbasierte Sprachverarbeitung. Proceedings.
Berlin etc.: Springer, 1988, 11-19.
- W. Finkler, G. Neumann:
MORPHIX - Ein hochportabler Lemmatisierungsmodul für das Deutsche.
FB Informatik, KI-Labor, Memo Nr. 8, Juli 1986.
The Newest Version
In 1995, I started to "re-implement" Morphix
in order to make it ready for the new research area of
information extraction systems. The major improvements made
this time are:
- The use of tries as the unique storage device for all
information in Morphix (e.g., for lexical
entries, prefixes, inflectional endings); besides the usual functionality
(insertion, retrieval, deletion), a number of more complex functions
are available, most notably a regular trie matcher.
- Online processing of compounds, which is realized by means of a trie
traversal. During traversal two-level rules are applied for recognizing
possible decompositions of the word form in question. Morphix
also supports robust processing of compounds, such that compounds are
analysed by means of the longest matching substrings found in the lexicon; e.g., the
word ``adfadfeimer'' will return the result for ``eimer'', assuming that
``adfadf'' is no legal lexical stem.
- A generic and parameterizable output interface for Morphix
has been implemented which returns a normalized feature vector
representation of the computed morpho-syntactic information. It is possible to
specify the set of relevant morpho-syntactic features, which supports lexical
and feature relaxation.
- On the basis of this new interface a very efficient specialized
solver has been implemented, which can be used in combination with
parsing techniques in order to perform unification of the
morpho-syntactic information.
In summary, the basic processing strategy employed by the new Morphix
version is built around trie traversal.
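The trie storage and the robust fallback for unknown compounds can be illustrated with a minimal Common Lisp sketch. It is not the Morphix implementation: the names NODE, TRIE-INSERT, TRIE-LOOKUP and LONGEST-KNOWN-SUFFIX are hypothetical, the stored entries are placeholder keywords, and the fallback is simplified to "longest suffix that is a known entry" (no two-level rules, no full segmentation).

```lisp
;; Illustrative sketch only; all names here are hypothetical.
(defstruct node
  (children (make-hash-table :test #'eql)) ; character -> child node
  (entry nil))                             ; payload stored where a word ends

(defun trie-insert (root word entry)
  "Store ENTRY under the string WORD, creating nodes as needed."
  (let ((node root))
    (loop for ch across word
          do (setf node
                   (or (gethash ch (node-children node))
                       (setf (gethash ch (node-children node))
                             (make-node)))))
    (setf (node-entry node) entry)
    root))

(defun trie-lookup (root word &key (start 0))
  "Return the entry stored for WORD read from index START on, or NIL."
  (let ((node root))
    (loop for i from start below (length word)
          do (setf node (gethash (char word i) (node-children node)))
             (unless node
               (return-from trie-lookup nil)))
    (node-entry node)))

(defun longest-known-suffix (root word)
  "Return the longest suffix of WORD that has an entry in the trie, or NIL."
  (loop for start from 0 below (length word)
        when (trie-lookup root word :start start)
          do (return (subseq word start))))

;; Usage: with only "eimer" and "wasser" stored, the unknown prefix
;; "adfadf" is skipped and the known suffix is found.
(defparameter *demo-trie* (make-node))
(trie-insert *demo-trie* "eimer" :noun)
(trie-insert *demo-trie* "wasser" :noun)
(longest-known-suffix *demo-trie* "adfadfeimer") ; => "eimer"
```

Scanning start positions from left to right means the first hit is automatically the longest suffix, which mirrors the ``adfadfeimer'' example in the text.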
The main references for this version are listed below (unfortunately, I am still
too busy to write down a detailed report about the new Morphix
version). Morphix is also part of the information extraction core system,
and you might check the documentation there as well.
Currently, Morphix has been used for German and Italian. An
English version is under way. Morphix is implemented
in Lisp following Standard Allegro Common Lisp. It has been tested under
Solaris, Linux, Windows 98 and Windows NT. The German version has a
broad coverage and excellent speed: 5000 words/sec without compound
handling, and 2800 words/sec with compound processing (where for each
compound all lexically possible decompositions are computed).
- G. Neumann, R. Backofen, J. Baur, M. Becker, C. Braun:
An Information Extraction Core System for Real World German Text Processing.
In Proceedings of the 5th ANLP, Washington, March 1997.
- G. Neumann and G. Mazzini:
Information Extraction (Draft Version)
DFKI, Technical Report, 1999.
About the Software
The directory structure
Morphix comes as a gzipped tar file. Unpacking it
will create the following folders:
- src: contains all the source files
- lex: contains lexical data
- bin: will contain the compiled .fasl files
- docu: contains some documentation
You will also find a file morphix-sys.lisp, a simple make file.
Before compiling and running Morphix you should
edit this file, especially the pathname variables. Follow the instructions
inside this file.
Compiling and Loading Morphix
Once you have set the correct Morphix path names, compiling and
loading proceeds as follows:
- Load the file morphix-sys.lisp into Lisp
- Compile Morphix by calling (COMPILE-MORPHIX)
- Load Morphix by calling (LOAD-MORPHIX); this will not load the lexica
- Load the lexica by calling (LOAD-MLEX); all lexicon files which are listed
in the variable *lex-file-name-list* will be loaded.
Note that Morphix
lexica must have the suffix .mlex
- All functions and variables of Morphix are defined in the package MORPHIX.
Thus, if you want to test Morphix you might call (IN-PACKAGE
:MO) after having loaded Morphix.
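Put together, the steps above might look like the following session. This is a sketch under two assumptions that are not stated in the text: that Lisp was started in the unpacked Morphix directory (so the relative file name resolves), and that the pathname variables in morphix-sys.lisp have already been edited. It only runs with the Morphix distribution present.

```lisp
;; Assumption: the current directory is the unpacked Morphix root.
(load "morphix-sys.lisp")  ; the simple make file described above
(compile-morphix)          ; compile the sources (.fasl files end up in bin)
(load-morphix)             ; load the system itself, without the lexica
(load-mlex)                ; load every lexicon named in *lex-file-name-list*
(in-package :mo)           ; Morphix functions and variables live here
(w-a "haus")               ; analyse a lowercase word form
```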
There are two ways of performing a benchmark (see also file benchmar.lisp):
- (MORPHIX-BENCHMARK): performs a run of Morphix
on a set of sentences.
Each sentence is analysed 100 times, where the best and worst 30 runs are discarded.
- (BENCHTEXT): runs a benchmark on the file bound to the variable *textpath*
The toplevel functions
Since Morphix has a long history, it also has a long list of top-level functions.
The mother of all top-level functions is (in the following I assume
you are in the package "MORPHIX"):
(WORD-ANALYSIS WORD), where WORD should be a string in lowercase (see
also file main.lisp).
(W-A WORD) is a convenient abbreviation. WORD-ANALYSIS returns a list
of readings in a very compact form.
The newest top-level functions are:
- (RT LOWER-CASE-STRING):
this will return a list of readings where each reading is a triple
of the form (STEM INFLEXION POS), where STEM
is either a string or a list of strings (in case of compounds).
- (OUTER-RT STRING):
where STRING can be an arbitrary string because it will
be scanned by a specific scanner function MORPHIX-READ (see file morphix-reader.lisp)
The output of INFLEXION can take a different
form depending on the active output interface:
- a string: the default output is a string coding the morpho-syntactic information.
If you have switched to another output function, you can re-activate this one by calling
(SET-TYPES) (see file morphix-to-fset.lisp)
- a DNF representation: call (SET-DNF)
- a feature vector representation based on the DNF form
- a DNF-based symbol: call (SET-SYM)
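Because STEM in a reading may be either a string or a list of strings, code that consumes readings usually has to normalize it first. The helper below is a small sketch of that step; READING-STEMS is a hypothetical name, and the two sample readings are invented placeholders (including their INFLEXION and POS values), not real Morphix output.

```lisp
(defun reading-stems (reading)
  "Return the stem(s) of READING, a triple (STEM INFLEXION POS), as a list."
  (destructuring-bind (stem inflexion pos) reading
    (declare (ignore inflexion pos))
    ;; A compound carries a list of stems, a simple word a single string.
    (if (listp stem) stem (list stem))))

;; Hypothetical readings, for illustration only.
(reading-stems '("eimer" "infl-placeholder" nomen))
;; => ("eimer")
(reading-stems '(("wasser" "eimer") "infl-placeholder" nomen))
;; => ("wasser" "eimer")
```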
The following is a list of global variables which are useful for controlling
the general behaviour of Morphix (see
also file globals.lisp):
- *property-retrieval*: if set to T then the result of each analysis
is cached so that next time a simple lookup is performed. In order to reset
the cache call the function (RESET-RESULTS).
- *handle-composita*: switch off/on compound processing by setting it to either NIL or T
- *all-composita*: if T compute all possible segmentations; if NIL
compute only the longest matching compounds
- *handle-unknowns*: switch off/on robust compound processing
- *all-unknown-composita*: if T then compute all possible robust segmentations
- *relevant-morphix-cats*: a list of word categories for which the processing
should be performed. Possible values are VERB, NOMEN, ADJEKTIV. The default
list is '(NOMEN ADJEKTIV)
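As a sketch of how these switches might be set in practice, the fragment below adjusts a few of them and then resets the result cache. It assumes Morphix and its lexica are already loaded, and it only uses variables whose values the text spells out. Resetting the cache after a change is a precaution of this sketch, not something the text prescribes.

```lisp
;; Assumption: Morphix and its lexica are already loaded.
(in-package :mo)

(setf *property-retrieval* t                      ; cache each result for later lookups
      *all-composita* nil                         ; only the longest matching compounds
      *relevant-morphix-cats* '(nomen adjektiv))  ; the list given in the text

;; Precaution: drop results that were cached under the previous settings.
(reset-results)
```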
Other toplevel functions:
In the file user-interface.lisp a set of file-based functions are defined
which can be used, e.g., to generate a fullform lexicon.
The main function used here is the regular trie matcher (see file dtree-matcher.lisp).
The main caller is
(DTREE:MATCH-ENTRY REGULAR-STRING LEXICON), where REGULAR-STRING is
a string containing a regular expression following standard grep syntax.
LEXICON is either mo::*stem-lexicon* or mo::*fullform-lexicon*.
For example, (dtree::match-entry "h.*" mo::*stem-lexicon*) will match
all lexical entries beginning with the letter h.
Note that the function (dtree::match-entry-help NIL) gives a description
of the possible regular syntax (see file dtree-matcher.lisp).
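To give an idea of how a pattern can be matched directly against a trie, here is a minimal, self-contained Common Lisp sketch. It is not the code in dtree-matcher.lisp: the names NODE, TRIE-INSERT and TRIE-MATCH are hypothetical, only literal characters, `.' and `.*' are supported, and the pattern is assumed to cover the whole entry (which still reproduces the "h.*" example from the text).

```lisp
;; Illustrative sketch only; all names here are hypothetical.
(defstruct node
  (children (make-hash-table :test #'eql)) ; character -> child node
  (entry nil))                             ; payload stored where a word ends

(defun trie-insert (root word entry)
  "Store ENTRY under the string WORD, creating nodes as needed."
  (let ((node root))
    (loop for ch across word
          do (setf node
                   (or (gethash ch (node-children node))
                       (setf (gethash ch (node-children node))
                             (make-node)))))
    (setf (node-entry node) entry)
    root))

(defun trie-match (root pattern)
  "Return, sorted, all stored words matched as a whole by PATTERN."
  (let ((results '()))
    (labels ((walk (node i prefix)
               (cond
                 ;; Pattern exhausted: report a hit if a word ends here.
                 ((= i (length pattern))
                  (when (node-entry node)
                    (push (coerce (reverse prefix) 'string) results)))
                 ;; `.*': either stop consuming, or consume one more character.
                 ((and (char= (char pattern i) #\.)
                       (< (1+ i) (length pattern))
                       (char= (char pattern (1+ i)) #\*))
                  (walk node (+ i 2) prefix)
                  (maphash (lambda (ch child)
                             (walk child i (cons ch prefix)))
                           (node-children node)))
                 ;; `.': any single character.
                 ((char= (char pattern i) #\.)
                  (maphash (lambda (ch child)
                             (walk child (1+ i) (cons ch prefix)))
                           (node-children node)))
                 ;; Literal character: follow the matching edge, if any.
                 (t
                  (let ((child (gethash (char pattern i)
                                        (node-children node))))
                    (when child
                      (walk child (1+ i)
                            (cons (char pattern i) prefix))))))))
      (walk root 0 '()))
    (sort (remove-duplicates results :test #'string=) #'string<)))

;; Usage with a tiny hypothetical lexicon.
(defparameter *demo-lexicon* (make-node))
(dolist (w '("haus" "hund" "maus"))
  (trie-insert *demo-lexicon* w t))
(trie-match *demo-lexicon* "h.*")  ; => ("haus" "hund")
(trie-match *demo-lexicon* ".aus") ; => ("haus" "maus")
```

Walking the pattern and the trie together means whole subtrees are skipped as soon as a literal character has no matching edge, which is the main advantage over testing every lexicon entry against the pattern one by one.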
The toplevel function for generation of word forms from stems is
defined in file morphix-gen.lisp,
which contains a description of all possible parametrizations. The
toplevel function is (GENERATE STEM CAT OTHER-KEYS) where STEM is a string,
CAT a possible word category and OTHER-KEYS appropriate
constraints. The possible values for CAT and OTHER-KEYS can be found in
file morphix-gen.lisp. If OTHER-KEYS is not specified then
default values are used.
Try: (generate "haus" 'NOUN). Note that generation does not support
the generation of compounds.