Wikinflection: Massive semi-supervised generation of multilingual inflectional corpus from Wiktionary

Eleni Metheniti; Günter Neumann
In: Proceedings of the 17th International Workshop on Treebanks and Linguistic Theories (TLT 2018). International Workshop on Treebanks and Linguistic Theories (TLT-2018), December 13-14, Oslo, Norway, Linköping Electronic Conference Proceedings, ISBN 978-91-7685-137-1, Linköping University Electronic Press, Linköpings universitet, 12/2018.


Wiktionary is an open, crowd-sourced dictionary that has become an important resource for natural language processing, understanding, and generation tasks, but a large portion of the information it contains, such as inflection, is hard to retrieve and has not been widely used. In this paper, we describe our efforts to generate inflectional paradigms for lemmata of the English Wiktionary, using both the dynamic links of the XML dump file and the static information of the web version. Our system can generate inflectional paradigms for 225K lemmata, with almost 8.5M forms from 1,708 inflectional templates, for over 150 languages; after evaluating the generation, 216K lemmata and around 6M forms are of high quality. In addition, we retrieve morphological features, affixes, and stem allomorphs for each paradigm and form. The system can produce a structured inflectional corpus from any version of the English Wiktionary XML dump file, and could also be adapted to other language versions. The first version of the source code is currently available online.
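
As a rough illustration of the first step described above (locating inflection templates in the XML dump), the following sketch streams a MediaWiki-style export and lists the inflection-like templates used on each page. It is not the authors' implementation: the function names, the `TEMPLATE_RE` pattern (template names that start with a language code followed by "decl" or "conj"), and the sample page are all assumptions made for illustration, and real Wiktionary templates follow many more naming schemes.

```python
import io
import re
import xml.etree.ElementTree as ET

# Assumed naming pattern for inflection templates, e.g. "de-decl-noun-m".
# Real dumps use many other conventions; this is only illustrative.
TEMPLATE_RE = re.compile(r"\{\{([a-z]{2,3}-(?:decl|conj)[^|}]*)")


def local_name(tag):
    """Strip the XML namespace that MediaWiki exports attach to every tag."""
    return tag.rsplit("}", 1)[-1]


def iter_inflection_templates(xml_stream):
    """Yield (page title, template name) pairs from a MediaWiki XML export."""
    title = None
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            for match in TEMPLATE_RE.finditer(elem.text or ""):
                yield title, match.group(1).strip()
        elif name == "page":
            # Release the finished page; full dumps are far too large
            # to keep in memory as a single tree.
            elem.clear()


# Hypothetical one-page export used only to exercise the function.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Hund</title>
    <revision>
      <text>==German==
===Noun===
{{de-noun|m}}
====Declension====
{{de-decl-noun-m|es|e}}</text>
    </revision>
  </page>
</mediawiki>"""

pairs = list(iter_inflection_templates(io.BytesIO(SAMPLE)))
print(pairs)
```

Streaming with `iterparse` rather than loading the whole tree is a design choice that suits dump files of several gigabytes. Expanding each template into actual word forms, which the abstract attributes to the static web version, is outside the scope of this sketch.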


