DFKI-LT - Wikinflection: Massive semi-supervised generation of multilingual inflectional corpus from Wiktionary
Proceedings of the 17th International Workshop on Treebanks and Linguistic Theories (TLT 2018),
Linköping Electronic Conference Proceedings, Oslo, Norway, Linköping University Electronic Press, Linköpings universitet, 12/2018
Wiktionary is an open, crowd-sourced dictionary that has been an important resource for natural language processing, understanding, and generation tasks, but a large portion of the available information, such as inflection, is hard to retrieve and has not been widely utilized. In this paper, we describe our efforts to generate inflectional paradigms for lemmata of the English Wiktionary, using both the dynamic links of the XML dump file and the static information of the web version. Our system can generate inflectional paradigms for 225K lemmata, with almost 8.5M forms from 1,708 inflectional templates, for over 150 languages. Evaluation of the generated output shows that 216K lemmata and around 6M forms are of high quality. In addition, we retrieve morphological features, affixes, and stem allomorphs for each paradigm and form. The system can produce a structured inflectional corpus from any version of the English Wiktionary XML dump file, and could also be adapted for other language versions. The first version of the source code is currently available online.
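To make the first step of such a pipeline concrete, here is a minimal sketch of how inflection template calls might be located in a MediaWiki XML dump. It is not the authors' implementation. The helper names, the `xx-decl-noun` template, the sample page, and the name-based heuristic in `looks_inflectional` are all assumptions made for illustration, and nested templates are deliberately ignored.

```python
import io
import re
import xml.etree.ElementTree as ET

# Matches a flat (non-nested) template call such as {{xx-decl-noun|stem|pl=stems}}.
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)((?:\|[^{}]*)?)\}\}")

# Hypothetical one-page dump; a real Wiktionary dump has the same page/title/text shape.
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>example</title><revision>"
    b"<text>==Noun==\n{{xx-decl-noun|stem|pl=stems}}\n{{other|x}}</text>"
    b"</revision></page></mediawiki>"
)


def local_name(tag):
    """Strip the XML namespace, which differs between export schema versions."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export."""
    title, text = None, ""
    for _, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # release the finished page; real dumps are large


def looks_inflectional(template_name):
    """Assumed naming heuristic, not the paper's list of inflectional templates."""
    return "-decl" in template_name or "-conj" in template_name


def iter_inflection_calls(stream):
    """Yield (title, template, positional_args, named_args) per matching call."""
    for title, text in iter_pages(stream):
        for match in TEMPLATE_RE.finditer(text):
            name = match.group(1).strip()
            if not looks_inflectional(name):
                continue
            raw = match.group(2)
            parts = raw[1:].split("|") if raw else []
            positional = [p.strip() for p in parts if "=" not in p]
            named = dict(
                (k.strip(), v.strip())
                for k, v in (p.split("=", 1) for p in parts if "=" in p)
            )
            yield title, name, positional, named


calls = list(iter_inflection_calls(io.BytesIO(SAMPLE)))
```

Streaming with `iterparse` keeps memory use bounded, which matters for a full dump. Expanding each template call into actual word forms is the hard part the paper addresses and is outside the scope of this sketch.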
Files: BibTeX, article.asp, ecp18155014.pdf