Skip to main content Skip to main navigation


Linguistically-augmented Perplexity-based Data Selection for Language Models

Antonio Toral; Pavel Pecina; Longyue Wan; Josef van Genabith
In: Computer Speech & Language (CSL), Vol. 32, No. 1, Pages 11-26, Academic Press, 7/2015.


This paper explores the use of linguistic information for the selection of data to train language models. We depart from the state-of-the-art method in perplexity-based data selection and extend it in order to use word-level linguistic units (i.e. lemmas, named entity categories and part-of-speech tags) instead of surface forms. We then present two methods that combine the different types of linguistic knowledge as well as the surface forms (1, naïve selection of the top ranked sentences selected by each method; 2, linear interpolation of the datasets selected by the different methods). The paper presents detailed results and analysis for four languages with different levels of morphologic complexity (English, Spanish, Czech and Chinese). The interpolation-based combination outperforms the purely statistical baseline in all the scenarios, resulting in language models with lower perplexity. In relative terms the improvements are similar regardless of the language, with perplexity reductions achieved in the range 7.72–13.02%. In absolute terms the reduction is higher for languages with high type-token ratio (Chinese, 202.16) or rich morphology (Czech, 81.53) and lower for the remaining languages, Spanish (55.2) and English (34.43 on the English side of the same parallel dataset as for Czech and 61.90 on the same parallel dataset as for Spanish).

Weitere Links