On the typology of Russian-Bulgarian cognates as statistically extracted from aligned parallel corpora.

Tania Avgustinova, Elena Paskaleva

In: Proceedings of the 7th European Conference on Formal Description of Slavic Languages. European Conference on Formal Description of Slavic Languages (FDSL) 2007.


It is commonly acknowledged that statistical methods based on frequency counts typically improve with linguistically informed tuning. For closely related languages in particular, it is essential to recognize language-pair-dependent similarities, to apply dedicated monolingual pre-processing, and to refine the measurement of string sequences and the distance between them, for example by identifying morphological derivations and morpheme collocations. In this contribution we focus on Russian and Bulgarian in order to investigate the systematic relations between them with regard to both typological closeness and linguistic divergence. The similarity scores obtained statistically by (Nakov et al. 2007) for this representative pair of Slavic languages on the basis of aligned parallel corpora provide us with appropriate data for illustrating the idea of linguistically informed re-ranking and language-pair-oriented tuning. In this context, we propose a set of rules to model the typological similarity of the languages under consideration, and then discuss the criteria applied at various linguistic levels.
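
The idea of linguistically informed re-ranking can be illustrated with a minimal sketch. It is not the rule set proposed in the paper, nor the measure of (Nakov et al. 2007). It assumes two hypothetical rewrite rules based on the well-known correspondence between East Slavic pleophony and South Slavic forms (Russian "оро"/"оло" vs. Bulgarian "ра"/"ла"), a plain normalized edit distance, and an arbitrary equal weighting of the statistical and the string-based score. All function names, the rule list, and the weight are assumptions made for illustration.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical language-pair rules (assumption, not taken from the paper):
# Russian pleophonic sequences rewritten to their Bulgarian counterparts.
# Applied blindly, such rules over-generate; a real system would restrict them.
RU_TO_BG_RULES = [("оро", "ра"), ("оло", "ла")]


def normalize_ru(word: str) -> str:
    """Apply the illustrative rewrite rules to a Russian word."""
    for src, tgt in RU_TO_BG_RULES:
        word = word.replace(src, tgt)
    return word


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to a similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def rerank(candidates: List[Tuple[str, str, float]],
           weight: float = 0.5) -> List[Tuple[str, str, float]]:
    """Combine a corpus-based score with the rule-aware string similarity.

    candidates: (russian_word, bulgarian_word, statistical_score) triples.
    weight: share given to the string similarity (arbitrary assumption).
    """
    rescored = [
        (ru, bg, (1 - weight) * score + weight * similarity(normalize_ru(ru), bg))
        for ru, bg, score in candidates
    ]
    return sorted(rescored, key=lambda t: t[2], reverse=True)


if __name__ == "__main__":
    pairs = [("город", "гора", 0.5), ("город", "град", 0.4)]
    for ru, bg, score in rerank(pairs):
        print(ru, bg, round(score, 2))
```

In this toy input the pair город/град starts with the lower statistical score but moves to the top once the rewrite rule has made the two strings identical, which is the kind of effect that language-pair-oriented tuning aims for.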

Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence