Toward a speech synthesis guided by the modeling of unexpected events

Sébastien Le Maguer, Ingmar Steiner, Bernd Möbius

In: Antje Schweitzer, Grzegorz Dogil (Eds.). Workshop on Modeling Variability in Speech, October 1-2, 2015, Stuttgart, Germany.


Over the last 30 years, text-to-speech (TTS) methodologies have evolved from the selection of real units to the use of complex statistical modeling. However, all state-of-the-art TTS methodologies rely on descriptive features, extracted from the text, to drive the synthesis. These features are therefore as crucial as the modeling itself for improving the quality of the resulting synthesis. Currently, descriptive features are mainly derived from low-level linguistic information, such as syllable stress or the content information of the word. In this study, we want to capture prosodic effects by applying new descriptive features based on the surprisal of the syllable or the word. The concept of surprisal is borrowed from the field of information theory, where it quantifies the unpredictability of an event. In practice, it is computed as the negative log probability of an event given a specific context. Our assumption is that the higher the surprisal of an event (i.e., the occurrence of a syllable or a word), the stronger its effect on prosodic features.
The MaryTTS system [1] provides a deeply modular speech synthesis framework based on either unit selection or hidden Markov model (HMM) based speech modeling. Using this system, we can therefore study the influence of surprisal on the resulting synthesis for both of these standard methodologies. To this end, we take as our starting point the baseline descriptive feature set already used in MaryTTS. We first propose to enrich this baseline by adding (a) the surprisal of the syllable; (b) the surprisal of the word; (c) the surprisal of both the syllable and the word. In order to analyze the use of such descriptive features in place of traditional ones, we also propose two alternatives to the baseline descriptive feature set. These alternatives are obtained by (1) replacing the accent information with the surprisal of the syllable; (2) replacing the content information with the surprisal of the word.
Using these five combinations, we expect to characterize the influence of surprisal on the resulting synthesis. We also plan to assess the use of such high-level descriptive features in a speech synthesis task.
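
To make the surprisal definition and the five feature-set combinations concrete, the following is a minimal sketch and not the authors' implementation. The abstract only states that surprisal is the negative log probability of an event given a context; the bigram context, the add-alpha smoothing, the base-2 logarithm, and every feature name used here (such as `accent` and `word_content`) are assumptions chosen for illustration and are not taken from MaryTTS.

```python
import math
from collections import Counter


def train_bigram_surprisal(sequences, alpha=1.0):
    """Build a surprisal function from sequences of units (syllables or words).

    Assumptions (not from the abstract): the context is the single preceding
    unit, probabilities use add-alpha smoothing, and surprisal is in bits.
    """
    context_counts = Counter()
    bigram_counts = Counter()
    vocabulary = set()
    for sequence in sequences:
        padded = ["<s>"] + list(sequence)  # "<s>" marks the sequence start
        vocabulary.update(padded)
        for previous, unit in zip(padded, padded[1:]):
            context_counts[previous] += 1
            bigram_counts[(previous, unit)] += 1
    vocabulary_size = len(vocabulary)

    def surprisal(previous, unit):
        # Smoothed estimate of P(unit | previous), then negative log probability.
        probability = (bigram_counts[(previous, unit)] + alpha) / (
            context_counts[previous] + alpha * vocabulary_size
        )
        return -math.log2(probability)

    return surprisal


# Hypothetical placeholder names; the real baseline feature set is larger.
BASELINE_FEATURES = ["phone", "syllable_stress", "accent", "word_content"]


def feature_set_variants(baseline):
    """Return the five feature sets described in the abstract."""
    def swap(features, old, new):
        return [new if name == old else name for name in features]

    return {
        "add_syllable_surprisal": baseline + ["syllable_surprisal"],
        "add_word_surprisal": baseline + ["word_surprisal"],
        "add_both_surprisals": baseline + ["syllable_surprisal", "word_surprisal"],
        "accent_replaced": swap(baseline, "accent", "syllable_surprisal"),
        "content_replaced": swap(baseline, "word_content", "word_surprisal"),
    }


if __name__ == "__main__":
    toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
    word_surprisal = train_bigram_surprisal(toy_corpus)
    # "dog" follows "the" less often than "cat" does, so it is more surprising.
    print(word_surprisal("the", "cat"), word_surprisal("the", "dog"))
    for name, features in feature_set_variants(BASELINE_FEATURES).items():
        print(name, features)
```

In this sketch the same function serves for syllables and words, since only the unit sequences differ. A real study would likely use a language model trained on a large corpus with a longer context than one preceding unit.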


Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence