Synthesis of emotional speech

Marc Schröder, Felix Burkhardt, Sacha Krstulović

In: Klaus R. Scherer, Tanja Bänziger, Etienne B. Roesch (eds.). Blueprint for Affective Computing. Pages 222-231. Oxford University Press, Oxford, UK, 2010.


This chapter describes the state of the technology available for generating synthetic speech with emotions, and proposes a possible methodology for establishing a control structure for speech acoustics in terms of non-trivial models of affect. We first review the history of speech synthesis, briefly characterising the successive generations of synthesis technologies, from formant synthesis through diphone concatenation and unit selection synthesis to recent statistical-model-based synthesis technologies. For the two major current technologies, unit selection and statistical synthesis, we describe approaches for generating expressive speech. We also comment on the potential of articulatory speech synthesis, which, despite the limited quality that state-of-the-art systems can reach, is scientifically promising, since it makes explicit the relation between physiological changes and the resulting acoustic effects. In the second part, we discuss the problem of predicting voice parameters for a given affective state. Since current technological approaches are data-driven, it is necessary first to record a suitably variable speech corpus from a single speaker. This corpus must be annotated in terms of the affective states that are to be used as control parameters for the voice, and the speech acoustics must be appropriately described. Finally, the two layers must be related to one another, e.g. by machine learning algorithms, to allow speech acoustics to be predicted from the affective states.


Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence