An HMM/DNN comparison for synchronized text-to-speech and tongue motion synthesis

Sébastien Le Maguer; Ingmar Steiner; Alexander Hewer

In: Interspeech 2017. Conference in the Annual Series of Interspeech Events (INTERSPEECH-2017), August 20-24, Stockholm, Sweden, Pages 239-243, ISCA, 2017.


We present an end-to-end text-to-speech (TTS) synthesis system that generates audio and synchronized tongue motion directly from text. This is achieved by adapting a statistical shape space model of the tongue surface to an articulatory speech corpus and training a speech synthesis system directly on the tongue model parameter weights. We focus our analysis on the application of two standard methodologies, based on Hidden Markov Models (HMMs) and Deep Neural Networks (DNNs), respectively, to train both acoustic models and the tongue model parameter weights. We evaluate both methodologies at every step by comparing the predicted articulatory movements against the reference data. The results show that even with less than 2h of data, DNNs already outperform HMMs.
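The comparison of predicted articulatory movements against reference data can be illustrated with a small sketch. The abstract does not name the error measure used, so the root-mean-square error below is an assumption, and the function name `rmse`, the frame layout (one list of tongue model parameter weights per frame), and all numeric values are hypothetical.

```python
import math


def rmse(predicted, reference):
    """Root-mean-square error between two parameter trajectories.

    Each trajectory is a list of frames, and each frame is a list of
    tongue model parameter weights (hypothetical layout, assumed here).
    """
    if len(predicted) != len(reference):
        raise ValueError("trajectories must have the same number of frames")

    squared_error = 0.0
    count = 0
    for pred_frame, ref_frame in zip(predicted, reference):
        if len(pred_frame) != len(ref_frame):
            raise ValueError("frames must have the same number of weights")
        for p, r in zip(pred_frame, ref_frame):
            squared_error += (p - r) ** 2
            count += 1

    if count == 0:
        raise ValueError("trajectories must not be empty")
    return math.sqrt(squared_error / count)


# Made-up illustrative values, not data from the paper.
reference = [[0.0, 1.0], [0.5, 0.5]]
prediction_a = [[0.0, 0.0], [0.5, 0.5]]
prediction_b = [[0.0, 1.0], [0.5, 0.5]]

error_a = rmse(prediction_a, reference)
error_b = rmse(prediction_b, reference)
```

Under this assumed measure, the system whose predictions give the lower error over held-out utterances would be judged closer to the reference articulation.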


Further Links

IS2017a.pdf (pdf, 204 KB)

German Research Center for Artificial Intelligence