Seminar “Unit selection and HMM-based speech synthesis”

Marc Schröder, SS 2007



Date | Name | Topic
We 02 May 07 | Marc Schröder | Introduction to Unit Selection synthesis
We 09 May 07 | Marc Schröder | Introduction to HMM-based synthesis; see slides from Black, Zen & Tokuda (ICASSP, 2007): Statistical parametric speech synthesis
We 16 May 07 | Tobi Kellner | Unit selection: Target costs + join costs
We 23 May 07 | Ekaterina Biehl | Non-uniform unit selection
We 30 May 07 | Benjamin Roth | HMM-based speech synthesis
We 06 June 07 | (no course) | (no course)
We 13 June 07 | Lisa Beinborn | Corpus design and coverage
We 20 June 07 | | (hands-on session)
We 27 June 07 | Jennifer Moore | Conditional Random Fields in unit selection
We 04 July 07 | Markus Dräger | Trainable parametric models
We 11 July 07 | Mat Wilson | Direct prosody prediction and prosodic unit selection
We 18 July 07 | | (hands-on session)

Topics and Literature

At Hauptseminar/Master level, presenters are expected to prepare a relatively broad overview of their topic. The literature references below are starting points, not necessarily an exhaustive list of the papers needed to prepare the presentation.

Unit selection: Target costs

Hunt, A. and Black, A. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. Proceedings of ICASSP 96, vol. 1, pp. 373-376, Atlanta, Georgia. http://www.cs.cmu.edu/~awb/papers/icassp96.pdf

Black, A. and Taylor, P. (1997). Automatically clustering similar units for unit selection in speech synthesis. Proceedings of Eurospeech 97, vol. 2, pp. 601-604, Rhodes, Greece. http://www.cs.cmu.edu/~awb/papers/ES97units.pdf

Taylor, P. (2006). The Target Cost Formulation in Unit Selection Speech Synthesis. Proc. Interspeech 2006, Pittsburgh, PA, USA.

Unit selection: Join costs

Vepa, J. and King, S. (2004). Join cost for unit selection speech synthesis. In: Text to Speech Synthesis: Recent Paradigms and Advances (S. Narayanan and A. Alwan, eds.), pp. 35-62.

Wouters, J. and Macon, M. W. (1998). A perceptual evaluation of distance measures for concatenative speech synthesis. Proc. ICSLP 1998, Sydney, Australia. http://speech.bme.ogi.edu/publications/ps/wouters_ICSLP_98.ps.gz

Pantazis, Y., Stylianou, Y., and Klabbers, E. (2005). Discontinuity detection in concatenated speech synthesis based on nonlinear speech analysis. Proc. Interspeech 2005, Lisbon, Portugal, pp. 2817-2820. http://www.bme.ogi.edu/~klabbers/papers/IS051061.pdf

Unit selection: Corpus design and coverage

Black, A. and Lenzo, K. (2001). Optimal Data Selection for Unit Selection Synthesis. Proc. 4th Speech Synthesis Workshop, Scotland, pp. 63-67. http://www.cs.cmu.edu/~awb/papers/ISCA01/select.ps

Krul, A., Damnati, G., Yvon, F., and Moudenc, T. (2006). Corpus design based on the Kullback-Leibler divergence for text-to-speech synthesis application. Proc. Interspeech 2006, Pittsburgh, PA, USA.

Bozkurt, B., Dutoit, T., and Ozturk, O. (2003). Text Design For TTS Speech Corpus Building Using A Modified Greedy Selection. Proc. EUROSPEECH, European Conference on Speech Communication and Technology, Geneva, pp. 277-280. http://tcts.fpms.ac.be/publications/papers/2003/eurospeech03_bbootd.pdf

Non-uniform unit selection

Schweitzer, A., Braunschweiler, N., Klankert, T., Möbius, B. and Säuberlich, B. (2003). Restricted Unlimited Domain Synthesis. Proceedings of Eurospeech 2003, Geneva, Switzerland, vol. 2, pp. 1321-1324. http://www.ims.uni-stuttgart.de/~schweitz/docs/eurospeech03.ps

Chu, M., Peng, H., Yang, H., and Chang, E. (2001). Selecting non-uniform units from a very large corpus for concatenative speech synthesizer. Proc. ICASSP'01, Salt Lake City, UT, USA. http://research.microsoft.com/users/minchu/pdfs/pap1311.pdf

Yang, J., Zhao, Z., Jiang, Y., Hu, G., and Wo, X. (2006). Multi-tier Non-uniform Unit Selection for Corpus-based Speech Synthesis. Proc. Blizzard Challenge 2006, Pittsburgh, PA, USA. http://festvox.org/blizzard/bc2006/iflytek_blizzard2006.pdf

HMM-based synthesis: Overview

Tokuda, K., Zen, H. and Black, A. (2004). An HMM-based approach to multilingual speech synthesis. In: Text to Speech Synthesis: Recent Paradigms and Advances (S. Narayanan and A. Alwan, eds.), pp. 135-153.

Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., and Kitamura, T. (1999). Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. Proc. of Eurospeech, pp. 2347-2350. http://hts.sp.nitech.ac.jp/?plugin=attach&refer=Publications&openfile=yoshimura_eurospeech1999.pdf

Black, A.W., Zen, H., and Tokuda, K. (2007). Statistical parametric speech synthesis. Proc. of ICASSP, pp. 1229-1232. http://www.sp.nitech.ac.jp/~zen/english/index.php?plugin=attach&refer=Publications%2FInternational%20conferences&openfile=awb-icassp07.pdf

Statistical prosody prediction: Direct prediction

Strom, V. (2002). From text to prosody without ToBI. In Proc. ICSLP-2002, Denver, CO, USA, pp. 2081-2084. http://www.cstr.ed.ac.uk/downloads/publications/2002/paper.icslp02.ps

Eide, E., Aaron, A., Bakis, R., Cohen, R., Donovan, R., Hamza, W., Mathes, T., Picheny, M., Polkosky, M., Smith, M., and Viswanathan, M. (2003). Recent improvements to the IBM trainable speech synthesis system. Proc. ICASSP'03, vol. 1, pp. 708-711.

Prosodic unit selection

Raux, A. and Black, A. (2003). A Unit Selection Approach to F0 Modeling and Its Application to Emphasis. ASRU 2003, St Thomas, US Virgin Islands. http://www.cs.cmu.edu/~awb/papers/asru2003/f0clunits.pdf

Meron, J. (2001). Prosodic unit selection using an imitation speech database. Proc. 4th ISCA Speech synthesis workshop (SSW4), Perthshire, Scotland.

Statistical prosody prediction: Trainable parametric models

Escudero-Mancebo, D. and Cardeñoso-Payo, V. (2007). Applying data mining techniques to corpus based prosodic modeling. Speech Communication, 49 (3), pp. 213-229.

Taylor, P. (1998). The Tilt Intonation Model. Proc. ICSLP98, Sydney, Australia. http://www.cstr.inf.ed.ac.uk/downloads/publications/1998/Taylor_1998_e.pdf

Bailly, G. and Gorisch, J. (2006). Generating German intonation with a trainable prosodic model. Proc. Interspeech 2006, Pittsburgh, PA, USA.

Möhler, G. and Conkie, A. (1998). Parametric Modeling of Intonation Using Vector Quantization. 3rd ESCA Workshop on Speech Synthesis, Jenolan Caves, Australia. http://www.ims.uni-stuttgart.de/~moehler/papers/gm_jenolan98vqinton.ps.gz

Conditional Random Fields in Unit Selection

Gregory, M. L. and Altun, Y. (2004). Using conditional random fields to predict pitch accents in conversational speech. Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Barcelona, Spain. http://www.cs.brown.edu/people/altun/pubs/GregoryAltun.pdf

Taylor, P. (1998). The Tilt Intonation Model. Proc. ICSLP98, Sydney, Australia. http://www.cstr.inf.ed.ac.uk/downloads/publications/1998/Taylor_1998_e.pdf

Weiss, C. and Hess, W. (2006). Conditional Random Fields for Hierarchical Segment Selection in Text-to-Speech Synthesis. Proc. Interspeech 2006, Pittsburgh, PA, USA.

Background literature on CRF, e.g. http://www.inference.phy.cam.ac.uk/hmw26/crf/#papers