Marc Schröder, SS 2007
| Date | Name | Topic |
|---|---|---|
| We 02 May 07 | Marc Schröder | |
| We 09 May 07 | Marc Schröder | Introduction to HMM-based synthesis – see slides from Black, Zen & Tokuda (ICASSP, 2007): Statistical parametric speech synthesis |
| We 16 May 07 | Tobi Kellner | Unit selection: Target costs + join costs |
| We 23 May 07 | Ekaterina Biehl | Non-uniform unit selection |
| We 30 May 07 | Benjamin Roth | HMM-based speech synthesis |
| We 06 June 07 | (no course) | (no course) |
| We 13 June 07 | Lisa Beinborn | Corpus design and coverage |
| We 20 June 07 | | (hands-on session) |
| We 27 June 07 | Jennifer Moore | Conditional Random Fields in unit selection |
| We 04 July 07 | Markus Dräger | Trainable parametric models |
| We 11 July 07 | Mat Wilson | Direct prosody prediction and prosodic unit selection |
| We 18 July 07 | | (hands-on session) |

For Hauptseminar/Master level, presenters are expected to prepare a relatively broad overview of a topic; the following literature references are starting points, not necessarily an exhaustive list of the papers needed to prepare the presentation.
Hunt, A. and Black, A. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. Proceedings of ICASSP 96, vol. 1, pp. 373-376, Atlanta, Georgia. http://www.cs.cmu.edu/~awb/papers/icassp96.pdf
Black, A. and Taylor, P. (1997). Automatically clustering similar units for unit selection in speech synthesis. Proceedings of Eurospeech 97, vol. 2, pp. 601-604, Rhodes, Greece. http://www.cs.cmu.edu/~awb/papers/ES97units.pdf
Taylor, P. (2006). The Target Cost Formulation in Unit Selection Speech Synthesis. Proc. Interspeech 2006, Pittsburgh, PA, USA.
Vepa, J. and King, S. (2004). Join cost for unit selection speech synthesis. In: Text to Speech Synthesis: Recent Paradigms and Advances (S. Narayanan and A. Alwan, eds.), pp. 35-62.
Wouters, J. and Macon, M. W. (1998). A perceptual evaluation of distance measures for concatenative speech synthesis. Proc. ICSLP 1998, Sydney, Australia. http://speech.bme.ogi.edu/publications/ps/wouters_ICSLP_98.ps.gz
Pantazis, Y., Stylianou, Y., and Klabbers, E. (2005). Discontinuity detection in concatenated speech synthesis based on nonlinear speech analysis. Proc. Interspeech 2005, Lisbon, Portugal, pp. 2817-2820. http://www.bme.ogi.edu/~klabbers/papers/IS051061.pdf
Black, A. and Lenzo, K. (2001). Optimal Data Selection for Unit Selection Synthesis. Proc. 4th Speech Synthesis Workshop, Scotland, pp. 63-67. http://www.cs.cmu.edu/~awb/papers/ISCA01/select.ps
Krul, A., Damnati, G., Yvon, F., and Moudenc, T. (2006). Corpus design based on the Kullback-Leibler divergence for text-to-speech synthesis application. Proc. Interspeech 2006, Pittsburgh, PA, USA.
Bozkurt, B., Dutoit, T., and Ozturk, O. (2003). Text Design For TTS Speech Corpus Building Using A Modified Greedy Selection. Proc. EUROSPEECH, European Conference on Speech Communication and Technology, Geneva, pp. 277-280. http://tcts.fpms.ac.be/publications/papers/2003/eurospeech03_bbootd.pdf
Schweitzer, A., Braunschweiler, N., Klankert, T., Möbius, B. and Säuberlich, B. (2003). Restricted Unlimited Domain Synthesis. Proceedings of Eurospeech 2003, Geneva, Switzerland, vol. 2, pp. 1321-1324. http://www.ims.uni-stuttgart.de/~schweitz/docs/eurospeech03.ps
Chu, M., Peng, H., Yang, H., and Chang, E. (2001). Selecting non-uniform units from a very large corpus for concatenative speech synthesizer. Proc. ICASSP'01, Salt Lake City, UT, USA. http://research.microsoft.com/users/minchu/pdfs/pap1311.pdf
Yang, J., Zhao, Z., Jiang, Y., Hu, G., and Wo, X. (2006). Multi-tier Non-uniform Unit Selection for Corpus-based Speech Synthesis. Proc. Blizzard Challenge 2006, Pittsburgh, PA, USA. http://festvox.org/blizzard/bc2006/iflytek_blizzard2006.pdf
Tokuda, K., Zen, H. and Black, A. (2004). An HMM-based approach to multilingual speech synthesis. In: Text to Speech Synthesis: Recent Paradigms and Advances (S. Narayanan and A. Alwan, eds.), pp. 135-153.
Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., and Kitamura, T. (1999). Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. Proc. of Eurospeech, pp. 2347-2350. http://hts.sp.nitech.ac.jp/?plugin=attach&refer=Publications&openfile=yoshimura_eurospeech1999.pdf
Black, A.W., Zen, H., and Tokuda, K. (2007). Statistical parametric speech synthesis. Proc. of ICASSP, pp. 1229-1232. http://www.sp.nitech.ac.jp/~zen/english/index.php?plugin=attach&refer=Publications%2FInternational%20conferences&openfile=awb-icassp07.pdf
Strom, V. (2002). From text to prosody without ToBI. In Proc. ICSLP-2002, Denver, CO, USA, pp. 2081-2084. http://www.cstr.ed.ac.uk/downloads/publications/2002/paper.icslp02.ps
Eide, E., Aaron, A., Bakis, R., Cohen, R., Donovan, R., Hamza, W., Mathes, T., Picheny, M., Polkosky, M., Smith, M., and Viswanathan, M. (2003). Recent improvements to the IBM trainable speech synthesis system. Proc. ICASSP'03, vol. 1, pp. 708-711.
Raux, A. and Black, A. (2003). A Unit Selection Approach to F0 Modeling and Its Application to Emphasis. ASRU 2003, St. Thomas, US Virgin Islands. http://www.cs.cmu.edu/~awb/papers/asru2003/f0clunits.pdf
Meron, J. (2001). Prosodic unit selection using an imitation speech database. Proc. 4th ISCA Speech synthesis workshop (SSW4), Perthshire, Scotland.
Escudero-Mancebo, D. and Cardeñoso-Payo, V. (2007). Applying data mining techniques to corpus based prosodic modeling. Speech Communication, 49 (3), pp. 213-229.
Taylor, P. (1998). The Tilt Intonation Model. Proc. ICSLP98, Sydney, Australia. http://www.cstr.inf.ed.ac.uk/downloads/publications/1998/Taylor_1998_e.pdf
Bailly, G. and Gorisch, J. (2006). Generating German intonation with a trainable prosodic model. Proc. Interspeech 2006, Pittsburgh, PA, USA.
Möhler, G. and Conkie, A. (1998). Parametric Modeling of Intonation Using Vector Quantization. 3rd ESCA Workshop on Speech Synthesis, Jenolan Caves, Australia. http://www.ims.uni-stuttgart.de/~moehler/papers/gm_jenolan98vqinton.ps.gz
Gregory, M. L. and Altun, Y. (2004). Using conditional random fields to predict pitch accents in conversational speech. Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Barcelona, Spain. http://www.cs.brown.edu/people/altun/pubs/GregoryAltun.pdf
Weiss, C. and Hess, W. (2006). Conditional Random Fields for Hierarchical Segment Selection in Text-to-Speech Synthesis. Proc. Interspeech 2006, Pittsburgh, PA, USA.
Background literature on CRF, e.g. http://www.inference.phy.cam.ac.uk/hmw26/crf/#papers