[mary-users] Fwd: Issues on MaryTTS with SSML Prosody Tag

Fri Nov 4 14:31:41 CET 2016

Hi Gavin,

thanks for your detailed message!

On 04.11.16 12:57, 姚晋 wrote:
> Hello all,
>
> I am using MaryTTS in an English prosody study, but have run into the
> problems listed below. Could you help me check whether the problem lies
> in my SSML or in Mary's handling of SSML? Thanks in advance.
>
> 1. The link to the user documentation doesn't work (404 Not Found):
> http://mary.opendfki.de/trac/wiki/ProsodySpecificationSupport
> See attachment 1.

The URL works fine for me.

>
> 2. I use the MaryTTS GUI directly to synthesize speech from SSML. The
> "volume" attribute works in the Bing Speech API, but seems to have no
> effect in Mary. It does work if I change the volume directly in the
> "Audio Effects" GUI. Please refer to attachment 2.1 for the SSML, and
> 2.2 for the analysis with Praat.

The most important thing to be aware of is that SSML is a 
recommendation, not a standard, and it's up to each "vendor" whether and 
how the various SSML features are implemented (I had an insightful 
discussion with Paul Bagshaw about this a few weeks ago).

In MaryTTS, SSML is parsed and transformed into MaryXML using XSLT [^1]. 
The resulting prosody attributes are then available to the MaryTTS 
modules for processing.
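
To illustrate the mechanism, here is a minimal sketch of such an XSLT step using the standard javax.xml.transform API. The stylesheet in it is a toy written for this example, not the actual ssml-to-mary.xsl, and the class name, the un-namespaced <maryxml> output element, and the sample SSML are all assumptions of mine. It only shows that prosody attributes such as pitch and volume can survive the transformation unchanged; what happens to them afterwards is up to the modules.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SsmlXsltSketch {

    // Toy stylesheet for illustration only; NOT the real ssml-to-mary.xsl.
    // It renames <speak> to <maryxml> and copies <prosody> with all attributes.
    static final String TOY_XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:ssml='http://www.w3.org/2001/10/synthesis'"
      + " exclude-result-prefixes='ssml'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='ssml:speak'>"
      + "<maryxml><xsl:apply-templates/></maryxml>"
      + "</xsl:template>"
      + "<xsl:template match='ssml:prosody'>"
      + "<prosody><xsl:copy-of select='@*'/><xsl:apply-templates/></prosody>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the toy stylesheet to an SSML string and returns the result.
    static String transform(String ssml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(TOY_XSL)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(ssml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // Hypothetical sample input, not taken from the attachments.
        String ssml =
            "<speak version='1.0' xml:lang='en-US'"
          + " xmlns='http://www.w3.org/2001/10/synthesis'>"
          + "<prosody pitch='+20%' volume='loud'>Hello world</prosody>"
          + "</speak>";
        System.out.println(transform(ssml));
    }
}
```

Running main prints the transformed document, in which the prosody element still carries both the pitch and the volume attribute.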

As it happens, any volume attribute is completely ignored by the class 
that handles the prosody element [^2]. In other words, fine-grained 
volume control is not currently possible with MaryXML.

>
> 3. The "pitch" attribute works for the HMM-based voice, but does not
> work accurately for the unit-selection voice. Please refer to attachment
> 3.1 for the SSML, and 3.2 for the analysis with Praat.

Unit-selection synthesis, by its very nature, does not allow 
fine-grained prosody control. It concatenates the selected units from a 
voice database without modification and offers high naturalness by 
sacrificing flexibility. There have been experiments with signal 
manipulation after concatenation, but this tends to introduce 
unacceptable artifacts.

If you need fine-grained control over prosody, you are better off using 
statistical parametric synthesis or diphone synthesis (e.g., MBROLA). 
Note that the last version of MaryTTS that supported MBROLA was v4.3.1 
-- we plan to support MBROLA again in the near future, but that's still 
work in progress. In the meantime, v4.3.1 should work just fine for you 
under Windows.
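
If you end up scripting your experiments instead of using the GUI, a request to a running MaryTTS server that explicitly selects an HMM-based voice could be built roughly as follows. This is only a sketch under assumptions: the class and method names are made up for this example, the base URL assumes a server on the default port 59125, the parameter names are those of the HTTP interface as I remember it for v5.x, and cmu-slt-hsmm merely stands in for whichever HMM-based voice you have installed. The code builds the URL but does not send it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class MaryRequestSketch {

    // Builds a /process URL that asks for WAV audio from SSML input.
    // Parameter names are assumed from the MaryTTS v5.x HTTP interface.
    static String buildProcessUrl(String baseUrl, String ssml, String voice)
            throws UnsupportedEncodingException {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("INPUT_TYPE", "SSML");
        params.put("OUTPUT_TYPE", "AUDIO");
        params.put("AUDIO", "WAVE_FILE");
        params.put("LOCALE", "en_US");
        params.put("VOICE", voice);      // e.g. an HMM-based voice
        params.put("INPUT_TEXT", ssml);

        StringBuilder url = new StringBuilder(baseUrl).append("/process?");
        boolean first = true;
        for (Map.Entry<String, String> p : params.entrySet()) {
            if (!first) {
                url.append('&');
            }
            // URL-encode values so that the XML markup survives the query string.
            url.append(p.getKey()).append('=')
               .append(URLEncoder.encode(p.getValue(), "UTF-8"));
            first = false;
        }
        return url.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Hypothetical sample SSML; replace it with your own document.
        String ssml = "<speak version='1.0' xml:lang='en-US'"
                    + " xmlns='http://www.w3.org/2001/10/synthesis'>"
                    + "<prosody pitch='+20%'>Hello world</prosody></speak>";
        System.out.println(
            buildProcessUrl("http://localhost:59125", ssml, "cmu-slt-hsmm"));
    }
}
```

Swapping the voice argument for a unit-selection voice while keeping the SSML fixed would give you a like-for-like comparison in Praat.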

If you have further questions, please feel free to engage the developers 
on our issue tracker [^3] and post technical questions or bug reports as 
appropriate.

Best wishes,

-Ingmar

>
> Best Regards,
> Gavin

[^1]: https://github.com/marytts/marytts/blob/v5.2/marytts-runtime/src/main/resources/marytts/modules/ssml-to-mary.xsl

[^2]: https://github.com/marytts/marytts/blob/v5.2/marytts-runtime/src/main/java/marytts/modules/acoustic/ProsodyElementHandler.java

[^3]: https://github.com/marytts/marytts/issues

-- 
/**
  * Dr. Ingmar Steiner
  *
  * Head of Independent Research Group
  * Multimodal Speech Processing
  * Cluster of Excellence MMCI
  *
  * Senior Researcher
  * Multilingual Technologies Lab
  * German Research Center for
  * Artificial Intelligence (DFKI GmbH)
  *
  * Principal Investigator
  * Collaborative Research Center SFB-1102
  * Information Density and Linguistic Encoding
  *
  * Department of Computer Science
  * Department of Computational Linguistics & Phonetics
  * Saarland University
  *
  * Campus C7.4, Room 2.01
  * D-66123 Saarbrücken
  * @tel: +49-681-302-70028
  * @fax: +49-681-302-4317
  * @web: http://coli.uni-saarland.de/~steiner/
  */