[mary-dev] AcousticModeller

Marcela Charfuelan Marcela.Charfuelan at dfki.de
Wed Jul 27 14:39:49 CEST 2011

Hello Florent,

On 07/26/2011 03:45 PM, fxavier at ircam.fr wrote:
> Dear Ingmar,
> Thanks a lot for your answers. I actually have a few other questions.
> If I understand well, one can use CART for f0 and HMM for durations, for
> example, independently of the type of synthesis (HMM or unit selection). So
> we can choose whatever we want from the three algorithms: CART, HMM and SoP.
I think you can use the CART models of a unit selection voice to predict 
duration and F0 in an HMM voice. For example, if you have the two voices 
en_US-cmu-slt.config and en_US-cmu-slt-hsmm.config, then you can combine 
their models by defining them in the configuration file.

For example, if you want to use CARTs (from the unit selection voice) to 
predict duration and F0 in an HMM voice, then make the following changes 
in the HMM config file:

# acoustic models to use (HMM models or carts from other voices can be 
# specified here)
# (uncomment to allow prosody modification specified in MARYXML input)
##voice.cmu-slt-hsmm.acousticModels = duration F0

##voice.cmu-slt-hsmm.duration.model = hmm
##voice.cmu-slt-hsmm.duration.data = 
##voice.cmu-slt-hsmm.duration.attribute = d

##voice.cmu-slt-hsmm.F0.model = hmm
##voice.cmu-slt-hsmm.F0.data = MARY_BASE/conf/en_US-cmu-slt-hsmm.config
##voice.cmu-slt-hsmm.F0.attribute = f0

# Voice-specific prosody CARTs:
voice.cmu-slt-hsmm.duration.cart = MARY_BASE/lib/voices/cmu-slt/dur.tree
voice.cmu-slt-hsmm.duration.featuredefinition = 
voice.cmu-slt-hsmm.f0.cart.left = MARY_BASE/lib/voices/cmu-slt/f0.left.tree
voice.cmu-slt-hsmm.f0.cart.mid = MARY_BASE/lib/voices/cmu-slt/f0.mid.tree
voice.cmu-slt-hsmm.f0.cart.right = 
voice.cmu-slt-hsmm.f0.featuredefinition = 

# Modules to use for predicting acoustic target features for this voice:
voice.cmu-slt-hsmm.preferredModules =  \
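Either way, the effect is visible in the ACOUSTPARAMS output: the predicted values end up in the "d" (duration in ms) and "f0" ((percent, Hz) target pairs) attributes of the phone elements. A minimal sketch of reading them back; the fragment below is invented for illustration, not actual MARY output:

```python
import xml.etree.ElementTree as ET

# Illustrative ACOUSTPARAMS-style fragment; the "d" and "f0" attribute
# names follow MaryXML conventions, but the phones and values are made up.
maryxml = """<maryxml>
  <ph p="h" d="58" f0="(50,120)"/>
  <ph p="@" d="62" f0="(50,115)"/>
</maryxml>"""

root = ET.fromstring(maryxml)
durations = [int(ph.get("d")) for ph in root.iter("ph")]
f0_targets = [ph.get("f0") for ph in root.iter("ph")]
print(durations)   # [58, 62]
print(f0_targets)  # ['(50,120)', '(50,115)']
```

Comparing these values with the CART lines commented in and out is a quick way to confirm which model actually produced the prediction.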

If you want to use F0 and duration HMM models (from the HMM voice) to 
predict duration and F0 in a unit selection voice, then make the 
following changes in the unit selection config file:

# Modules to use for predicting acoustic target features for this voice:
#voice.cmu-slt.preferredModules =  \

voice.cmu-slt.acousticModels = duration F0

voice.cmu-slt.duration.model = hmm
voice.cmu-slt.duration.data = MARY_BASE/conf/en_US-cmu-slt-hsmm.config
voice.cmu-slt.duration.attribute = d

voice.cmu-slt.F0.model = hmm
voice.cmu-slt.F0.data = MARY_BASE/conf/en_US-cmu-slt-hsmm.config
voice.cmu-slt.F0.attribute = f0
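If you script this kind of change, it amounts to commenting out the voice's preferredModules line and appending the model properties. A hypothetical helper (not part of MARY; property names follow the example above) operating on the config text:

```python
# Hypothetical helper: disable the voice's own preferredModules line and
# enable HMM-based duration/F0 models, as in the config example above.
def use_hmm_prosody(config_text, voice="cmu-slt",
                    hmm_config="MARY_BASE/conf/en_US-cmu-slt-hsmm.config"):
    lines = []
    for line in config_text.splitlines():
        if line.startswith(f"voice.{voice}.preferredModules"):
            line = "#" + line  # comment out, as in the example
        lines.append(line)
    lines += [
        f"voice.{voice}.acousticModels = duration F0",
        f"voice.{voice}.duration.model = hmm",
        f"voice.{voice}.duration.data = {hmm_config}",
        f"voice.{voice}.duration.attribute = d",
        f"voice.{voice}.F0.model = hmm",
        f"voice.{voice}.F0.data = {hmm_config}",
        f"voice.{voice}.F0.attribute = f0",
    ]
    return "\n".join(lines)
```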

> What do you think is the best choice? Which one of them is the most
> successful? Does it depend on the type of synthesis?
Actually I never had time to test that much, but predicting F0 from 
CARTs in HMM voices was not so successful.
> What is SoP exactly? I never heard of it (nothing on the internet)?
That is just linear regression: I was trying to implement sum-of-products 
prediction, as in the paper of :
But this was just a kind of baseline system, to compare against prediction 
using HMMs or CARTs.
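In other words, a sum-of-products model predicts a target (e.g. F0) as a weighted sum of products of linguistic feature values. A toy sketch; the weights and feature names below are invented and are not the actual SoP model in MARY:

```python
# Toy sum-of-products predictor: each term is a weight times a product
# of feature values. All weights/features here are illustrative only.
def sop_predict(features, terms):
    total = 0.0
    for weight, names in terms:
        prod = 1.0
        for name in names:
            prod *= features[name]
        total += weight * prod
    return total

features = {"intercept": 1.0, "stressed": 1.0, "syls_from_phrase_end": 3.0}
terms = [
    (120.0, ["intercept"]),            # base F0 in Hz
    (15.0,  ["stressed"]),             # raise F0 on stressed syllables
    (-2.0,  ["syls_from_phrase_end"]), # declination toward phrase end
]
f0 = sop_predict(features, terms)  # 120 + 15 - 6 = 129.0
```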

> I think with those answers I will have enough information on the
> AcousticModeller.
> Thanks in advance,
> Florent
>> Dear Florent,
>> Have a look at the HMMModel in modules.acoustic, but if you want to
>> understand its inner workings, you'll eventually have to dive into the
>> htsengine package. From what I understand, that in turn goes back to the
>> original HTS engine (pre-API for Mary 4.x), which Marcela ported. Sorry,
>> but I'm a little out of my depth there.
>> But if you'd rather focus on the deprecated HMMDurationF0Modeller, be my
>> guest... =)
>> Best wishes,
>> -Ingmar
>> On 25.07.2011 18:08, fxavier at ircam.fr wrote:
>>> Dear Ingmar,
>>> What could be really interesting, as I'm trying to build an HMM voice, is
>>> to know how the deprecated module HMMDurationF0Modeller works; this module
>>> is, if I understand correctly, "included" in the new AcousticModeller. I
>>> saw that Marcela coded it. Any information about how the F0 and the
>>> durations are processed using HMMs would be helpful (this is for my
>>> master's thesis; I'm explaining the generic modules, like AcousticModeller).
>>> Thanks a lot,
>>> Florent
>>>> Dear Florent,
>>>> On 25.07.2011 13:35, fxavier at ircam.fr wrote:
>>>>> Dear Ingmar,
>>>>> I can see you are the one who coded this module. I know that it predicts
>>>>> the F0 and durations using CART, according to the symbolic prosody
>>>>> defined by ProsodyGeneric and PronunciationModel.
>>>> Yes, if you're referring to things like syllable stress, ToBI accents,
>>>> and boundary placement.
>>>>> Can you tell me more about it? What's the theory behind it, i.e. how
>>>>> did
>>>>> it work?
>>>> I hope you won't be disappointed to learn that the AcousticModeller is
>>>> first and foremost a unified replacement for a handful of deprecated
>>>> modules (viz. DummyAllophones2AcoustParams, CARTDurationModeller,
>>>> CARTF0Modeller, HMMDurationF0Modeller) which essentially duplicated
>>>> very
>>>> similar high-level processing with very different low-level code.
>>>> The AcousticModeller has one purpose, which is to take an utterance
>>>> and
>>>> generate target values for acoustic parameters such as duration and F0
>>>> for all segments and boundaries. How it does this depends on the voice
>>>> configuration. A unit-selection voice will typically predict those
>>>> continuous features using CARTs, while a HMM-based voice will use HMMs.
>>>> The AcousticModeller introduced three major improvements over the old
>>>> design:
>>>> 1) A *unified interface*: the AcousticModeller applies Models from the
>>>> modules.acoustic package, which can be thought of as wrappers around
>>>> the
>>>> specific algorithms used for parameter prediction. At the moment, we
>>>> have CARTs, HMMs, and SoP-based models.
>>>> 2) By the same token, different *types of models can be mixed* within
>>>> the same voice; i.e. nothing prevents you from using a CART for
>>>> duration
>>>> and HMMs for F0-prediction in a voice if you want.
>>>> 3) The prosodic parameters are *extensible* by adding custom generic
>>>> continuous ones which might be relevant for your application. For
>>>> example, we experimented with voice quality based parameters for
>>>> expressive unit selection using this feature, but it could be anything,
>>>> even motion trajectories for audiovisual synthesis!
>>>> Having said all that, the AcousticModeller itself is implemented in a
>>>> somewhat baroque way, because the newfound flexibility and elegance in
>>>> design is tethered by the constraint of backward compatibility within
>>>> the 4.x generation; there are several hard-coded assumptions and a few
>>>> hacks in the code. But unless you devise a completely different way of
>>>> transporting predicted prosody to a given waveform synthesizer, it
>>>> should serve its purpose reasonably well for now.
>>>>> Regarding this paper:
>>>>>     "Three Methods of Intonation Modelling"
>>>>> http://www.cs.cmu.edu/~awb/papers/ESCA98_3int.pdf
>>>>> does the AcousticModeller use the first method?
>>>> That depends entirely on the type of Models configured for the voice in
>>>> question. If you have a conventional unit-selection voice with CARTs
>>>> for
>>>> duration and F0 prediction, in Mary, these parameters are predicted in
>>>> absolute values. You could implement something like Tilt if you wanted,
>>>> but it makes no difference for the AcousticModeller, which simply
>>>> assigns and passes out the attribute values.
>>>>> I can see there
>>>>> http://mary.dfki.de/documentation/publications/schroeder_trouvain2003.pdf
>>>>> page 19, that the output of this module is not a MaryXML, but
>>>>> it is a MaryXML, isn't it?
>>>> I'm not sure what you're referring to here. The IJST article predates
>>>> the architecture for which we designed the AcousticModeller by a number
>>>> of years. But you're correct in that the output of the
>>>> AcousticModeller is MaryXML at the ACOUSTPARAMS stage.
>>>>> Finally will you put this module into the NLP components, or into the
>>>>> Synthesis ones? To my mind, it is more a synthesis component rather
>>>>> than
>>>>> a
>>>>> NLP one.
>>>> Sorry, I don't understand the question. The AcousticModeller is in
>>>> place, and occupies a crucial point in the synthesis pipeline. As it
>>>> handles the prediction of acoustic parameters, it is certainly beyond
>>>> the scope of what most people would refer to as NLP.
>>>> In case you're wondering which artifact in mavenized Mary contains the
>>>> AcousticModeller and the Models, they're in marytts-common.
>>>> Best wishes,
>>>> -Ingmar
>>>>> Thanks in advance for your answers,
>>>>> Florent
>>>> --
>>>> Ingmar Steiner
>>>> Postdoctoral Researcher
>>>> LORIA Speech Group, Nancy, France
>>>> National Institute for Research in
>>>> Computer Science and Control (INRIA)
>>> _______________________________________________
>>> Mary-dev mailing list
>>> Mary-dev at dfki.de
>>> http://www.dfki.de/mailman/cgi-bin/listinfo/mary-dev

  Marcela Charfuelan, Researcher, DFKI GmbH
  Projektbuero Berlin, Alt-Moabit 91c, D-10559 Berlin, Germany
  Phone: +49 (0)30 23895-1821
  URL  : http://www.dfki.de/~charfuel/
  Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
  Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
  Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
  Dr. Walter Olthoff
  Vorsitzender des Aufsichtsrats:
  Prof. Dr. h.c. Hans A. Aukes
  Amtsgericht Kaiserslautern, HRB 2313
