[mary-dev] AcousticModeller

Mon Jul 25 15:48:34 CEST 2011

Dear Florent,

On 25.07.2011 13:35, fxavier at ircam.fr wrote:
> Dear Ingmar,
>
> I can see you are the one that coded this module. I know that it predicts
> the F0 and durations using CART, according to the symbolic prosody defined
> by ProsodyGeneric and PronunciationModel.

Yes, if you're referring to things like syllable stress, ToBI accents, 
and boundary placement.

>
> Can you tell me more about it? What's the theory behind it, i.e. how does
> it work?

I hope you won't be disappointed to learn that the AcousticModeller is 
first and foremost a unified replacement for a handful of deprecated 
modules (viz. DummyAllophones2AcoustParams, CARTDurationModeller, 
CARTF0Modeller, HMMDurationF0Modeller) which essentially duplicated very 
similar high-level processing with very different low-level code.

The AcousticModeller has one purpose, which is to take an utterance and 
generate target values for acoustic parameters such as duration and F0 
for all segments and boundaries. How it does this depends on the voice 
configuration. A unit-selection voice will typically predict those 
continuous features using CARTs, while an HMM-based voice will use HMMs.

The AcousticModeller introduced three major improvements over the old 
design:

1) A *unified interface*: the AcousticModeller applies Models from the 
modules.acoustic package, which can be thought of as wrappers around the 
specific algorithms used for parameter prediction. At the moment, we 
have CARTs, HMMs, and SoP-based models.

2) By the same token, different *types of models can be mixed* within 
the same voice; i.e. nothing prevents you from using a CART for duration 
and HMMs for F0 prediction in a voice if you want.

3) The prosodic parameters are *extensible*: you can add custom, generic 
continuous parameters that are relevant for your application. For 
example, we experimented with voice-quality-based parameters for 
expressive unit selection using this feature, but it could be anything, 
even motion trajectories for audiovisual synthesis!
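
The three points above can be sketched in a few lines of Java. This is
only an illustration under stated assumptions: Segment, Model,
ToyTreeModel, ToyMeanModel and the "vq" parameter are hypothetical names
made up for this sketch, not the actual classes in the modules.acoustic
package, and the toy models merely stand in for real CARTs or HMMs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AcousticSketch {

    /** Hypothetical stand-in for one segment of the utterance. */
    static final class Segment {
        final String phone;
        final boolean stressed;
        /** Predicted targets, keyed by attribute name (e.g. "d", "f0"). */
        final Map<String, Double> targets = new LinkedHashMap<>();

        Segment(String phone, boolean stressed) {
            this.phone = phone;
            this.stressed = stressed;
        }
    }

    /** Hypothetical unified interface: one model predicts one parameter. */
    interface Model {
        String targetAttribute();
        double predict(Segment segment);
    }

    /** Toy stand-in for a CART: a single decision on the stress feature. */
    static final class ToyTreeModel implements Model {
        private final String attribute;
        private final double ifStressed;
        private final double otherwise;

        ToyTreeModel(String attribute, double ifStressed, double otherwise) {
            this.attribute = attribute;
            this.ifStressed = ifStressed;
            this.otherwise = otherwise;
        }

        public String targetAttribute() { return attribute; }

        public double predict(Segment s) {
            return s.stressed ? ifStressed : otherwise;
        }
    }

    /** Toy stand-in for a statistical model: a stored mean per phone. */
    static final class ToyMeanModel implements Model {
        private final String attribute;
        private final Map<String, Double> means;
        private final double fallback;

        ToyMeanModel(String attribute, Map<String, Double> means, double fallback) {
            this.attribute = attribute;
            this.means = means;
            this.fallback = fallback;
        }

        public String targetAttribute() { return attribute; }

        public double predict(Segment s) {
            return means.getOrDefault(s.phone, fallback);
        }
    }

    /** The modeller only iterates; it does not care how a model predicts. */
    static void applyModels(List<Segment> utterance, List<Model> models) {
        for (Model model : models) {
            for (Segment segment : utterance) {
                segment.targets.put(model.targetAttribute(), model.predict(segment));
            }
        }
    }

    /** Mixes model types in one "voice" and adds a custom parameter. */
    static List<Segment> demoUtterance() {
        List<Segment> utterance = new ArrayList<>();
        utterance.add(new Segment("h", false));
        utterance.add(new Segment("a", true));

        List<Model> models = new ArrayList<>();
        models.add(new ToyTreeModel("d", 120.0, 80.0));                // duration, ms
        models.add(new ToyMeanModel("f0", Map.of("a", 130.0), 110.0)); // F0, Hz
        models.add(new ToyTreeModel("vq", 0.8, 0.3));                  // custom parameter
        applyModels(utterance, models);
        return utterance;
    }

    public static void main(String[] args) {
        for (Segment s : demoUtterance()) {
            System.out.println(s.phone + " " + s.targets);
        }
    }
}
```

Adding a further parameter then only means adding another Model to the
list; applyModels itself does not change.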

Having said all that, the AcousticModeller itself is implemented in a 
somewhat baroque way, because the newfound flexibility and elegance in 
design are tethered by the constraint of backward compatibility within 
the 4.x generation; there are several hard-coded assumptions and a few 
hacks in the code. But unless you devise a completely different way of 
transporting predicted prosody to a given waveform synthesizer, it 
should serve its purpose reasonably well for now.

> Regarding this paper:
>
>   "Three Methods of Intonation Modelling"
> http://www.cs.cmu.edu/~awb/papers/ESCA98_3int.pdf
>
> does the AcousticModeller use the first method?

That depends entirely on the type of Models configured for the voice in 
question. If you have a conventional unit-selection voice with CARTs for 
duration and F0 prediction, then in Mary these parameters are predicted 
as absolute values. You could implement something like Tilt if you 
wanted, but it makes no difference for the AcousticModeller, which 
simply assigns the predicted attribute values and passes them on.
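
To make "assigns the attribute values" concrete, here is a minimal
sketch using only the JDK's DOM API. It is not the AcousticModeller's
code. The element name "ph", the attributes "p", "d" and "f0", and the
"(position,Hz)" notation are assumptions about what MaryXML looks like
at the ACOUSTPARAMS stage; the fixed duration and F0 arguments stand in
for whatever a configured Model would predict.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class AttributeAssignmentSketch {

    /** Builds a tiny document with one "ph" element per phone symbol. */
    static Document buildUtterance(String... phones) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No XML parser available", e);
        }
        Element syllable = doc.createElement("syllable");
        doc.appendChild(syllable);
        for (String phone : phones) {
            Element ph = doc.createElement("ph");
            ph.setAttribute("p", phone); // assumed attribute for the phone symbol
            syllable.appendChild(ph);
        }
        return doc;
    }

    /**
     * Writes absolute targets onto every "ph" element: a duration in
     * milliseconds and a single F0 point in Hz at 50% of the segment.
     */
    static void assignTargets(Document doc, double durationMs, double f0Hz) {
        NodeList phones = doc.getElementsByTagName("ph");
        for (int i = 0; i < phones.getLength(); i++) {
            Element ph = (Element) phones.item(i);
            ph.setAttribute("d", String.valueOf(Math.round(durationMs)));
            ph.setAttribute("f0", "(50," + Math.round(f0Hz) + ")");
        }
    }

    public static void main(String[] args) {
        Document doc = buildUtterance("h", "@");
        assignTargets(doc, 85.4, 118.6);
        NodeList phones = doc.getElementsByTagName("ph");
        for (int i = 0; i < phones.getLength(); i++) {
            Element ph = (Element) phones.item(i);
            System.out.println(ph.getAttribute("p") + " d=" + ph.getAttribute("d")
                    + " f0=" + ph.getAttribute("f0"));
        }
    }
}
```

A Tilt-style model would change how the numbers are produced, not how
they are written back onto the elements.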

>
> I can see there
> http://mary.dfki.de/documentation/publications/schroeder_trouvain2003.pdf
> page 19, that the output of this module is not a MaryXML, but ACOUSTPARAMS
> is a MaryXML, isn't it?

I'm not sure what you're referring to here. The IJST article predates by 
a number of years the architecture for which we designed the 
AcousticModeller. But you're correct in that the output of the 
AcousticModeller is MaryXML at the ACOUSTPARAMS stage.

>
> Finally, will you put this module into the NLP components, or into the
> Synthesis ones? To my mind, it is more a synthesis component than an
> NLP one.

Sorry, I don't understand the question. The AcousticModeller is in 
place, and occupies a crucial point in the synthesis pipeline. As it 
handles the prediction of acoustic parameters, it is certainly beyond 
the scope of what most people would refer to as NLP.

In case you're wondering which artifact in mavenized Mary contains the 
AcousticModeller and the Models, they're in marytts-common.

Best wishes,

-Ingmar

>
> Thanks in advance for your answers,
>
>
> Florent

-- 
Ingmar Steiner
Postdoctoral Researcher

LORIA Speech Group, Nancy, France
National Institute for Research in
Computer Science and Control (INRIA)