[mary-users] Markup language for MARY TTS

Wed Feb 15 09:49:19 CET 2012

Dear Nikolai,

On 14.02.2012 19:04, Nikolai Kouznetsov wrote:
> Hello,
>
> 1. What is the markup language that I should use to control the output
> of the MARY TTS?

MaryXML, see http://mary.dfki.de/documentation/maryxml

> 2. Where can I find the API I can use to speak with the server from
> within my own application, not from within the provided client application.

http://mary.dfki.de/javadoc/4.3.0/

More specifically, see e.g. 
http://www.dfki.de/pipermail/mary-dev/2011-November/000247.html

But you can really use whatever you want to send the HTTP request, see e.g.

http://mary.opendfki.de/browser/trunk/marytts-assembly/src/release/doc/examples/client

or

http://www.dfki.de/pipermail/mary-users/2010-November/000666.html
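
As a rough illustration of "whatever you want to send the HTTP request", here is a minimal Python sketch. It assumes a MARY server running locally on port 59125 and the `/process` request parameters (`INPUT_TEXT`, `INPUT_TYPE`, `OUTPUT_TYPE`, `AUDIO`, `LOCALE`, `VOICE`) as I understand the HTTP interface; please check these against the documentation for your server version. The function names are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_mary_request_url(text, host="localhost", port=59125,
                           input_type="TEXT", locale="en_US", voice=None):
    """Build a synthesis request URL (parameter names are assumptions)."""
    params = {
        "INPUT_TEXT": text,
        "INPUT_TYPE": input_type,   # e.g. "TEXT", or "RAWMARYXML" for MaryXML input
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "LOCALE": locale,
    }
    if voice is not None:
        params["VOICE"] = voice
    # urlencode percent-escapes the values, so markup in the text is safe to pass
    return f"http://{host}:{port}/process?{urlencode(params)}"


def synthesize_to_file(text, path, **kwargs):
    """Send the request and write the returned audio bytes to `path`."""
    url = build_mary_request_url(text, **kwargs)
    with urlopen(url, timeout=10) as response:
        audio = response.read()
    with open(path, "wb") as out:
        out.write(audio)
```

With a server running, `synthesize_to_file("Hello world", "hello.wav")` should write a WAV file. To send MaryXML instead of plain text, pass the XML document as `text` and set `input_type="RAWMARYXML"`.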

> 3. Is this API aware of a markup?

See above: the markup is MaryXML, which you send to the server as the 
input of your request.

>
>
> Just out of curiosity, how large is the speech material for one voice
> delivering sound of good quality? Please, in terms of minutes/hours and
> not in terms of MB.

For unit selection voices, the audio is essentially uncompressed PCM, 
usually sampled at 16 kHz, 16 bit, stored in timeline_waveforms.mry. So 
if that file is, say, 110 MB in size, you can (very roughly) estimate 
that it contains about one hour of audio:

110 mebioctets ≈ 112500 kibioctets = 115200000 octets
= 57600000 samples (at 2 octets per 16-bit sample)
= 3600 seconds (at 16000 samples per second) = 60 minutes = 1 hour

The timeline file also contains an index, so by this reckoning you'll be 
a little over the actual audio duration, but it's a start.
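
The same back-of-the-envelope estimate as a small Python sketch, assuming mono 16 kHz, 16-bit PCM and ignoring the index overhead (so the result is an upper bound). The function name is made up for this example.

```python
def estimate_audio_seconds(file_size_bytes, sample_rate_hz=16000,
                           bits_per_sample=16, channels=1):
    """Rough duration of uncompressed PCM audio stored in a file of this size."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return file_size_bytes / bytes_per_second


# Example: a timeline_waveforms.mry file of 110 MiB
size_bytes = 110 * 1024 * 1024
minutes = estimate_audio_seconds(size_bytes) / 60
print(f"about {minutes:.0f} minutes of audio")
```

For a real voice, you can pass the result of `os.path.getsize()` on the timeline file instead of the hard-coded size.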

For the HSMM voices, it all works quite differently, since the voice 
does not contain any audio, only models trained on a certain amount of 
audio. There's really no way of knowing how much data was used for 
training, but for those voices that have both a unit selection and an 
HSMM variant (e.g. "dfki-poppy" and "dfki-poppy-hsmm"), the HSMM voice 
would have been trained on the data that is included in the unit 
selection voice.

Best wishes,

-Ingmar

>
> Best regards,
> Nikolai
>
>

-- 
Ingmar Steiner
Postdoctoral Researcher

LORIA Speech Group, Nancy, France
National Institute for Research in
Computer Science and Control (INRIA)