[mary-dev] [mary-users] Timestamps mapping for generated audio

Thu Mar 1 21:53:32 CET 2012

Dear Avi,

in principle I agree that it would be nice to be able to select an 
output type corresponding to tokens with start times, perhaps in some 
standard captioning format (not an expert there). However, because of 
the way the module registry in Mary is designed, I think this would have 
to go into the very core, which is already somewhat cluttered with 
rarely used, and possibly deprecated, modules. Adding another one for a 
special use case is probably not attractive.

It would probably make more sense from a software design perspective to 
move all of the non-standard modules (REALISED_DURATIONS, MBROLA, 
PRAAT_TEXTGRID, etc.) out of the module registry into a set of 
client-side MaryXML processors which convert REALISED_ACOUSTPARAMS into 
various useful formats. Haven't really given it much thought, and it's 
certainly open to debate... on mary-dev. =)

In any case, I would recommend prototyping your code in your favorite 
scripting language, in XSLT, or similar. You don't actually need to 
modify Mary itself for that, but feel free to fork 
https://github.com/marc1s/marytts; you'll find some client examples 
under marytts-assembly/src/release/doc/examples/client.
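
To illustrate, here is a minimal Python sketch of such a client-side 
prototype. It applies the rule from my earlier reply (quoted below): a 
token's start time is the "end" of its first phone minus that phone's 
"d" / 1000. The function name, the namespace URI, and the sample XML are 
assumptions for illustration; the sample is hand-written, not real Mary 
output, so please check it against what your server actually returns. 
As noted below, this only holds for the HSMM voices.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI of MaryXML documents; verify against real output.
NS = {"m": "http://mary.dfki.de/2002/MaryXML"}


def token_start_times(maryxml):
    """Return a list of (token_text, start_in_seconds) pairs.

    Hypothetical helper: for each <t>, take the first <ph> of its first
    <syllable> and compute end - d / 1000 ("end" in s, "d" in ms).
    """
    root = ET.fromstring(maryxml)
    result = []
    for token in root.iterfind(".//m:t", NS):
        ph = token.find("m:syllable/m:ph", NS)
        if ph is None or ph.get("end") is None or ph.get("d") is None:
            continue  # e.g. punctuation tokens, which carry no phones
        start = float(ph.get("end")) - float(ph.get("d")) / 1000.0
        result.append(((token.text or "").strip(), round(start, 3)))
    return result


# Hand-written fragment in the shape of REALISED_ACOUSTPARAMS output
# (illustrative values only).
SAMPLE = """<maryxml xmlns="http://mary.dfki.de/2002/MaryXML">
<p><s><phrase>
<t>Hello
<syllable><ph d="80" end="0.08" p="h"/><ph d="60" end="0.14" p="@"/></syllable>
<syllable><ph d="70" end="0.21" p="l"/><ph d="90" end="0.30" p="@U"/></syllable>
</t>
<t>world
<syllable><ph d="75" end="0.375" p="w"/><ph d="125" end="0.5" p="r="/></syllable>
</t>
<t>.</t>
</phrase></s></p>
</maryxml>"""

if __name__ == "__main__":
    for word, start in token_start_times(SAMPLE):
        print(word, start)
```

For the unit selection voices you would instead keep a running start 
time variable (for example, the previous phone's "end"), since the 
predicted "d" will not match the realized end times exactly.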

Best wishes,

-Ingmar

On 3/1/12 15:12, avi tshuva wrote:
>
> Many thanks, Ingmar, for your quick reply. We're trying to incorporate
> your advice into our code.
> What do you think: adding API methods that give different information
> types, such as the one I mentioned, without the need to go through XML
> output - do you think this exists/should exist or not? After all,
> internally, I guess all that information exists; it's just a matter of
> exposing it via a simple and documented API...?
>
> On Thu 01 Mar 2012 03:59:31 PM IST, Ingmar Steiner wrote:
>> Dear Avi,
>>
>> the easiest way is probably OUTPUT_TYPE=REALISED_DURATIONS, which
>> gives you phone endtimes in XWaves lab format.
>>
>> If you need word/token start times, you'll have to parse the MaryXML
>> from OUTPUT_TYPE=REALISED_ACOUSTPARAMS. In that case, you'll be
>> looking for the "end" attribute of the first <ph>one of each <t>oken's
>> first <syllable>, minus the phone's "d"uration / 1000, which works for
>> the HSMM voices. (For the unit selection voices, the predicted phone
>> duration will not match the realized end times exactly, so you'll have
>> to use your own start time variable to avoid syncing issues.)
>>
>> Hope this helps!
>>
>> Best wishes,
>>
>> -Ingmar
>>
>> On 01.03.2012 14:25, avi tshuva wrote:
>>>
>>> Hi,
>>> How can I get the timing information for generated audio? I need it in
>>> order to sync video I'm generating from the text with the audio.
>>>
>>> _for example:_
>>> *text:* "Hello world. How are you?"
>>> *timing map: *
>>> {Hello: 0,
>>> world: 0.1,
>>> How: 0.24
>>> ..
>>> ...
>>> }
>>>
>>> Thank you
>>>
>>>
>>> --
>>> /Avi Tshuva
>>> VP R&D
>>> WatchText /
>>>
>>>
>>
>
>
>

-- 
Ingmar Steiner
Postdoctoral Researcher

LORIA Speech Group, Nancy, France
National Institute for Research in
Computer Science and Control (INRIA)