[mary-users] PRAAT_TEXTGRID - first test and problem with times

Tue Sep 7 13:07:37 CEST 2010

Dear Brigitte,

> I put the glottal stop in to emphasize that very specific kangaroo.

I'm sorry, but this is very confusing. The canonical pronunciation of the German word "Känguru" does not contain a glottal stop. In fact, glottal stops have a very specific distribution in German phonotactics and do not cluster with other stops in the same syllable. Injecting extra phones into a token's pronunciation is not a suitable way to add emphasis, and it seems to be the source of your troubles, so I strongly suggest you remove the glottal stop. Instead, you should rely on one of the several available mechanisms for prosody control, such as boundaries, duration, and pitch specifications.
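
As a rough illustration of that suggestion, here is a minimal Python sketch (not part of MARY TTS). It drops a glottal stop that is clustered with a following stop from a SAMPA string and wraps the token in a prosody element instead. The helper names, the pitch and rate values, and the use of a MaryXML-style "prosody" element with "pitch" and "rate" attributes are assumptions for illustration only; please check the MaryXML schema of your version, and whether your voice honours these settings, before relying on them.

```python
import re
import xml.etree.ElementTree as ET


def strip_cluster_glottal_stop(ph: str) -> str:
    """Remove a SAMPA glottal stop ("?") only where it directly precedes
    another stop (p, b, t, d, k, g); other glottal stops are left alone."""
    return re.sub(r"\?(?=[pbtdkg])", "", ph)


def emphasized_token(word: str, ph: str, pitch: str = "+10%", rate: str = "-15%"):
    """Build a hypothetical MaryXML-style fragment: the token keeps a cleaned
    'ph' attribute, and emphasis is requested through prosody settings.
    The attribute values are illustrative assumptions, not recommendations."""
    prosody = ET.Element("prosody", {"pitch": pitch, "rate": rate})
    token = ET.SubElement(prosody, "t", {"ph": strip_cluster_glottal_stop(ph)})
    token.text = word
    return prosody


# The transcription from the attached XML file, with the inserted glottal stop.
fragment = emphasized_token("Känguru", "'?kEN-gU-Ru:")
print(ET.tostring(fragment, encoding="unicode"))
```

The cleaned token can then be placed back into the XML file by hand; the sketch only shows the shape of the change.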

> Your last question sounds as if you wanted to know how the XML file was written. Very simple answer again: using SAMPA German and oxygen or TextWrangler.

No, I wanted to know if perhaps the phone sequence had been generated by a modified lexicon or faulty LTS rules. But you've already answered that.

Best wishes,

/**
 * Ingmar Steiner
 * Researcher, Language Technology
 * German Research Center for Artificial Intelligence
 *
 * Campus D3 1 +1.18
 * D-66123 Saarbrücken
 * Germany
 * Phone: ++49-681-857-75-5263 (NEW!)
 * Email: ingmar.steiner at dfki.de
 *
 * Deutsches Forschungszentrum für Künstliche Intelligenz GmbH
 * Trippstadter Straße 122, D-67663 Kaiserslautern, Germany
 * Geschäftsführung:
 * Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
 * Dr. Walter Olthoff
 * Vorsitzender des Aufsichtsrats:
 * Prof. Dr. h.c. Hans A. Aukes
 * Amtsgericht Kaiserslautern, HRB 2313
 */

> 
> Cheers again! 
> Brigitte
> 
> 
> On 07.09.2010 at 12:04, Ingmar Steiner wrote:
> 
>> Dear Brigitte, 
>> 
>>>> To me it looks like the glottal stop is realized as laryngealized, not voiceless; glottal pulses are clearly visible. But that's just one realization in the voice data. What I'd like to know is how the glottal stop got between the words "vom" and "Känguru" in the first place. Did you manually insert it?
>>> 
>>> First, I am sending TakeKaenguru.xml, so you will see how the glottal stop came in.
>> 
>> Sorry, maybe my question was ambiguous. I'm aware that you did not simply insert a "?" interval into the TextGrid, but that it was synthesized from a requested phone sequence. This sequence includes the value of the "ph" attribute in the
>> 
>>> <t ph = "'?kEN-gU-Ru:" >Känguru</t>
>> 
>> 
>> snippet of the XML file you attached. My question is, why is there a glottal stop in the phones for the token "Känguru"? How was this XML file created?
>> 
>> Best wishes,
>> 
>> Ingmar Steiner
>> 
>>> If you listen to the individualized de7 glottal stop in the sample, using the text grid, you will also doubt that it is annotated correctly.
>>> 
>>>> 
>>>>> In my view, it comes nearer to reality. The sample is again the kangaroo phrase. I am sending just the PNG, with the glottal stop highlighted. This should suffice to convince you.
>>>> 
>>>> Sorry, I've lost track of what the issue is. Are you asking about the Praat TextGrid export, or the duration prediction for glottal stops, or the labeling of individual units in the pavoque voice data?
>>> My issue is a clear annotation of sound, phones and larger units. The intervals should correspond to sound reality.
>>> 
>>>> 
>>>>> So, after your first answer, my question changes: are there new female German voices in Mary TTS with better transparency?
>>>> 
>>>> AFAIA, a female German HMM voice is in preparation. In the meantime, I would like to encourage you to try the (male) bits3-hsmm voice, as the quality should be superior to diphone synthesis.
>>> 
>>> Good that a new female German voice is on its way. As male intonation differs quite a lot from female intonation (sorry for the truism!), the male bits3-hsmm voice is unfortunately of only limited help.
>>> 
>>> Cheers
>>> 
>>> Brigitte
>>> <TakeKaenguru.xml>
>>> 
>>>> 
>>>> Best wishes,
>>>> 
>>>> Ingmar Steiner
>>>> 
>>>>> 
>>>>> Cheers from Hannover
>>>>> 
>>>>> Brigitte
>>>>> 
>>>>> <kaengurupavoque.png>
>>>>> 
>>>>> 
>>>>> 
>>>>> On 06.09.2010 at 10:08, Ingmar Steiner wrote:
>>>>> 
>>>>>> Dear Brigitte,
>>>>>> 
>>>>>> the PRAAT_TEXTGRID output type is essentially a conversion from REALISED_ACOUSTPARAMS to a format convenient for import into Praat. Specifically, the duration information present in MaryXML is formatted as a TextGrid with one or more IntervalTiers. As mentioned in previous messages, the Praat TextGrid support should still be considered experimental.
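
The conversion described in the paragraph above can be pictured with a minimal Python sketch. This is not the PraatTextGridGenerator code: the function names are hypothetical, the phone durations are invented example values, and the sketch assumes that each phone carries a duration in milliseconds. It accumulates the durations into interval boundaries and writes a single IntervalTier in Praat's long text format.

```python
def intervals_from_durations(phones):
    """Turn (label, duration_ms) pairs into (start_s, end_s, label) intervals."""
    intervals, start_ms = [], 0
    for label, dur_ms in phones:
        end_ms = start_ms + dur_ms
        intervals.append((start_ms / 1000, end_ms / 1000, label))
        start_ms = end_ms
    return intervals


def to_textgrid(phones, tier_name="phones"):
    """Format the intervals as a one-tier TextGrid (Praat long text format)."""
    intervals = intervals_from_durations(phones)
    xmax = intervals[-1][1] if intervals else 0.0
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        '',
        'xmin = 0',
        f'xmax = {xmax}',
        'tiers? <exists>',
        'size = 1',
        'item []:',
        '    item [1]:',
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        '        xmin = 0',
        f'        xmax = {xmax}',
        f'        intervals: size = {len(intervals)}',
    ]
    for i, (start, end, label) in enumerate(intervals, 1):
        text = label.replace('"', '""')  # Praat doubles quotes inside strings
        lines += [
            f'        intervals [{i}]:',
            f'            xmin = {start}',
            f'            xmax = {end}',
            f'            text = "{text}"',
        ]
    return "\n".join(lines) + "\n"


# Invented durations (ms) for the first phones of the transcription in question.
example = [("?", 30), ("k", 70), ("E", 90), ("N", 60)]
textgrid = to_textgrid(example)
print(textgrid)
```

Because the boundaries are derived purely from the requested durations, they describe what was asked of the synthesizer, not necessarily where the phones end up in the generated waveform.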
>>>>>> 
>>>>>> You do not mention which voice you (or the anonymous "curious user") used, but your example sounds very much like the de7 female MBROLA voice. MBROLA is a diphone synthesizer, and furthermore does not permit close inspection of its internal processing. The MaryXML is converted to MBROLA format, and passed to the MBROLA binary, which uses the requested voice to generate AUDIO. It is not unlikely that the durations specified in MaryXML (which form the basis of the Praat TextGrid format, as explained above) do not match the phone boundaries in the waveform generated by MBROLA. If you discover that there are systematic mismatches reproducible under certain conditions, it may be a problem with the MBROLA voice data, or possibly a bug in the PraatTextGridGenerator code. Please provide all of the details to me once you have determined that the problem is indeed not with the voice.
>>>>>> 
>>>>>> The second and third tiers in the three-tier TextGrid format about which you inquire contain information particular to unit-selection synthesis, viz. the diphone unit boundaries, and the intervals of consecutive units from the same source recording, respectively. These tiers are useful for the analysis of unit-selection itself and debugging. They are only generated by the UnitSelectionSynthesizer, i.e. when using a unit-selection voice.
>>>>>> 
>>>>>> Best wishes,
>>>>>> 
>>>>>> Ingmar Steiner
>>>>>> 
>>>>>> On 5 Sep 2010, at 07:53, Brigitte Endres-Niggemeyer wrote:
>>>>>> 
>>>>>>> Dear all and dear Ingmar,
>>>>>>> 
>>>>>>> the very curious user of the text grid tested the new Mary TTS version, specifically the PRAAT_TEXTGRID output, and she found that phone boundaries may be misplaced.
>>>>>>> In my very simple test phrase, the glottal stop starts too early. I am sending the PNG, the text grid and the sound file, hoping that this is enough for a check.
>>>>>>> My obvious question is how the start of this example glottal stop can be adjusted. Editing it by hand is not a long-term solution!
>>>>>>> 
>>>>>>> Next question: The website Mary demo now produces a three-layer grid with phones, units, and sources. Great, for what applications do you propose this version? Please explain this to me and to others!
>>>>>>> 
>>>>>>> Cheers from Hannover
>>>>>>> 
>>>>>>> Brigitte
>>>>>>> 
>>>>>>> 
>>>>>>> <Kaenguru.zip>
>>>>>>> 
>>>> 
>>> 
>>> Brigitte Endres-Niggemeyer, Prof. Dr. phil. habil.
>>> FH Hannover
>>> Fakultaet III - Medien, Information und Design
>>> Expo Plaza 12
>>> 30539 Hannover
>>> Tel. +49 511 92 96 2641
>>> Home +49 511 84 41 690
>>> Mobile 015154726114
>>> "spiritus flat ubi vult"
>>> The spirit blows where it wills.
>>> Brigitte.Endres-Niggemeyer at fh-hannover.de
>>> brigitteen at googlemail.com
>>> http://endres-niggemeyer.fh-hannover.de/