[mary-users] HMMVoiceMakeData - PhoneFeatureLabelAligner

Wed Aug 3 13:54:19 CEST 2011

Dear Florent,

just to make sure I understood correctly: you have 2999 utterances in
total, and 2991 of them are misaligned, while 8 are OK? Or do 8 have
problems, while 2991 are OK? In the latter case, do what Sathish
suggests and simply remove the 8. In the former case, 8 utterances are
not enough to build a proper voice, so read on.
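
A minimal sketch of that removal, in case it helps. It assumes the
basename list is plain text with one utterance name per line; the name
`drop_basenames` and the utt_* values are mine, purely for illustration.
It works on in-memory lists, so the file reading and writing can be
adapted to your setup:

```python
def drop_basenames(basenames, problematic):
    """Return the basenames in their original order, minus the problematic ones."""
    bad = set(problematic)
    return [name for name in basenames if name not in bad]


# Hypothetical example values; in practice, read one name per line from
# the basename list file and take the problematic names from the aligner log.
all_names = ["utt_1", "utt_2", "utt_3", "utt_4"]
kept = drop_basenames(all_names, ["utt_2", "utt_4"])
```

Keep a backup of the original list before overwriting it, so the
discarded utterances can be restored once the underlying problem is
understood.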

What do the utterances that are OK have that is different from the
problematic ones? Manual inspection might give you clues.
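
To make that inspection less tedious, something like the following could
show where the phone sequences diverge. It is only a sketch: it assumes
the allophones file is MaryXML with <ph p="..."/> elements, and that the
label file has a header ending in a line containing only "#", followed
by "end_time unit_index phone" lines, as in the excerpts quoted below.
The function names are hypothetical, not part of MARY. Pauses ("_") are
ignored by default, because the sketch makes no assumption about how
boundaries are turned into pause units.

```python
import xml.etree.ElementTree as ET

MARYXML_NS = "{http://mary.dfki.de/2002/MaryXML}"
PAUSE = "_"


def phones_from_allophones(xml_data):
    """Phone symbols of all <ph p="..."/> elements, in document order."""
    root = ET.fromstring(xml_data)
    return [ph.get("p") for ph in root.iter(MARYXML_NS + "ph")]


def phones_from_lab(lab_text):
    """Phone symbols from a label file (third field of each line after '#')."""
    phones, in_body = [], False
    for line in lab_text.splitlines():
        line = line.strip()
        if not in_body:
            in_body = (line == "#")  # header ends at the '#' line
            continue
        fields = line.split()
        if len(fields) >= 3:
            phones.append(fields[2])
    return phones


def first_mismatch(xml_phones, lab_phones, ignore_pauses=True):
    """Return (index, xml_phone, lab_phone) of the first difference, or None."""
    if ignore_pauses:
        xml_phones = [p for p in xml_phones if p != PAUSE]
        lab_phones = [p for p in lab_phones if p != PAUSE]
    for i, (a, b) in enumerate(zip(xml_phones, lab_phones)):
        if a != b:
            return i, a, b
    if len(xml_phones) != len(lab_phones):
        # One sequence is a prefix of the other; report where it ends.
        i = min(len(xml_phones), len(lab_phones))
        a = xml_phones[i] if i < len(xml_phones) else None
        b = lab_phones[i] if i < len(lab_phones) else None
        return i, a, b
    return None
```

If the non-pause phones agree for an utterance, the difference must come
from the pause units; if they disagree, the returned index tells you
where to look first.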

My intuition is that the mismatches might have to do with the fact that 
most of your "syllables" are bogus. (This is of course related to your 
other question about syllabification.) Could you manually correct the 
syllabification for a few utterances and see if the number of 
problematic utterances is reduced accordingly?
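
To pick candidates for that manual correction, a rough check like the
one below could flag syllables that contain more than one vowel. It is a
sketch under two assumptions of mine: that a well-formed syllable has
exactly one vowel nucleus, and that the VOWELS set (built from symbols
appearing in this thread plus a few guesses) matches your allophone set,
which you should verify. `suspicious_syllables` is a hypothetical
helper, not a MARY function.

```python
import xml.etree.ElementTree as ET

MARYXML_NS = "{http://mary.dfki.de/2002/MaryXML}"

# Assumed vowel inventory: symbols seen in this thread plus a few guesses.
# Replace it with the vowels of the actual French allophone set.
VOWELS = {"i", "e", "E", "a", "@", "9", "u", "0", "o", "O", "y", "2",
          "o~", "a~", "e~", "9~"}


def suspicious_syllables(xml_data):
    """Return (phone string, vowel count) for syllables with more than one vowel."""
    root = ET.fromstring(xml_data)
    flagged = []
    for syllable in root.iter(MARYXML_NS + "syllable"):
        phones = syllable.get("ph", "").split()
        n_vowels = sum(1 for p in phones if p in VOWELS)
        if n_vowels > 1:
            flagged.append((" ".join(phones), n_vowels))
    return flagged
```

Utterances with many flagged syllables would be natural first candidates
for hand-corrected syllabification.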

Best wishes,

-Ingmar

On 03.08.2011 12:11, Sathish Chandra Pammi wrote:
> Dear Florent,
>
> Have you seen any command-line output showing exactly where the
> misalignment happens? I mean a phone-level mismatch.
>
> There is one quick alternative, though I know it is a bad solution:
> removing those 8 basenames from the basenames list, so that these
> utterances are excluded from the voice-building process.
>
> Best regards,
> Sathish
>
> On Tue, Aug 2, 2011 at 1:54 PM, <fxavier at ircam.fr
> <mailto:fxavier at ircam.fr>> wrote:
>
>     Dear all,
>
>     I've corrected the French phonemiser, so there is no longer a silence at
>     the beginning of the transcription. Unfortunately, 2991 of the 2999
>     utterances still have a misalignment problem. As an example, here are the
>     allophone XML and the lab file for the first sentence:
>
>
>
>     <?xml version="1.0" encoding="UTF-8" standalone="no"?><maryxml
>     xmlns="http://mary.dfki.de/2002/MaryXML"
>     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.4"
>     xml:lang="fr">
>     <p>
>     <s>
>     <prosody pitch="+5%" range="+20%">
>     <phrase>
>     <boundary breakindex="3"/><t ph="i l" pos="[PPER3MS]">
>     il
>     <syllable ph="i l">
>     <ph p="i"/>
>     <ph p="l"/>
>     </syllable>
>     </t>
>     <t ph="s a v E" pos="[V3S]">
>     savait
>     <syllable ph="s a v E">
>     <ph p="s"/>
>     <ph p="a"/>
>     <ph p="v"/>
>     <ph p="E"/>
>     </syllable>
>     </t>
>     <t ph="k @" pos="[COSUB]">
>     que
>     <syllable ph="k @">
>     <ph p="k"/>
>     <ph p="@"/>
>     </syllable>
>     </t>
>     <t ph="l @" pos="[DETMS]">
>     le
>     <syllable ph="l @">
>     <ph p="l"/>
>     <ph p="@"/>
>     </syllable>
>     </t>
>     <t ph="p 9 p l @" pos="[NMS]">
>     peuple
>     <syllable ph="p 9 p l @">
>     <ph p="p"/>
>     <ph p="9"/>
>     <ph p="p"/>
>     <ph p="l"/>
>     <ph p="@"/>
>     </syllable>
>     </t>
>     <t ph="l @" pos="[PPOBJMS]">
>     le
>     <syllable ph="l @">
>     <ph p="l"/>
>     <ph p="@"/>
>     </syllable>
>     </t>
>     <t ph="s u t @ n E" pos="[V3S]">
>     soutenait
>     <syllable ph="s u t @ n E">
>     <ph p="s"/>
>     <ph p="u"/>
>     <ph p="t"/>
>     <ph p="@"/>
>     <ph p="n"/>
>     <ph p="E"/>
>     </syllable>
>     </t>
>     <boundary breakindex="3"/><t ph="_" pos=",">
>     ,
>
>     </t>
>
>     </phrase>
>     </prosody>
>     <prosody pitch="-5%" range="-20%">
>     <phrase>
>     <t ph="s @" pos="[PDEMFS]">
>     ce
>     <syllable ph="s @">
>     <ph p="s"/>
>     <ph p="@"/>
>     </syllable>
>     </t>
>     <t ph="k i" pos="[PRELFS]">
>     qui
>     <syllable ph="k i">
>     <ph p="k"/>
>     <ph p="i"/>
>     </syllable>
>     </t>
>     <t ph="l H i" pos="[PPOBJMS]">
>     lui
>     <syllable ph="l H i">
>     <ph p="l"/>
>     <ph p="H"/>
>     <ph p="i"/>
>     </syllable>
>     </t>
>     <t ph="a" pos="[VA3S]">
>     a
>     <syllable ph="a">
>     <ph p="a"/>
>     </syllable>
>     </t>
>     <t ph="d 0 n e" pos="[VPPMS]">
>     donné
>     <syllable ph="d 0 n e">
>     <ph p="d"/>
>     <ph p="0"/>
>     <ph p="n"/>
>     <ph p="e"/>
>     </syllable>
>     </t>
>     <t ph="k o~ f j a~ s" pos="[NFS]">
>     confiance
>     <syllable ph="k o~ f j a~ s">
>     <ph p="k"/>
>     <ph p="o~"/>
>     <ph p="f"/>
>     <ph p="j"/>
>     <ph p="a~"/>
>     <ph p="s"/>
>     </syllable>
>     </t>
>     <boundary breakindex="3"/><t ph="_" pos=".">
>     .
>     <syllable ph="_">
>     <ph p="_"/>
>     </syllable>
>     </t>
>
>     </phrase>
>     </prosody>
>     </s>
>     </p>
>     </maryxml>
>
>
>
>
>     format: end time. unit index. phone
>     #
>     1.040000 0 _
>     1.175000 1 i
>     1.190000 2 l
>     1.320000 3 s
>     1.400000 4 a
>     1.435000 5 v
>     1.520000 6 E
>     1.595000 7 k
>     1.610000 8 @
>     1.665000 9 l
>     1.705000 10 @
>     1.800000 11 p
>     1.885000 12 9
>     1.990000 13 p
>     2.115000 14 l
>     2.130000 15 @
>     2.150000 16 l
>     2.165000 17 @
>     2.275000 18 s
>     2.305000 19 u
>     2.400000 20 t
>     2.415000 21 @
>     2.475000 22 n
>     2.570000 23 E
>     2.835000 24 _
>     2.915000 25 s
>     2.940000 26 @
>     3.025000 27 k
>     3.040000 28 i
>     3.055000 29 l
>     3.125000 30 H
>     3.140000 31 i
>     3.205000 32 a
>     3.285000 33 d
>     3.305000 34 0
>     3.380000 35 n
>     3.450000 36 e
>     3.525000 37 k
>     3.585000 38 o~
>     3.650000 39 f
>     3.755000 40 j
>     3.890000 41 a~
>     4.120000 42 s
>     4.720000 43 _
>
>
>
>
>     If the acoustics are not the problem, what should I do? Could the feature
>     extractor be responsible for the misalignment? I've selected all the
>     features in the FeaturesSelector. I have no clue what's going on... I
>     would like to avoid running the EHMMLabeller again, since it takes 11
>     hours to process.
>
>
>
>     I really need your help,
>     Florent
>
>      > Dear Florent,
>      >
>      > On 29.07.2011 01:46, fxavier at ircam.fr <mailto:fxavier at ircam.fr> wrote:
>      >> Dear all,
>      >>
>      >> I'm encountering problems again, this time with HMMVoiceMakeData.
>      >> It says:
>      > [...]
>      >> FeatureDefinition extracted from context file:
>      >> /home/florent/FlorentVoice//phonefeatures/utt_2513.pfeats
>      >> The following are other context features used for training Hmms:
>      >>    pos_in_syl  f0
>      >>    position_type  f1
>      >>    prev_syl_break  f2
>      >>    syl_break  f3
>      >> The previous context features were extracted from file:
>      >> /home/florent/FlorentVoice/mary/hmmFeatures.txt
>      >> Extracting monophone and context features (1): utt_2513.pfeats and utt_2513.lab
>      >> java.lang.Exception: The component HMMVoiceMakeData produced the following exception:
>      >>      at marytts.tools.voiceimport.DatabaseImportMain$8.run(DatabaseImportMain.java:294)
>      >> Caused by: java.lang.Exception: Error: Number of context features in: /home/florent/FlorentVoice//phonefeatures/utt_2513.pfeats is not the same as the number of labels in: /home/florent/FlorentVoice//phonelab/utt_2513.lab
>      >>      at marytts.tools.voiceimport.HMMVoiceMakeData.extractMonophoneAndFullContextLabels(HMMVoiceMakeData.java:988)
>      >>      at marytts.tools.voiceimport.HMMVoiceMakeData.makeLabels(HMMVoiceMakeData.java:733)
>      >>      at marytts.tools.voiceimport.HMMVoiceMakeData.compute(HMMVoiceMakeData.java:233)
>      >>      at marytts.tools.voiceimport.DatabaseImportMain$8.run(DatabaseImportMain.java:291)
>      >>
>      >>
>      >>
>      >>
>      >>
>      >> The fact is that PhoneLabelFeatureAligner failed. For 2992 out of 3000
>      >> sentences there's an alignment problem:
>      >
>      > Not good. There must be something systematically wrong with the
>      > labeling.
>      >
>      >>
>      >>
>      >>
>      >> (...)
>      >>     utt_999 Adding pause unit in labels before unit 1
>      >>   Feature file is longer than label file:  unit 41 and greater do not exist in label file
>      >> Remaining problems: 2991
>      >>      utt_1:  Feature file is longer than label file:  unit 45 and greater do not exist in label file
>      >>   ->  Skipping all utterances ! The problems remain.
>      >> Removed [0/2999] utterances from the list, [2999] utterances remain, among which [2991/2999] still have problems.
>      >>
>      >>
>      >>
>      >>
>      >> Is this the reason why HMMVoiceMakeData fails?
>      >
>      > More than likely.
>      >
>      >> What do you think is the
>      >> problem, the corpus? Should I record it again?
>      >
>      > Absolutely not! The acoustics should not be a problem (unless it's
>      > unusually bad quality or lots of mistakes from the speaker, of course).
>      >
>      >> When I recorded it, I tried to be very natural (schwa elision,
>      >> mandatory and optional liaisons, a fairly fast speaking rate), but the
>      >> sound quality is very good (mono, 16 kHz; I used AudioConverterGUI to
>      >> trim the silences and normalize). Maybe the phonetic transcriptions
>      >> didn't account for all of these typical issues exactly as they
>      >> should, so the EHMMLabeler didn't label it right...
>      >
>      > Indeed, the automatic labeling sadly cannot be trusted to produce what
>      > you'd expect.
>      >
>      >> Also, for every sentence, when checking the RAWMARYXML, there is a
>      >> \n at the beginning of the utterance; I don't know why. For example:
>      >>
>      >>
>      >>
>      >>
>      >> <?xml version="1.0" encoding="UTF-8" ?>
>      >> <maryxml version="0.4"
>      >> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>      >> xmlns="http://mary.dfki.de/2002/MaryXML"
>      >> xml:lang="fr">
>      >> <boundary  breakindex="2" duration="100"/>
>      >>
>      >> il savait que le peuple le soutenait, ce qui lui a donné confiance.
>      >> </maryxml>
>      >>
>      >
>      > Shouldn't make a difference, IMHO. But you could try and see if it does.
>      >
>      >>
>      >>
>      >>
>      >> I don't know if it's a problem, though: when I correct this, it says
>      >> the problem still remains. The lab files look pretty OK compared with
>      >> what I hear in the wav file:
>      >>
>      >>
>      >>
>      >>
>      >> format: end time. unit index. phone
>      >> #
>      >> 1.120000 0 _
>      >> 1.120000 1 _
>      >
>      > That looks a bit suspicious. Normally, zero-length intervals should be
>      > deleted. Might be a problem with PhoneLabelFeatureAligner. Did you run
>      > the PhoneUnitLabelComputer?
>      >
>      > Perhaps you could try and see if you can remove those zero-length
>      > intervals, and whether it magically works afterwards?
>      >
>      >> 1.175000 2 i
>      >> 1.190000 3 l
>      >> 1.325000 4 s
>      >> 1.400000 5 a
>      >> 1.435000 6 v
>      >> 1.520000 7 E
>      >> 1.600000 8 k
>      >> 1.615000 9 @
>      >> 1.670000 10 l
>      >> 1.705000 11 @
>      >> 1.800000 12 p
>      >> 1.885000 13 9
>      >> 1.995000 14 p
>      >> 2.010000 15 l
>      >> 2.030000 16 @
>      >> 2.130000 17 l
>      >> 2.155000 18 @
>      >> 2.230000 19 s
>      >> 2.305000 20 u
>      >> 2.400000 21 t
>      >> 2.415000 22 @
>      >> 2.475000 23 n
>      >> 2.575000 24 E
>      >> 2.835000 25 _
>      >> 2.910000 26 s
>      >> 2.935000 27 @
>      >> 3.025000 28 k
>      >> 3.040000 29 i
>      >> 3.055000 30 l
>      >> 3.150000 31 H
>      >> 3.165000 32 i
>      >> 3.210000 33 a
>      >> 3.285000 34 d
>      >> 3.305000 35 0
>      >> 3.370000 36 n
>      >> 3.450000 37 e
>      >> 3.525000 38 k
>      >> 3.585000 39 o~
>      >> 3.655000 40 f
>      >> 3.755000 41 j
>      >> 3.885000 42 a~
>      >> 4.110000 43 s
>      >> 4.720000 44 _
>      >>
>      >>
>      >>
>      >>
>      >>
>      >> Is there a way to fix this without having to re-record the
>      >> 3000-sentence corpus? I think the quality won't be as good as it should
>      >> be, but I need a result in the next few days; I would obviously record
>      >> it again later.
>      >
>      > As I said, the problem is unlikely to be caused by the recordings,
>      > unless the audio or speech quality is exceptionally bad. It's the
>      > labels, and the features extracted using them.
>      >
>      > Best wishes,
>      >
>      > -Ingmar
>      >
>      >>
>      >> Thanks in advance,
>      >>
>      >>
>      >>
>      >>
>      >> Florent
>      >>
>      >>
>      >>
>      >>
>      >> _______________________________________________
>      >> Mary-users mailing list
>      >> Mary-users at dfki.de <mailto:Mary-users at dfki.de>
>      >> http://www.dfki.de/mailman/cgi-bin/listinfo/mary-users
>      >
>      > --
>      > Ingmar Steiner
>      > Postdoctoral Researcher
>      >
>      > LORIA Speech Group, Nancy, France
>      > National Institute for Research in
>      > Computer Science and Control (INRIA)
>      >
>
>
>
>
>
> --
> ------------------------
> Sathish Chandra Pammi, Researcher
> Web: http://www.dfki.de/~chandra/
> DFKI GmbH,
> Saarbrücken, Germany
> Tel: +49-17624869114
>
>
>

-- 
Ingmar Steiner
Postdoctoral Researcher

LORIA Speech Group, Nancy, France
National Institute for Research in
Computer Science and Control (INRIA)