[mary-users] HMMVoiceMakeData - alignment problems

Ingmar Steiner ingmar.steiner at inria.fr
Fri Jul 29 14:58:53 CEST 2011


Dear Florent,

On 29.07.2011 01:46, fxavier at ircam.fr wrote:
> Dear all,
>
> Again I'm encountering some problems. This time with the HMMVoiceMakeData.
> It says:
[...]
> FeatureDefinition extracted from context file:
> /home/florent/FlorentVoice//phonefeatures/utt_2513.pfeats
> The following are other context features used for training Hmms:
>    pos_in_syl  f0
>    position_type  f1
>    prev_syl_break  f2
>    syl_break  f3
> The previous context features were extracted from file:
> /home/florent/FlorentVoice/mary/hmmFeatures.txt
> Extracting monophone and context features (1): utt_2513.pfeats and
> utt_2513.lab
> java.lang.Exception: The component HMMVoiceMakeData produced the following
> exception:
> 	at
> marytts.tools.voiceimport.DatabaseImportMain$8.run(DatabaseImportMain.java:294)
> Caused by: java.lang.Exception: Error: Number of context features in:
> /home/florent/FlorentVoice//phonefeatures/utt_2513.pfeats is not the same
> as the number of labels in:
> /home/florent/FlorentVoice//phonelab/utt_2513.lab
> 	at
> marytts.tools.voiceimport.HMMVoiceMakeData.extractMonophoneAndFullContextLabels(HMMVoiceMakeData.java:988)
> 	at
> marytts.tools.voiceimport.HMMVoiceMakeData.makeLabels(HMMVoiceMakeData.java:733)
> 	at
> marytts.tools.voiceimport.HMMVoiceMakeData.compute(HMMVoiceMakeData.java:233)
> 	at
> marytts.tools.voiceimport.DatabaseImportMain$8.run(DatabaseImportMain.java:291)
>
> The fact is that PhoneLabelFeatureAligner failed. For 2992 out of 3000
> sentences there's an alignment problem:

Not good. There must be something systematically wrong with the labeling.
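One way to narrow this down is to count units on both sides yourself. Here is a minimal sketch; it assumes (without checking against the MaryTTS source) that each .pfeats file lists one feature vector per unit after its final blank line, and that each .lab file lists one "end_time index phone" line after its header:

```python
# Hedged sketch: flag utterances where the number of feature vectors in
# phonefeatures/*.pfeats differs from the number of labels in phonelab/*.lab.
# The file-layout assumptions above are mine, not verified against MaryTTS.

from pathlib import Path

def count_pfeats_units(path):
    # Assumption: the feature vectors are the section after the last blank line.
    text = Path(path).read_text()
    section = text.rstrip().split("\n\n")[-1]
    return len(section.splitlines())

def count_lab_units(path):
    # Assumption: label lines have exactly 3 fields and start with a digit.
    lines = Path(path).read_text().splitlines()
    return sum(1 for l in lines
               if len(l.split()) == 3 and l.split()[0][0].isdigit())

def find_mismatches(root):
    """Return the stems of utterances whose unit counts disagree."""
    bad = []
    for pf in sorted(Path(root, "phonefeatures").glob("*.pfeats")):
        lab = Path(root, "phonelab", pf.stem + ".lab")
        if lab.exists() and count_pfeats_units(pf) != count_lab_units(lab):
            bad.append(pf.stem)
    return bad
```

Running something like `find_mismatches("/home/florent/FlorentVoice")` should tell you whether the mismatch is uniform (a systematic labeling bug) or varies per utterance.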

>
> (...)
>     utt_999 Adding pause unit in labels before unit 1
>   Feature file is longer than label file:  unit 41 and greater do not exist
> in label file
> Remaining problems: 2991
>      utt_1:  Feature file is longer than label file:  unit 45 and greater
> do not exist in label file
>   ->  Skipping all utterances ! The problems remain.
> Removed [0/2999] utterances from the list, [2999] utterances remain, among
> which [2991/2999] still have problems.
>
> Is this the reason why the HMMVoiceMakeData fails?

More than likely.

> What do you think is the
> problem, the corpus? Should I record it again?

Absolutely not! The acoustics should not be a problem (unless it's 
unusually bad quality or lots of mistakes from the speaker, of course).

> When I recorded it I tried to
> be very natural (schwa elision, mandatory and non-mandatory liaisons, a
> fairly fast flow), but the sound quality is very good (mono 16 kHz; I used
> AudioConverterGUI to trim the silences and normalize). Maybe the
> phonetic transcriptions didn't take all these typical syntactic phenomena
> into account exactly as they should, so the EHMMLabeler didn't label it
> right...

Indeed, the automatic labeling sadly cannot be trusted to produce what 
you'd expect.

> And, for every sentence, when checking the RAWMARYXML there is a
> \n at the beginning of the utterance; I don't know why. For example:
>
> <?xml version="1.0" encoding="UTF-8" ?>
> <maryxml version="0.4"
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xmlns="http://mary.dfki.de/2002/MaryXML"
> xml:lang="fr">
> <boundary  breakindex="2" duration="100"/>
>
> il savait que le peuple le soutenait, ce qui lui a donné confiance.
> </maryxml>
>

Shouldn't make a difference, IMHO. But you could try and see if it does.

>
> Don't know if it's a problem though; when I correct this, it says the
> problem still remains. The .lab files look pretty OK according to what I
> hear in the wav file:
>
> format: end time. unit index. phone
> #
> 1.120000 0 _
> 1.120000 1 _

That looks a bit suspicious. Normally, zero-length intervals should be 
deleted. Might be a problem with PhoneLabelFeatureAligner. Did you run 
the PhoneUnitLabelComputer?

Perhaps you could try and see if you can remove those zero-length 
intervals, and whether it magically works afterwards?
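Removing them could be scripted. A rough sketch, assuming the "end_time unit_index phone" layout quoted in your mail (not tested against real voice-building data):

```python
# Rough sketch: drop zero-length intervals from a MaryTTS-style .lab file
# and renumber the remaining units. The three-field line layout is assumed
# from the example in this thread.

def clean_lab_lines(lines):
    """Keep header/comment lines; drop any interval whose end time equals
    the previous interval's end time (i.e. zero length)."""
    cleaned = []
    prev_end = None
    index = 0
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or not parts[0][0].isdigit():
            cleaned.append(line)      # header or comment line: pass through
            continue
        end = float(parts[0])
        if prev_end is not None and end == prev_end:
            continue                  # zero-length interval: drop it
        cleaned.append(f"{end:.6f} {index} {parts[2]}")
        prev_end = end
        index += 1
    return cleaned
```

Applied to the excerpt below, this would collapse the two identical `1.120000` pause entries into one and renumber everything after it.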

> 1.175000 2 i
> 1.190000 3 l
> 1.325000 4 s
> 1.400000 5 a
> 1.435000 6 v
> 1.520000 7 E
> 1.600000 8 k
> 1.615000 9 @
> 1.670000 10 l
> 1.705000 11 @
> 1.800000 12 p
> 1.885000 13 9
> 1.995000 14 p
> 2.010000 15 l
> 2.030000 16 @
> 2.130000 17 l
> 2.155000 18 @
> 2.230000 19 s
> 2.305000 20 u
> 2.400000 21 t
> 2.415000 22 @
> 2.475000 23 n
> 2.575000 24 E
> 2.835000 25 _
> 2.910000 26 s
> 2.935000 27 @
> 3.025000 28 k
> 3.040000 29 i
> 3.055000 30 l
> 3.150000 31 H
> 3.165000 32 i
> 3.210000 33 a
> 3.285000 34 d
> 3.305000 35 0
> 3.370000 36 n
> 3.450000 37 e
> 3.525000 38 k
> 3.585000 39 o~
> 3.655000 40 f
> 3.755000 41 j
> 3.885000 42 a~
> 4.110000 43 s
> 4.720000 44 _
>
> Is there a way to fix this without having to re-record the 3000-sentence
> corpus? I think the quality won't be as good as it should be, but I need a
> result in the next few days; I would obviously record it again later.

As I said, the problem is unlikely to be caused by the recordings, 
unless the audio or speech quality is exceptionally bad. It's the 
labels, and the features extracted using them.

Best wishes,

-Ingmar

>
> Thanks in advance,
>
> Florent
>
>
> _______________________________________________
> Mary-users mailing list
> Mary-users at dfki.de
> http://www.dfki.de/mailman/cgi-bin/listinfo/mary-users

-- 
Ingmar Steiner
Postdoctoral Researcher

LORIA Speech Group, Nancy, France
National Institute for Research in
Computer Science and Control (INRIA)

