[mary-users] MARY can not be trained with big corpus (12252 utterances - 20 hours)

Ramil Məmmədov ramil2319 at gmail.com
Wed Jan 14 06:33:49 CET 2015


Hello everyone,

I want to train MARY on a large corpus (12252 utterances, about 20 hours of
speech). The training fails at DurationCARTTrainer with an exception raised
while loading the timeline_waveforms.mry file: "File seems corrupt: index is
expected after data, not before".

java.lang.Exception: The component DurationCARTTrainer produced the following exception:
    at marytts.tools.voiceimport.DatabaseImportMain$8.run(DatabaseImportMain.java:297)
Caused by: marytts.exceptions.MaryConfigurationException: Cannot load timeline file from /home/marytts-master/myvoice/timeline_waveforms.mry
    at marytts.unitselection.data.TimelineReader.<init>(TimelineReader.java:128)
    at marytts.unitselection.data.TimelineReader.<init>(TimelineReader.java:105)
    at marytts.tools.voiceimport.DurationCARTTrainer.compute(DurationCARTTrainer.java:136)
    at marytts.tools.voiceimport.DatabaseImportMain$8.run(DatabaseImportMain.java:294)
Caused by: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:745)
    at marytts.unitselection.data.TimelineReader.load(TimelineReader.java:219)
    at marytts.unitselection.data.TimelineReader.<init>(TimelineReader.java:126)

The timeline_waveforms.mry file is 2.3 GB in size.
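For context, the innermost cause in the trace comes from java.nio: a single FileChannel.map call cannot cover more than Integer.MAX_VALUE (2147483647) bytes, and a file of about 2.3 GB is larger than that. The following is a minimal sketch of that limit, not MARY code. The class name MapLimitDemo, the helper names, the 2_300_000_000 byte figure (an approximation of the reported 2.3 GB) and the 1 GiB window size are all assumptions made for illustration. Whether TimelineReader could be changed to map the file in several smaller windows is not something the trace answers.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MapLimitDemo {

    /** Largest region a single FileChannel.map call accepts, in bytes. */
    static final long MAX_SINGLE_MAPPING = Integer.MAX_VALUE; // 2147483647

    /** True if a file of the given size fits into one mapped buffer. */
    static boolean fitsInSingleMapping(long fileSizeBytes) {
        return fileSizeBytes <= MAX_SINGLE_MAPPING;
    }

    /** Number of windows of at most windowBytes needed to cover the file. */
    static long windowsNeeded(long fileSizeBytes, long windowBytes) {
        return (fileSizeBytes + windowBytes - 1) / windowBytes; // ceiling division
    }

    /**
     * Requests a read-only mapping of requestedBytes on an empty scratch file
     * and reports whether the JVM rejected the request with
     * IllegalArgumentException. Intended only for oversized requests: a small
     * request on the empty file fails with an IOException instead.
     */
    static boolean mapIsRejected(long requestedBytes) {
        try {
            Path scratch = Files.createTempFile("timeline-demo", ".bin");
            try (FileChannel ch = FileChannel.open(scratch, StandardOpenOption.READ)) {
                ch.map(FileChannel.MapMode.READ_ONLY, 0, requestedBytes);
                return false;
            } catch (IllegalArgumentException e) {
                return true; // same exception type as in the trace above
            } finally {
                scratch.toFile().delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long approxTimelineBytes = 2_300_000_000L; // assumed: "2.3 GB" as reported
        System.out.println("fits in one mapping: " + fitsInSingleMapping(approxTimelineBytes));
        System.out.println("map request rejected: " + mapIsRejected(approxTimelineBytes));
        System.out.println("1 GiB windows needed: " + windowsNeeded(approxTimelineBytes, 1L << 30));
    }
}
```

The sketch only shows why the mapping request is refused; it does not read or modify any MARY file.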

And here is the log file:

Opening GUI.......
Loading database layout:
Reading config file /home/marytts-master/myvoice/database.config
Reading external binaries config file /home/marytts-master/lib/external/externalBinaries.config
Loading basename list from file /home/marytts-master/myvoice/basenames.lst
Found 12252 files in basename list
Loading features from file /home/marytts-master/myvoice/mary/features.txt
---- Making a pitch synchronous waveform timeline

Processing [12252] utterances.

---- Detected a global sample rate of: [16000] Hz.
---- Folding the wav files according to the pitchmarks...
Will create the waveform timeline in file [/home/marytts-master/myvoice/mary/timeline_waveforms.mry].
00001
00002
00003
00004
00005
00006
00007
........
---- Done.
---- Waveform timeline result:
Number of files scanned: 12252
Total speech duration: [1055199702] samples / [65949.98] seconds.
(Speech duration approximated from EST Track float times: [65950.5] seconds.)
Number of frames: [11713104].
Size of the index: [659500] (10552000 bytes, i.e. 10.06 megs).
---- Waveform timeline done.
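
As a rough cross-check of the numbers in this log, the sketch below converts the reported sample count into hours and into a raw payload size. It assumes 16-bit (2-byte) samples and ignores any per-frame headers and the index; both are assumptions, not something the log states. The class name TimelineSizeEstimate and its helpers are hypothetical.

```java
final class TimelineSizeEstimate {

    /** Speech duration in hours for a sample count at the given rate. */
    static double hours(long samples, int sampleRateHz) {
        return samples / (double) sampleRateHz / 3600.0;
    }

    /** Raw audio payload in bytes, with no headers or index. */
    static long rawBytes(long samples, int bytesPerSample) {
        return samples * bytesPerSample;
    }

    /** Hours of raw audio that fit below Integer.MAX_VALUE bytes. */
    static double maxHoursInOneMapping(int sampleRateHz, int bytesPerSample) {
        long maxSamples = Integer.MAX_VALUE / bytesPerSample;
        return hours(maxSamples, sampleRateHz);
    }

    public static void main(String[] args) {
        long samples = 1_055_199_702L; // "Total speech duration" from the log
        int rate = 16_000;             // "global sample rate" from the log
        int bytesPerSample = 2;        // assumed: 16-bit samples

        System.out.printf("duration: %.2f hours%n", hours(samples, rate));
        System.out.println("raw payload bytes: " + rawBytes(samples, bytesPerSample));
        System.out.printf("raw audio under the 2 GiB limit: %.2f hours%n",
                maxHoursInOneMapping(rate, bytesPerSample));
    }
}
```

Under these assumptions the raw samples alone come to about 2.11 GB, just below Integer.MAX_VALUE, so whatever the file stores in addition to the samples would account for the rest of the reported 2.3 GB.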


Does anybody have a solution?


-- 
br,
Ramil Mammadov