[mary-dev] Lexicon questions

Marc Schroeder marc.schroeder at dfki.de
Thu Oct 28 18:48:45 CEST 2010

Hi Fabio,

Great to see that you are making progress.

Yes, "-" is a syllable boundary.

The text input format currently used by the Transcription GUI does not 
support POS information, but the binary FST format does. To get 
something simple working quickly, I suggest discarding the POS in a 
first attempt, and then using code such as the CMUDict import 
(/OpenMary/java/marytts/tools/newlanguage/en_US/CMUDict2MaryFST.java) as 
a model for the full conversion.
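
As a rough sketch of the "discard the POS" step: the snippet below 
assumes, purely for illustration, that each line of the Italian lexicon 
has three whitespace-separated fields (word, POS tag, transcription). 
The real layout may differ, and the function name drop_pos and the 
sample entry are made up.

```python
def drop_pos(line, pos_field=1):
    """Remove the POS field from one lexicon line.

    Assumed (hypothetical) layout: word, POS tag, transcription,
    separated by whitespace. Adjust pos_field to the real layout.
    """
    fields = line.split()
    if len(fields) < 3:
        # Fewer fields than the assumed layout: leave the entry alone.
        return line.strip()
    del fields[pos_field]
    return " ".join(fields)


# Made-up entry, with "-" as the syllable boundary.
print(drop_pos("casa NOUN 'ka-za"))
```

Lines with fewer than three fields are passed through unchanged, so 
entries without a POS tag are not damaged.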

If you proceed like this, it doesn't matter how many of your lexicon 
entries you import via the Transcription GUI, because it is only a 
test-run anyway. Try 10,000 entries first, then 100,000, and see if that 
works OK.
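
A minimal sketch of how such a test subset could be drawn, assuming the 
lexicon has already been read into a list with one entry per line. The 
function name sample_entries and the fixed seed are illustrative 
choices, not part of the MARY tools.

```python
import random


def sample_entries(entries, n, seed=0):
    """Return n randomly chosen entries, kept in their original order.

    A fixed seed makes the test run repeatable. If n is at least the
    size of the lexicon, all entries are returned.
    """
    if n >= len(entries):
        return list(entries)
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(entries)), n))
    return [entries[i] for i in chosen]


# Stand-in data; a real run would use the lexicon lines instead.
lexicon = ["entry%d" % i for i in range(1000)]
print(len(sample_entries(lexicon, 100)))
```

A random sample is used here rather than the first n lines because a 
sorted lexicon would otherwise yield only words from the start of the 
alphabet.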

We usually train the letter-to-sound (LTS) rules on all entries. Do you 
see a reason for selecting just a subset?


On 28.10.10 18:34, Fabio Tesser wrote:
> Hi,
> I have some questions about the lexicon building process.
> Looking at the German lexicon file 'de.txt', I gather that '-' is
> the symbol used for syllable separation. Am I right?
> I have a lexicon for Italian that includes part-of-speech information.
> Does the openmary lexicon format support that?
> The Italian lexicon contains 440000 words. I see that the 'de.txt'
> German file has 36000 words, and that not all the words have a
> transcription.
> I suppose this file is used in the Transcription Tool to create the LTS
> rules, and the full transcribed lexicon is stored in a Finite State
> Transducer format.
> Any suggestion for the Italian case?
> I imagine that I could select some (how many?) words from the lexicon in
> order to build the LTS rules
> (http://mary.opendfki.de/wiki/TranscriptionTool) and then use the whole
> file to build the Finite State Transducer lexicon (is there documentation
> for this?).
> Thanks in advance.
> Best,
> Fabio.

Dr. Marc Schröder, Senior Researcher at DFKI GmbH
Coordinator EU FP7 Project SEMAINE http://www.semaine-project.eu
Project leader for DFKI in SSPNet http://sspnet.eu
Project leader PAVOQUE http://mary.dfki.de/pavoque
Associate Editor IEEE Trans. Affective Computing http://computer.org/tac
Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
Portal Editor http://emotion-research.net
Team Leader DFKI TTS Group http://mary.dfki.de

Homepage: http://www.dfki.de/~schroed
Email: marc.schroeder at dfki.de
Phone: +49-681-85775-5303
Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3, D-66123 
Saarbrücken, Germany
