[mary-dev] Lexicon questions

Marc Schroeder marc.schroeder at dfki.de
Fri Oct 29 14:21:27 CEST 2010

Hi Fabio,

> I have proposed to use just a subset because I noticed that de.txt
> contains at the end words without transcription.
> I thought that somehow the words without transcription in this file
> could be used as a test set for the LTS rules.
> What's the meaning of the words without transcription in de.txt?

No, they are not meant as a test set. You can think of them as "work in
progress": we have identified these words as relevant for the lexicon,
but they have not been transcribed yet. In fact, the "train and predict"
button in the GUI serves exactly this purpose: based on what has already
been transcribed, it proposes a transcription for each untranscribed
word, and once the human user accepts (or corrects) a proposal, it is
stored at the next save.
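
To make the transcribed/untranscribed distinction concrete, here is a
minimal sketch, not the GUI's actual code. It assumes one entry per line,
with the word first and the transcription (if any) after whitespace; that
layout is an assumption and is not spelled out in this thread. The function
names and the sample entries are made up for illustration and are not taken
from de.txt. Only the use of "-" as syllable boundary comes from the
discussion quoted below.

```python
from typing import Iterable, List, Tuple


def split_lexicon(lines: Iterable[str]) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Partition lexicon lines into transcribed and untranscribed entries.

    Assumed (hypothetical) format: "<word> <transcription>" per line,
    where the transcription may be missing for work-in-progress words.
    """
    transcribed: List[Tuple[str, str]] = []
    untranscribed: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            transcribed.append((parts[0], parts[1].strip()))
        else:
            untranscribed.append(parts[0])
    return transcribed, untranscribed


def syllables(transcription: str) -> List[str]:
    """Split a transcription at '-' syllable boundaries."""
    return [s.strip() for s in transcription.split("-") if s.strip()]


# Illustrative sample lines only; the transcriptions are invented.
sample = ["Haus h aU s", "Wasser v a - s 6", "Beispielwort"]
done, todo = split_lexicon(sample)
print(len(done), "transcribed;", len(todo), "still to be transcribed")
```

In this picture, the entries in `done` are what the LTS rules would be
trained on, and the words in `todo` are the ones for which a transcription
would be proposed.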


> Best,
> Fabio.
> On 10/28/2010 06:48 PM, Marc Schroeder wrote:
>> Hi Fabio,
>> great to see that you are making progress.
>> Yes, "-" is a syllable boundary.
>> The current input text format to the Transcription GUI does not
>> support POS information, but the binary FST format does. So my
>> suggestion would be, in order to get something simple to work quickly,
>> to discard the POS in a first attempt, and then look at code like the
>> CMUDict import
>> (/OpenMary/java/marytts/tools/newlanguage/en_US/CMUDict2MaryFST.java)
>> as a model to do the full conversion.
>> If you proceed like this, it doesn't matter how many of your lexicon
>> entries you import via the Transcription GUI, because it is only a
>> test-run anyway. Try 10,000 entries first, then 100,000, and see if
>> that works OK.
>> We usually train the LTS on all entries. Do you see a reason for
>> selecting just a subset?
>> Best,
>> Marc
>> On 28.10.10 18:34, Fabio Tesser wrote:
>>> Hi,
>>> I have some questions about the lexicon building process.
>>> Looking at the German lexicon file 'de.txt', I assume that '-' is
>>> the symbol used for syllable separation. Am I right?
>>> I have a lexicon for Italian that includes part-of-speech information.
>>> Does the openmary lexicon format support that?
>>> The Italian lexicon contains 440000 words. I see that the 'de.txt'
>>> German file has 36000 words, and that not all of the words have a
>>> transcription.
>>> I suppose this file is used in the Transcription Tool to create the LTS
>>> rules, and the full transcribed lexicon is stored in a Finite State
>>> Transducer format.
>>> Any suggestions for the Italian case?
>>> I imagine that I could select some (how many?) words from the lexicon in
>>> order to build the LTS rules
>>> (http://mary.opendfki.de/wiki/TranscriptionTool) and then use the whole
>>> file to build the Finite State Transducer lexicon (is there documentation
>>> for this?).
>>> Thanks in advance.
>>> Best,
>>> Fabio.
>>> _______________________________________________
>>> Mary-dev mailing list
>>> Mary-dev at dfki.de
>>> http://www.dfki.de/mailman/cgi-bin/listinfo/mary-dev

Dr. Marc Schröder, Senior Researcher at DFKI GmbH
Coordinator EU FP7 Project SEMAINE http://www.semaine-project.eu
Project leader for DFKI in SSPNet http://sspnet.eu
Project leader PAVOQUE http://mary.dfki.de/pavoque
Associate Editor IEEE Trans. Affective Computing http://computer.org/tac
Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
Portal Editor http://emotion-research.net
Team Leader DFKI TTS Group http://mary.dfki.de

Homepage: http://www.dfki.de/~schroed
Email: marc.schroeder at dfki.de
Phone: +49-681-85775-5303
Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3, D-66123 
Saarbrücken, Germany
Official DFKI coordinates:
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
