[mary-dev] Lexicon questions

Fabio Tesser fabio.tesser at gmail.com
Fri Oct 29 11:26:15 CEST 2010


Hi Marc,

thanks for the hints.

I proposed using just a subset because I noticed that de.txt contains
words without a transcription at the end of the file.
I thought these untranscribed words could somehow be used as a test set
for the LTS rules.
What is the purpose of the words without transcription in de.txt?
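
To illustrate the idea, here is a minimal Python sketch that separates
transcribed entries from bare words. It assumes, without confirmation for
de.txt, that each line holds a word followed by whitespace and then the
transcription, and that a line holding only a word has no transcription.
The function name split_lexicon and the sample entries are hypothetical.

```python
def split_lexicon(lines):
    """Separate entries that have a transcription from bare words.

    Assumed line format (not confirmed for de.txt):
        word<whitespace>transcription
    A line containing only a word is treated as untranscribed.
    """
    transcribed = {}
    untranscribed = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first run of whitespace only, so the
        # transcription keeps its internal spaces and '-' boundaries.
        parts = line.split(None, 1)
        if len(parts) == 2:
            word, transcription = parts
            transcribed[word] = transcription
        else:
            untranscribed.append(parts[0])
    return transcribed, untranscribed


# Made-up sample entries, for illustration only.
sample = [
    "Haus ' h aU s",
    "Wasser ' v a - s 6",
    "Zeitgeist",
    "",
]
train, test = split_lexicon(sample)
print(len(train), "transcribed;", len(test), "untranscribed")
```

For a real file, the same function could be called on an open file
object, since it only iterates over lines.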

Best,
Fabio.


On 10/28/2010 06:48 PM, Marc Schroeder wrote:
> Hi Fabio,
>
> great to see that you are making progress.
>
> Yes, "-" is a syllable boundary.
>
> The current input text format to the Transcription GUI does not 
> support POS information, but the binary FST format does. So my 
> suggestion would be, in order to get something simple to work quickly, 
> to discard the POS in a first attempt, and then look at code like the 
> CMUDict import 
> (/OpenMary/java/marytts/tools/newlanguage/en_US/CMUDict2MaryFST.java) 
> as a model to do the full conversion.
>
> If you proceed like this, it doesn't matter how many of your lexicon 
> entries you import via the Transcription GUI, because it is only a 
> test-run anyway. Try 10,000 entries first, then 100,000, and see if 
> that works OK.
>
> We usually train the LTS on all entries. Do you see a reason for 
> selecting just a subset?
>
> Best,
> Marc
>
> On 28.10.10 18:34, Fabio Tesser wrote:
>> Hi,
>>
>> I have some questions about the lexicon building process.
>>
>> Looking at the German lexicon file 'de.txt', I assume that '-' is
>> the symbol used for syllable separation. Am I right?
>> I have a lexicon for Italian that includes part-of-speech information.
>> Does the openmary lexicon format support that?
>>
>> The Italian lexicon contains 440000 words. I see that the 'de.txt'
>> German file has 36000 words, and that not all of the words have a
>> transcription.
>> I suppose this file is used in the Transcription Tool to create the LTS
>> rules, and the full transcribed lexicon is stored in a Finite State
>> Transducer format.
>>
>> Any suggestions for the Italian case?
>> I imagine that I could select some (how many?) words from the lexicon in
>> order to build the LTS rules
>> (http://mary.opendfki.de/wiki/TranscriptionTool) and then use the whole
>> file to build the Finite State Transducer lexicon (is there documentation
>> for this?).
>>
>> Thanks in advance.
>>
>> Best,
>> Fabio.
>>
>> _______________________________________________
>> Mary-dev mailing list
>> Mary-dev at dfki.de
>> http://www.dfki.de/mailman/cgi-bin/listinfo/mary-dev
>

