[mary-dev] Basic NLP module for Italian

Thu Nov 4 12:35:59 CET 2010

Hello,

I would like to ask you for some information and suggestions.
I have a lexicon with transcriptions in the OpenMary format and with 
TranscriptionTool I am able to generate the LTS rules and the lexicon in 
FST format.
I also have a lexicon with POS information that could be used as input 
for marytts.tools.newlanguage.LexiconCreator for the full conversion.

I think the next step is to obtain a MinimalisticPosTagger for Italian. 
Is there something else missing in order to have a first basic NLP module?

I know TranscriptionTool uses the functional word flag to build a POS 
tagger. I have a list of functional words for Italian and, as a first 
step, I could use these in TranscriptionTool. Is there any way to pass 
this information in the input transcription file (e.g. abaco a1-ba-ko 
functional)?
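
To make the idea concrete, here is a minimal sketch of how I could 
produce such a file from my word list. It assumes the three-column 
layout proposed above (word, transcription, optional "functional" 
marker), which is only my guess and not a format I know 
TranscriptionTool to accept; the class and method names are mine and 
not part of MARY.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of MARY: marks lexicon entries whose
// headword appears in a list of functional words.
class FunctionalFlagger {

    // Each entry is expected to look like "word transcription".
    // Returns the entries in the same order, with " functional"
    // appended when the first token is in functionalWords.
    // Blank lines are skipped.
    static List<String> flag(List<String> entries, Set<String> functionalWords) {
        List<String> out = new ArrayList<>();
        for (String entry : entries) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            // The headword is everything before the first whitespace.
            String headword = trimmed.split("\\s+")[0];
            if (functionalWords.contains(headword)) {
                out.add(trimmed + " functional");
            } else {
                out.add(trimmed);
            }
        }
        return out;
    }
}
```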

This is a first plan, but given that I also have the POS information 
ready for the full conversion, do you suggest that I already build a POS 
tagger able to give other information?

Another, more dev-specific question. The Italian lexicon has 400000 
entries. I have successfully obtained the LTS rules with 100000 entries 
using TranscriptionTool, but I get an out-of-memory error with 400000. I 
will try running it again with more memory, using the java -Xmxn flag.
Anyway, the question is about svn commit rules: should I also commit 
these large files (it.txt, it_lexicon.dict, it_lexicon.fst), which will 
perhaps be replaced by new ones?
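
On the memory side, a small check like the following could confirm that 
the larger heap is really in effect before I rerun the full lexicon. It 
is only a sketch: HeapCheck is a name I made up, not a MARY class, and 
it relies solely on the standard Runtime.maxMemory() call.

```java
// Hypothetical helper, not part of MARY: reports the maximum heap the
// JVM will try to use, so the effect of -Xmx can be verified.
class HeapCheck {

    // Maximum heap size in megabytes, as reported by the running JVM.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it as "java -Xmx2g HeapCheck" (2g is just an example value) 
should report a figure close to the requested size.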

Best Regards,
Fabio.