[mary-dev] mary xml

Tue Jun 7 17:49:51 CEST 2011

Dear All,

I managed to split the output of lia_phon (a French phonemizer) into three
parts:

- the words
- the phonemes
- the part of speech

The phonemes were in the lia_phon format. For example, lia_phon transcribes
the word "toujours" as "ttoujjourr". I therefore converted the lia_phon
format into SAMPA.
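
To illustrate the conversion, here is a minimal Python sketch (not our actual code). It assumes that every lia_phon code is exactly two characters long, which is consistent with the example in this message, and the table only contains the codes that appear in that example, so it is far from complete. The names LIAPHON_TO_SAMPA and liaphon_to_sampa are hypothetical.

```python
# Partial lia_phon -> SAMPA table, inferred only from the example in this
# message. A real table needs the full lia_phon phoneme set.
LIAPHON_TO_SAMPA = {
    "tt": "t", "ou": "u", "jj": "Z", "rr": "R", "ll": "l",
    "ww": "w", "ii": "i", "ai": "E", "ff": "f", "nn": "n",
    "aa": "a", "ss": "s", "an": "a~", "pp": "p", "##": "_",
}


def liaphon_to_sampa(code: str) -> str:
    """Convert one lia_phon transcription, read as two-character codes."""
    if len(code) % 2 != 0:
        raise ValueError(f"odd-length lia_phon string: {code!r}")
    out = []
    for i in range(0, len(code), 2):
        pair = code[i:i + 2]
        if pair not in LIAPHON_TO_SAMPA:
            # Assumption: unknown codes are reported rather than guessed.
            raise ValueError(f"unknown lia_phon code: {pair!r}")
        out.append(LIAPHON_TO_SAMPA[pair])
    return "".join(out)
```

With this table, liaphon_to_sampa("ttoujjourr") gives "tuZuR", which matches the SAMPA output shown further down.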

For example, if we give our code the following input text:

"J'ai toujours l'ouïe très fine à 100%."

We get the following output:

- the raw lia_phon transcription (not used further, shown only for reference):

<transcription><s> ## [ZTRM->EXCEPTION]
j' jj [PPER1S]
ai ai [VA1S]
toujours ttoujjourr [ADV]
l' ll [DETFS]
ouïe wwii [NFS]
très ttrrai [ADV]
fine ffiinn [AFS]
à aa [PREPADE]
cent ssan [CHIF]
pour_cent ppourrssan [NMS]
pause ##
</s> ## [ZTRM->EXCEPTION]
<FIN> ???? []
</transcription>

- the words (as a single string, not an array):

<s>
j'
ai
toujours
l'
ouïe
très
fine
à
cent
pour_cent
.
</s>

- the SAMPA phonemes (one string):

_
Z
E
tuZuR
l
wi
tRE
fin
a
sa~
puRsa~
_
_

- and the part of speech (one string):

[ZTRM->EXCEPTION]
[PPER1S]
[VA1S]
[ADV]
[DETFS]
[NFS]
[ADV]
[AFS]
[PREPADE]
[CHIF]
[NMS]
[SILENT]
[ZTRM->EXCEPTION]
[]
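
The split of the raw transcription into three columns can be sketched roughly as follows. This is only an illustration in Python, not our actual code: it assumes each line has the form "word phonemes [POS]" with an optional POS field, and the names LINE_RE and parse_liaphon are hypothetical. It returns the raw lia_phon fields and does not reproduce the post-processing visible above (for example "pause" becoming "." with the tag [SILENT]).

```python
import re

# word, phoneme string, then an optional POS tag in square brackets
LINE_RE = re.compile(r"^(\S+)\s+(\S+)(?:\s+\[(.*)\])?$")


def parse_liaphon(raw: str):
    """Return a list of (word, phonemes, pos) tuples from raw lia_phon text.

    pos is None when the line has no bracketed tag (e.g. "pause ##")
    and "" when the brackets are empty (e.g. "<FIN> ???? []").
    """
    rows = []
    for line in raw.splitlines():
        # Drop the wrapper tags that surround the transcription block.
        line = line.replace("<transcription>", "")
        line = line.replace("</transcription>", "").strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"unexpected lia_phon line: {line!r}")
        rows.append(match.groups())
    return rows
```

The three strings can then be obtained by joining each column with newlines, for example "\n".join(word for word, _, _ in rows).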

The next step is to create the MaryXML from this data. How should I do
this, given that we have three strings (words, phonemes, and POS)? I see
there is a class called TextToMaryXML; should I use it? If so, could you
please tell me how?
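
To make the question concrete, here is a rough Python sketch of what I have in mind: zip the three newline-separated strings and write one token element per word. Everything about the target format here is an assumption on my side that needs to be checked against the MaryXML schema: the namespace URI, the version value, and the use of a "t" element with "ph" and "pos" attributes. The function name build_maryxml is hypothetical, and the sketch does not use TextToMaryXML at all.

```python
import xml.etree.ElementTree as ET

MARYXML_NS = "http://mary.dfki.de/2002/MaryXML"  # assumed namespace URI
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # serialized as xml:lang

# Serialize the MaryXML namespace as the default (prefix-less) namespace.
ET.register_namespace("", MARYXML_NS)


def build_maryxml(words: str, phones: str, tags: str, lang: str = "fr") -> str:
    """Combine three newline-separated strings into one MaryXML-like document."""
    root = ET.Element(f"{{{MARYXML_NS}}}maryxml",
                      {"version": "0.5", XML_LANG: lang})  # version is assumed
    para = ET.SubElement(root, f"{{{MARYXML_NS}}}p")
    sent = ET.SubElement(para, f"{{{MARYXML_NS}}}s")
    # zip stops at the shortest list, so a trailing extra POS entry
    # (such as the final "[]") is silently ignored.
    rows = zip(words.split("\n"), phones.split("\n"), tags.split("\n"))
    for word, phone, tag in rows:
        if word in ("<s>", "</s>"):
            continue  # sentence markers become the enclosing <s> element
        tok = ET.SubElement(sent, f"{{{MARYXML_NS}}}t")
        tok.text = word
        if phone != "_":
            tok.set("ph", phone)  # silence "_" gets no transcription
        tok.set("pos", tag.strip("[]"))
    return ET.tostring(root, encoding="unicode")
```

The "ph" value is copied as is. As far as I understand, MARY may expect phonemes separated by spaces with syllable boundaries marked, so the unsegmented SAMPA strings above probably need a further segmentation step.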

Also, is there a standard for part-of-speech tags? As you can see, the POS
tags from lia_phon are specific to it. But perhaps this does not matter,
since the system is trained with its own tags?

Best regards,

Florent