[mary-dev] mary xml

Marc Schroeder marc.schroeder at dfki.de
Wed Jun 15 10:16:42 CEST 2011


Florent,

I guess I don't quite understand the question. As a model for what it 
should look like, you can take any of the existing languages and 
generate output in the PHONEMES format. For example, the English 
sentence "Hello world." looks like this in PHONEMES format:

<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.5" 
xml:lang="en-GB">
<p>
<voice name="dfki-poppy">
<s>
<t g2p_method="lexicon" ph="h @ - ' l @U" pos="UH">
Hello
</t>
<t g2p_method="lexicon" ph="' w r= l d" pos="NN">
world
</t>
<t pos=".">
.
</t>
</s>
</voice>
</p>
</maryxml>

Here the <voice> element and the g2p_method attributes are optional; you 
can ignore them.

So apart from the top and bottom of this, which must be the same except 
for xml:lang="fr", you must create a <t> element for every line in this 
lia_phon output:

> j' jj [PPER1S]

<t ph="Z" pos="PPER1S">j'</t>

> ai ai [VA1S]

<t ph="E" pos="VA1S">ai</t>

> toujours ttoujjourr [ADV]

<t ph="t u - ' Z u R" pos="ADV">toujours</t>

etc.
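
As a rough illustration of the first half of that mapping, here is a minimal sketch that splits one lia_phon output line into its word, its lia_phon phoneme string, and its pos tag. The class name LiaPhonLine and its fields are made up for this example; they are not part of MARY or lia_phon. It assumes the three fields are separated by whitespace and that the tag is in square brackets, as in the lines quoted here. Converting the lia_phon phoneme string to SAMPA is a separate step that is not shown.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: one parsed line of lia_phon output. */
final class LiaPhonLine {
    // word, whitespace, lia_phon phoneme string, whitespace, [TAG]
    private static final Pattern LINE =
            Pattern.compile("^(\\S+)\\s+(\\S+)\\s+\\[([^\\]]*)\\]\\s*$");

    final String word;
    final String liaPhonemes;
    final String pos;

    private LiaPhonLine(String word, String liaPhonemes, String pos) {
        this.word = word;
        this.liaPhonemes = liaPhonemes;
        this.pos = pos;
    }

    /**
     * Returns the parsed line, or null when the line does not have the
     * three expected fields or is a marker such as </s> or <FIN>.
     */
    static LiaPhonLine parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches() || m.group(1).startsWith("<")) {
            return null;
        }
        return new LiaPhonLine(m.group(1), m.group(2), m.group(3));
    }

    public static void main(String[] args) {
        LiaPhonLine l = parse("toujours ttoujjourr [ADV]");
        System.out.println(l.word + " / " + l.liaPhonemes + " / " + l.pos);
    }
}
```

Lines without a bracketed tag (such as "pause ##") come back as null, so the caller can simply skip them.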

Note that in the "toujours" example I have added syllabification and 
syllable stress (which is probably easy in French, since it is always 
the last syllable I think?).
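
If the last-syllable assumption holds, producing the ph value from a list of syllables could look like the sketch below. The class FrenchStress and the method toPh are hypothetical names for this example, and the syllabification itself is assumed to be done already. Following the three French examples above, monosyllabic words get no stress mark; that is a guess from those examples, not a documented rule.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: builds a ph attribute value from syllables. */
final class FrenchStress {
    /**
     * Joins the syllables with " - " and puts the stress mark "'" in front
     * of the last one. A single syllable is returned unmarked.
     */
    static String toPh(List<String> syllables) {
        StringBuilder sb = new StringBuilder();
        int last = syllables.size() - 1;
        for (int i = 0; i <= last; i++) {
            if (i > 0) {
                sb.append(" - ");
            }
            if (i == last && last > 0) {
                sb.append("' ");
            }
            sb.append(syllables.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPh(Arrays.asList("t u", "Z u R")));
    }
}
```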

The way to create this XML structure in Java is to use the DOM classes, 
via the convenient MaryXML helper class. Something like:

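// Imports needed by this fragment (MARY 4.x package names):
import java.util.Locale;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import marytts.datatypes.MaryData;
import marytts.datatypes.MaryDataType;
import marytts.datatypes.MaryXML;

Document marydoc = MaryXML.newDocument();
Element root = marydoc.getDocumentElement(); // <maryxml>
root.setAttribute("xml:lang", "fr"); // xml:lang="fr", as noted above
Element paragraph = MaryXML.appendChildElement(root, MaryXML.PARAGRAPH); // <p>
Element sentence = MaryXML.appendChildElement(paragraph, MaryXML.SENTENCE); // <s>
// assuming you have three arrays of the same length: words, sampa, and pos
for (int i = 0; i < words.length; i++) {
     Element t = MaryXML.appendChildElement(sentence, MaryXML.TOKEN); // <t>
     t.setTextContent(words[i]);
     t.setAttribute("ph", sampa[i]);
     t.setAttribute("pos", pos[i]);
}

// wrap the finished document as PHONEMES data for French
MaryData phonemesData = new MaryData(MaryDataType.PHONEMES, Locale.FRENCH, false);
phonemesData.setDocument(marydoc);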

Or similar.
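
To check the element layout without a MARY installation, the same structure can also be built with the plain JDK DOM API. The following is only a sketch under that assumption: the class PhonemesDomSketch and its methods are made up for this example and are not part of MARY, and the namespace and version values are copied from the PHONEMES example above. Inside MARY, the MaryXML helper remains the more convenient route.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical stand-alone sketch; not a MARY class. */
final class PhonemesDomSketch {
    static final String NS = "http://mary.dfki.de/2002/MaryXML";
    static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    /** Builds maryxml > p > s > t... from three arrays of the same length. */
    static Document build(String[] words, String[] sampa, String[] pos) {
        Document doc;
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            doc = f.newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element root = doc.createElementNS(NS, "maryxml");
        root.setAttribute("version", "0.5");
        root.setAttributeNS(XML_NS, "xml:lang", "fr");
        doc.appendChild(root);
        Element p = doc.createElementNS(NS, "p");
        root.appendChild(p);
        Element s = doc.createElementNS(NS, "s");
        p.appendChild(s);
        for (int i = 0; i < words.length; i++) {
            Element t = doc.createElementNS(NS, "t");
            t.setAttribute("ph", sampa[i]);
            t.setAttribute("pos", pos[i]);
            t.setTextContent(words[i]);
            s.appendChild(t);
        }
        return doc;
    }

    /** Serializes the document so it can be compared with the example by eye. */
    static String serialize(Document doc) throws Exception {
        Transformer tr = TransformerFactory.newInstance().newTransformer();
        tr.setOutputProperty(OutputKeys.INDENT, "yes");
        StringWriter w = new StringWriter();
        tr.transform(new DOMSource(doc), new StreamResult(w));
        return w.toString();
    }

    public static void main(String[] args) throws Exception {
        String[] words = {"j'", "ai", "toujours"};
        String[] sampa = {"Z", "E", "t u - ' Z u R"};
        String[] pos = {"PPER1S", "VA1S", "ADV"};
        System.out.println(serialize(build(words, sampa, pos)));
    }
}
```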

Regarding the pos tagset, we should really have something consistent, 
but for now we don't.

Hope that helps, best,
Marc

> l' ll [DETFS]
> ouïe wwii [NFS]
> très ttrrai [ADV]
> fine ffiinn [AFS]
> à aa [PREPADE]
> cent ssan [CHIF]
> pour_cent ppourrssan [NMS]
> pause ##
> </s>  ## [ZTRM->EXCEPTION]
> <FIN>  ???? []
> </transcription>
>
> - the words (it is one string, not an array):
>
> <s>
> j'
> ai
> toujours
> l'
> ouïe
> très
> fine
> à
> cent
> pour_cent
> .
> </s>
>
> - the SAMPA phonemes (one string)
>
> _
> Z
> E
> tuZuR
> l
> wi
> tRE
> fin
> a
> sa~
> puRsa~
> _
> _
>
> - and the part of speech (one string)
>
> [ZTRM->EXCEPTION]
> [PPER1S]
> [VA1S]
> [ADV]
> [DETFS]
> [NFS]
> [ADV]
> [AFS]
> [PREPADE]
> [CHIF]
> [NMS]
> [SILENT]
> [ZTRM->EXCEPTION]
> []
>
>
> Now the next step is to create the MaryXML from this data. How do I do
> this, given that we have three strings (words, phonemes, and pos)? I see
> there's a class called TextToMaryXML; should I use it? Can you please
> tell me how?
>
> Also, is there a standard for part-of-speech tags? As you can see, the pos
> tags from lia_phon are specific to it. But I wonder whether that matters,
> since the system trains with its own tags?
>
> Best regards,
>
>
>
> Florent
>
>
>

-- 
Dr. Marc Schröder, Senior Researcher at DFKI GmbH
Project leader for DFKI in SSPNet http://sspnet.eu
Team Leader DFKI TTS Group http://mary.dfki.de
Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
Portal Editor http://emotion-research.net

Homepage: http://www.dfki.de/~schroed
Email: marc.schroeder at dfki.de
Phone: +49-681-85775-5303
Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3, D-66123 
Saarbrücken, Germany

