[mary-users] adding support for Belarusian

Nickolay V. Shmyrev nshmyrev at yandex.ru
Tue Apr 29 22:06:15 CEST 2014


Dear Alina

You can skip the Wikipedia import; that step just extracts a list of frequent words from Wikipedia. If you already have a lexicon, you can use it in the next stage.

It is possible to convert hand-written Festival G2P/LTS rules to OpenMary code and implement your own Phonemiser. However, it is probably better to use the Transcription tool and the standard data-driven LTS framework from OpenMary. You just need to create a sufficiently large dictionary and feed it into the Transcription tool, which will then learn the LTS rules from the data.

You need to create a lexicon in a simple text format. You can find an example in the Mary sources; take a look at fr.txt:

ç' s
çà 'sa functional
1er pR@-'mje
1ere
a 'a
ça 'sa
a. 'a
abaissé 'a-bE-se
abaissa 'a-bE-sa
abaissaient 'a-bE-sE
abaissait 'a-bE-sE
abaissant 'a-bE-sa~
abaisse 'a-bEs

Please also note that it contains syllable boundaries marked with '-'.
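
To make the format concrete, here is a minimal Python sketch that reads lines like the ones above. It assumes whitespace-separated fields (word, optional transcription, optional tag such as "functional"), which is only what the excerpt suggests; the names LexiconEntry, parse_lexicon_line and split_syllables are made up for illustration and are not part of Mary.

```python
from typing import List, NamedTuple, Optional


class LexiconEntry(NamedTuple):
    word: str
    transcription: Optional[str]  # None when the line has only a word
    tag: Optional[str]            # e.g. "functional", if present


def parse_lexicon_line(line: str) -> Optional[LexiconEntry]:
    """Parse one lexicon line; assumes whitespace-separated fields."""
    fields = line.split()
    if not fields:
        return None  # blank line
    word = fields[0]
    transcription = fields[1] if len(fields) > 1 else None
    tag = fields[2] if len(fields) > 2 else None
    return LexiconEntry(word, transcription, tag)


def split_syllables(transcription: str) -> List[str]:
    """Split at the '-' syllable boundaries and drop the "'" stress mark."""
    return [syl.lstrip("'") for syl in transcription.split("-")]


if __name__ == "__main__":
    for raw in ["çà 'sa functional", "1ere", "abaissé 'a-bE-se"]:
        entry = parse_lexicon_line(raw)
        print(entry)
        if entry and entry.transcription:
            print("  syllables:", split_syllables(entry.transcription))
```

A check like this is handy for validating your own Belarusian lexicon before importing it, for example to spot entries with no transcription.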

Then you import this file into the Transcription tool, and it creates LTS rules in Mary's binary format, which can be reused by other components. See also
http://mary.opendfki.de/wiki/TranscriptionTool

You can create such a text dictionary with the Festival LTS code by feeding all the words from the frequent-word list into Festival and dumping the results. A simple Scheme function can do this.
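
Once the Festival results are dumped, they still have to be rewritten in the lexicon format shown above. A rough Python sketch of that last step follows. It assumes you have already parsed each Festival entry into a list of (phones, stress) pairs per syllable, and that phones are simply concatenated as in the fr.txt excerpt; to_lexicon_line and phone_map are hypothetical names, and the phone mapping itself depends on your allophone set.

```python
from typing import Dict, Optional, Sequence, Tuple

# One syllable: (list of phone names, stress flag), as assumed from a Festival dump.
Syllable = Tuple[Sequence[str], int]


def to_lexicon_line(
    word: str,
    syllables: Sequence[Syllable],
    phone_map: Optional[Dict[str, str]] = None,
) -> str:
    """Build a 'word transcription' line with "'" for stress and '-' between syllables."""
    phone_map = phone_map or {}
    parts = []
    for phones, stress in syllables:
        # Map each Festival phone name to the target symbol (identity if unmapped).
        text = "".join(phone_map.get(p, p) for p in phones)
        parts.append(("'" if stress else "") + text)
    return f"{word} {'-'.join(parts)}"


if __name__ == "__main__":
    sylls = [(["a"], 1), (["b", "E"], 0), (["s", "a"], 0)]
    print(to_lexicon_line("abaissa", sylls))
```

If your phone names are longer than one character, check that the concatenated transcription can still be split unambiguously against your allophone list.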

Overall, there is no strict requirement to follow a guide for adding a new language; you just need to understand what is going on. Most input data must be converted to a simple text format.

Feel free to ask if you have more questions on this.

29.04.2014, 11:42, "Alina Vasileuskaya" <tts4belarusian at gmail.com>:
> Hi all,
>
> I would like to add support for Belarusian in Mary TTS. I have clean text data, an allophone list, letter-to-sound rules, a pronunciation dictionary (Janus and Festival formats), as well as recorded speech data with corresponding transcriptions (Festival format). Is it possible to import them directly into Mary to build basic NLP components without going through the Wikipedia import and the Transcription GUI? In which format are the dictionary and G2P rules supposed to be? I will be grateful for any suggestions.

