[mary-users] MaryTTS: Few questions

Ingmar Steiner ingmar.steiner at dfki.de
Tue Sep 3 13:15:43 CEST 2013


Dear Karthik,

On 9/2/13 10:23 PM, Karthik Krishnan wrote:
> Hi Ingmar,
>
> I am a programmer from Intel with a lot of (hobby) interest in speech
> technologies. I have fortunately bumped into MaryTTS and have been
> experimenting with it. First off, I would like to thank you and others
> involved in putting the toolkit/framework together. It has been
> excellent and has clearly succeeded in one of its goals (i.e. enabling
> non-speech experts to build and play with it!).

Thanks for your positive feedback! Ease of integration is one of our 
main goals, so it's encouraging to hear you managed to get MaryTTS up 
and running so quickly.

>
> I do have some questions and hopefully you can guide me. My goal is to
> build a system where anyone could train their own voice and use it
> (for example, think of people losing their voice due to age or medical
> reasons such as throat cancer, or people simply wanting to bring back
> the voices of deceased loved ones, etc.).
>
> a) In my experiments, the automatic labelling is not accurate. Are there
> any tips on making it do better labelling? For example, a friend of
> mine gave me a 15-minute sample of his voice. I ran the automatic
> labeller and then harvested phonemes from the lab files inside the lab
> directory (for example, I concatenated the individual wav segments
> corresponding to each label and simply listened to them to check their
> accuracy. Some of them sound correct, some don't). If you have any
> recommendations on optimizing parameters etc., I would really
> appreciate them.

There are numerous tools out there to perform forced alignment of audio 
and known text, and there are a number of parameters one can tweak. Your 
mileage will certainly vary, and getting perfect phonetic segmentation 
in a fully automatic way is unfortunately quite an elusive goal.
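One cheap sanity check, before listening to everything by hand, is to
scan the label files for phones with implausibly short or long
durations; gross alignment errors usually show up that way. Below is a
rough sketch in plain Java (my assumption here is that each .lab line
is whitespace-separated "start end phone" with times in seconds; adjust
the parsing to whatever format your labeller actually writes):

import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

// Rough sketch: flag labels with implausible durations.
// Assumes each .lab line is "start end phone" with times in seconds;
// adapt the parsing to your labeller's actual output format.
public class LabelSanityCheck {

    static final double MIN_DUR = 0.020; // shorter than this is suspicious
    static final double MAX_DUR = 0.500; // longer than this is suspicious

    public static void main(String[] args) throws IOException {
        Path labDir = Paths.get(args.length > 0 ? args[0] : "lab");
        try (Stream<Path> files = Files.list(labDir)) {
            files.filter(p -> p.toString().endsWith(".lab"))
                 .sorted()
                 .forEach(LabelSanityCheck::checkFile);
        }
    }

    static void checkFile(Path labFile) {
        try {
            for (String line : Files.readAllLines(labFile)) {
                String[] fields = line.trim().split("\\s+");
                if (fields.length < 3) {
                    continue; // skip headers and malformed lines
                }
                double start, end;
                try {
                    start = Double.parseDouble(fields[0]);
                    end = Double.parseDouble(fields[1]);
                } catch (NumberFormatException e) {
                    continue; // not a label line
                }
                double dur = end - start;
                if (dur < MIN_DUR || dur > MAX_DUR) {
                    System.out.printf("%s: '%s' at %.3f s lasts %.3f s%n",
                            labFile.getFileName(), fields[2], start, dur);
                }
            }
        } catch (IOException e) {
            System.err.println("Could not read " + labFile + ": " + e);
        }
    }
}

Running that over your lab directory gives you a short list of segments
worth listening to first, instead of auditing everything.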

If you feel up to the challenge, you could also try the HTKLabeler 
recently improved by Fabio and Giulio (see 
https://github.com/marytts/marytts/pull/107, we plan to merge it ASAP).

Another tool that has drawn attention recently and seems promising is 
Kaldi (http://kaldi.sourceforge.net/), but I haven't tried it out yet.

Having said that, 15 min. of speech is very, very little to work with, 
so I wouldn't expect good results...

Another concept here (which is somewhat beyond the scope of the MaryTTS 
project currently) is voice adaptation, where an existing voice is 
modified to sound like a target speaker. For this you only need a 
small amount of data, but as mentioned, I don't think it's currently 
possible in MaryTTS without some serious hacking.

>
> b) Also, are there any requirements on the frequency of the wav files?
> (I see that for "male" it automatically sets the min/max frequency.)
> Should I change those to whatever range my voice samples end up with?

You are referring to what we call the pitch range. This strongly 
affects the precision of the pitch tracking, which in turn defines the 
"datagram" packets in a unit-selection voice, the "melody" of the 
synthetic speech, and the overall quality of the output. The default 
pitch range values are really just a ballpark estimate, and should be 
adjusted to what's actually in your data.
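If you can export per-frame F0 values for your recordings (e.g. from 
Praat), a robust way to pick the range is to take something like the 
5th and 95th percentiles of the voiced frames rather than the absolute 
minimum and maximum, so a few pitch-tracking errors don't blow up the 
range. A rough sketch, assuming a hypothetical plain text file with one 
F0 value in Hz per line and 0 for unvoiced frames:

import java.io.IOException;
import java.nio.file.*;
import java.util.List;
import java.util.stream.Collectors;

// Rough sketch: estimate a speaker's pitch range from per-frame F0 values.
// Assumes a plain text file with one F0 value in Hz per line, 0 = unvoiced.
public class PitchRange {

    public static void main(String[] args) throws IOException {
        List<Double> f0 = Files.lines(Paths.get(args[0]))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .map(Double::parseDouble)
                .filter(v -> v > 0)          // keep voiced frames only
                .sorted()
                .collect(Collectors.toList());

        if (f0.isEmpty()) {
            System.err.println("No voiced frames found.");
            return;
        }

        // Robust percentiles instead of absolute min/max, so a few
        // pitch-tracking errors don't distort the estimated range.
        double lo = percentile(f0, 0.05);
        double hi = percentile(f0, 0.95);
        System.out.printf("Suggested pitch range: %.0f - %.0f Hz%n", lo, hi);
    }

    static double percentile(List<Double> sorted, double p) {
        int idx = (int) Math.round(p * (sorted.size() - 1));
        return sorted.get(idx);
    }
}

The resulting numbers are what I would plug into the min/max pitch 
settings, perhaps rounded down/up a little for headroom.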

>
> c) Are there any tips on filtering the input training wav samples to
> pick only those that will yield the best results? I doubt there are
> any automatic ways, but hey, no harm in asking the experts...

Database pruning is certainly an option to remove outliers and reduce 
the voice footprint. Unfortunately, that's still beyond the 
debugging/profiling capabilities in the MaryTTS voicebuilding tools. It 
would be great to have, though, so contributions are very welcome!
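If you want a rough first pass on the input side in the meantime, one 
simple heuristic is to flag recordings whose duration or overall level 
deviates strongly from the rest; truncated, clipped or very noisy takes 
often stand out that way, and you can then listen to just those before 
deciding what to drop. A quick sketch along these lines (my own 
throwaway helper, not part of the MaryTTS tools, assuming 16-bit PCM 
mono little-endian wav files):

import javax.sound.sampled.*;
import java.io.File;
import java.io.IOException;

// Rough sketch: flag wav files whose duration or RMS level is an outlier.
// Assumes 16-bit PCM mono little-endian wav files; not part of MaryTTS.
public class WavOutlierCheck {

    public static void main(String[] args) throws Exception {
        File wavDir = new File(args.length > 0 ? args[0] : "wav");
        File[] files = wavDir.listFiles((d, n) -> n.endsWith(".wav"));
        if (files == null || files.length == 0) {
            System.err.println("No wav files found.");
            return;
        }

        double[] dur = new double[files.length];
        double[] rms = new double[files.length];
        for (int i = 0; i < files.length; i++) {
            double[] stats = analyze(files[i]);
            dur[i] = stats[0];
            rms[i] = stats[1];
        }

        report("duration (s)", files, dur);
        report("RMS level", files, rms);
    }

    // Returns {duration in seconds, RMS of the samples normalized to [0,1]}.
    static double[] analyze(File f) throws IOException, UnsupportedAudioFileException {
        try (AudioInputStream in = AudioSystem.getAudioInputStream(f)) {
            AudioFormat fmt = in.getFormat();
            byte[] bytes = in.readAllBytes();
            long n = bytes.length / 2;          // 16-bit samples
            double sumSq = 0;
            for (int i = 0; i + 1 < bytes.length; i += 2) {
                // little-endian signed 16-bit PCM
                int s = (bytes[i] & 0xff) | (bytes[i + 1] << 8);
                sumSq += (double) s * s;
            }
            double duration = n / (double) fmt.getSampleRate();
            double rms = Math.sqrt(sumSq / Math.max(n, 1)) / 32768.0;
            return new double[] { duration, rms };
        }
    }

    // Print files whose value is more than 2 standard deviations from the mean.
    static void report(String what, File[] files, double[] values) {
        double mean = 0, sq = 0;
        for (double v : values) mean += v;
        mean /= values.length;
        for (double v : values) sq += (v - mean) * (v - mean);
        double sd = Math.sqrt(sq / values.length);
        for (int i = 0; i < values.length; i++) {
            if (sd > 0 && Math.abs(values[i] - mean) > 2 * sd) {
                System.out.printf("Check %s: %s = %.3f (mean %.3f)%n",
                        files[i].getName(), what, values[i], mean);
            }
        }
    }
}

Anything it flags is only a candidate for exclusion, of course; the 
final call should still be made by ear.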

>
> Many thanks again for putting this all together. This has been real fun!

I hope it continues to be fun! There are several things coming in the 
near future, and a new release is just around the corner... =)

Best wishes,

-Ingmar

>
> Thanks & Regards
> Karthik

-- 
/**
  * Dr. Ingmar Steiner
  *
  * Head of Independent Research Group
  * Multimodal Speech Processing
  * Cluster of Excellence MMCI
  *
  * Senior Researcher
  * Language Technology Lab
  * German Research Center for
  * Artificial Intelligence (DFKI GmbH)
  *
  * Adjunct Assistant Professor
  * Department of Computer Science
  * Saarland University
  *
  * Campus C7.4, Room 3.01
  * D-66123 Saarbrücken
  * @tel: +49-681-302-70028
  * @fax: +49-681-302-4317
  * @web: http://coli.uni-saarland.de/~steiner/
  */


