How to create DTMF applications using the new TTS languages

OK, say you're a company in Rome, Taipei, Beijing, Rio de Janeiro, Paris, Madrid, Sydney, Tokyo, or Seoul that is excited about the new TTS languages supported by Speech Server 2007 and you want to create an application. What do you need to know?

The first and most obvious point is that we currently support only DTMF applications for these languages. The reason is also obvious: we do not (yet) have speech recognizers for them. This does, of course, save you the development time you would otherwise spend on grammars. :)

However, you will still need to create prompts. After all, applications that constantly use TTS can be very hard on the ears. So you'll open the .promptdb file in your project, start recording prompts, and most likely immediately come across some problems.

The following is what happens when one records a prompt in the Recording Editing and Design Studio.

1) You enter the text of your prompt and perhaps enter the extractions as well.

2) You then click the record button and record your prompt.

3) The recording studio automatically creates alignments for the words in the prompt, which are later used for extractions. It does this by passing the .wav file and the text to the recognition engine, which recognizes that text and returns a result regardless of confidence.

4) We also receive the beginning and end phonemes for the words from the speech recognizer.

5) When you create extractions from the transcription, we use the alignment information to determine where in the .wav file each extraction begins and ends, and we also store the phonemes for the extractions. The phonemes may be used to better match extractions so they sound more fluid.
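
The lookup in step 5 can be sketched roughly as follows. This is only an illustration in Python, not the recording studio's actual code: the `WordAlignment` record, its millisecond fields, and `find_extraction_span` are hypothetical names, the timings are made up, and phonemes are left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class WordAlignment:
    """Hypothetical record: one transcription word and its position in the .wav file."""
    word: str
    start_ms: int
    end_ms: int


def find_extraction_span(
    alignments: List[WordAlignment], extraction: str
) -> Optional[Tuple[int, int]]:
    """Return (start_ms, end_ms) of the first run of aligned words matching the extraction."""
    wanted = extraction.lower().split()
    if not wanted:
        return None
    words = [a.word.lower() for a in alignments]
    for i in range(len(words) - len(wanted) + 1):
        if words[i:i + len(wanted)] == wanted:
            # Span runs from the start of the first word to the end of the last word.
            return alignments[i].start_ms, alignments[i + len(wanted) - 1].end_ms
    return None


# Illustrative (made-up) alignment data for the prompt "your balance is".
alignments = [
    WordAlignment("your", 0, 310),
    WordAlignment("balance", 310, 820),
    WordAlignment("is", 820, 1010),
]
span = find_extraction_span(alignments, "balance is")
```

If the word timings are off, every extraction cut from them is off too, which is why the quality of the alignment matters.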

You can probably see the problem: we do not have speech recognizers for these languages, so the alignments and phonemes will not be correct. The best recourse is to select the recognizer language that is closest to your target language, so that the alignments are as close as possible. Note that even if the two languages are quite different, alignment may still not be a problem, because we accept very low confidence results from the recognizer. For reference, the five languages that we do have speech recognizers for are the following.

English (United States)

English (United Kingdom)

Spanish (United States)

German (Germany)

French (Canada)

The following is how I suspect the languages best match. In other words, while you are developing a DTMF application in a given target language, select its recognizer substitute as the recognizer. Note that, since the speech recognizer is not used for DTMF applications, your choice only affects recorded prompt alignment and does not change the TTS quality of your application. Also note that these are my best guesses and they may not be correct.

Target language        Recognizer substitute
---------------------  -------------------------
Spanish (Spain)        Spanish (United States)
Italian (Italy)        Spanish (United States)
French (France)        French (Canada)
English (Australia)    English (United Kingdom)
Mandarin (Taiwan)      English (United Kingdom)
Mandarin (PRC)         English (United Kingdom)
Korean (Korea)         English (United Kingdom)
Portuguese (Brazil)    Spanish (United States)
Japanese (Japan)       English (United Kingdom)
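
If you build for several of these languages, it may help to keep the pairings in one place. The following is a small Python sketch of that idea; `pick_recognizer` and both constants are hypothetical names, not part of Speech Server. One assumption to flag: Portuguese (Brazil) is mapped to Spanish (United States) here, because the substitute has to be one of the five recognizers that actually exist.

```python
# The five languages that have speech recognizers, as listed above.
AVAILABLE_RECOGNIZERS = {
    "English (United States)",
    "English (United Kingdom)",
    "Spanish (United States)",
    "German (Germany)",
    "French (Canada)",
}

# Best-guess pairings of target language to recognizer substitute.
RECOGNIZER_SUBSTITUTE = {
    "Spanish (Spain)": "Spanish (United States)",
    "Italian (Italy)": "Spanish (United States)",
    "French (France)": "French (Canada)",
    "English (Australia)": "English (United Kingdom)",
    "Mandarin (Taiwan)": "English (United Kingdom)",
    "Mandarin (PRC)": "English (United Kingdom)",
    "Korean (Korea)": "English (United Kingdom)",
    "Portuguese (Brazil)": "Spanish (United States)",  # assumption, see lead-in
    "Japanese (Japan)": "English (United Kingdom)",
}


def pick_recognizer(target_language: str) -> str:
    """Return the recognizer language to select while recording prompts."""
    if target_language in AVAILABLE_RECOGNIZERS:
        return target_language  # a real recognizer exists, no substitute needed
    # Raises KeyError for a language that has no pairing at all.
    return RECOGNIZER_SUBSTITUTE[target_language]
```

Remember that this choice only influences prompt alignment; it has no effect on how the TTS engine speaks the target language.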

I was rather torn on which language to use for Japanese, Mandarin, and Korean. English phonemes do seem closer to these languages (in relative terms) than Spanish, German, or French phonemes, so I went with English. I picked English (United Kingdom) over English (United States) because the UK recognizer seems better able to bend toward other languages than the US version. In truth, the best approach is to experiment yourself. In the future, we will have speech recognizers for several of these languages and this problem will be solved (though of course you will have new problems).

It is also best not to depend on the phoneme matching feature of the prompt engine for these languages, since the stored phonemes come from a recognizer for a different language. Whenever a conflict between extractions is possible, use Ids to specify the extraction you wish to use.
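
To see why Ids help, here is a toy model in Python. It does not use the prompt engine's API: the `Extraction` class, its fields, and both lookup functions are hypothetical, and the data is made up. It only shows that a lookup by text can leave more than one candidate, while a lookup by Id cannot.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Extraction:
    """Hypothetical stand-in for one extraction stored in a prompt database."""
    extraction_id: str        # the Id you assign to the extraction
    text: str                 # transcription text covered by the extraction
    span_ms: Tuple[int, int]  # where the extraction sits in its .wav file


def find_by_text(extractions: List[Extraction], text: str) -> List[Extraction]:
    """Text lookup: may return several candidates if the same words were recorded twice."""
    return [e for e in extractions if e.text == text]


def find_by_id(extractions: List[Extraction], extraction_id: str) -> Extraction:
    """Id lookup: exactly one extraction, or an error if the Id is unknown or reused."""
    matches = [e for e in extractions if e.extraction_id == extraction_id]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one extraction with Id {extraction_id!r}")
    return matches[0]


# Two recordings of the same word, e.g. mid-sentence and sentence-final intonation.
extractions = [
    Extraction("uno_mid", "uno", (120, 480)),
    Extraction("uno_final", "uno", (900, 1350)),
]
candidates = find_by_text(extractions, "uno")  # two candidates: something must break the tie
chosen = find_by_id(extractions, "uno_final")  # one result: no tie to break
```

With a real recognizer, phoneme matching can break that tie sensibly. With a substitute recognizer, it is safer to break it yourself.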

Finally, when you enter the text for a transcription in the recording studio in a character-based language, be aware that the prompt editor does not support word spacing for you. The best workaround is simply to place spaces between the words yourself. Note that this applies only to recorded prompts; the TTS engine knows how to separate words in text that is rendered entirely in TTS.
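
If your prompt text already exists as a list of words (from a translator, say), adding the spaces can be scripted. This is a minimal Python sketch, assuming you supply the word boundaries yourself; `build_transcription` is a hypothetical helper and does not segment text on its own.

```python
from typing import List


def build_transcription(words: List[str]) -> str:
    """Join pre-segmented words with single spaces for entry in the prompt editor."""
    cleaned = [w.strip() for w in words if w.strip()]  # drop empty or blank entries
    return " ".join(cleaned)


# Example: "您的余额是" ("your balance is"), split into words by hand.
words = ["您的", "余额", "是"]
transcription = build_transcription(words)
```

Removing the spaces from the result should give back the original unspaced sentence, which is a quick way to check that no characters were lost.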