Strategies for creating prompt databases for multilingual applications

Recently a customer contacted me with a question about creating multilingual applications with Speech Server.  He had an application that ran in English and US Spanish and noticed that, while changing the synthesizer would switch TTS between languages, it had no effect on the prompt database used.

The workaround he used was to place the English and the Spanish prompts in separate prompt databases within the same prompt project.  He quickly realized that this would not work if a word was spelled identically in Spanish and English, so he added a tag to each prompt.  Then, when playing the prompts, he would specify the tag in the PEML, thus guaranteeing that he would receive an English or a Spanish prompt as intended.
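To see why the tag matters, here is a toy model in Python.  It is not how the prompt engine stores or matches recordings; the file names, tag values and helper functions are all hypothetical.  It only shows that a lookup keyed on the transcription alone cannot tell apart a word such as "no", which is spelled the same in both languages, while a lookup keyed on transcription plus tag can.

```python
# Toy model only: a prompt database as a mapping from lookup key to a recording.
# File names and tag values are hypothetical.

def build_untagged(prompts):
    """Key by transcription alone; a later entry overwrites an earlier one."""
    return {text: wav for text, _tag, wav in prompts}


def build_tagged(prompts):
    """Key by (transcription, tag) so identically spelled words stay distinct."""
    return {(text, tag): wav for text, tag, wav in prompts}


PROMPTS = [
    ("no", "en", "no_english.wav"),
    ("no", "es", "no_spanish.wav"),
    ("thank you", "en", "thank_you.wav"),
]

untagged = build_untagged(PROMPTS)  # the two "no" entries collide
tagged = build_tagged(PROMPTS)      # all three entries survive
```

In this toy model the second "no" simply overwrites the first.  What the real engine does with duplicate transcriptions is not covered here; the point is only that without a tag the request is ambiguous.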

In general this method worked for the customer and continues to work, but only because he uses very few extractions.  Even prompts that are very similar have separate recordings in order to preserve the flow in the prompts.

The one drawback with this approach is that the language is set per prompt project - not per database.  Therefore both the Spanish and the English prompt databases in the prompt project will be compiled into a single .prompts file - which will be in English.  So what's the difference between an English .prompts file and a Spanish .prompts file?

The primary differences are in extractions and alignments.  In order to piece extractions together, the prompt engine makes use of the beginning and end phonemes of the extractions.  When piecing together a recorded prompt and TTS, the phonemes from the TTS engine are also used.  Therefore, if you place a Spanish prompt database in an English .prompts file, the prompt engine may be faced with the task of matching English and Spanish phonemes!

In truth, this is unlikely to occur, as you would only see an issue in cases where you have multiple extractions that are similar.  If you have many extractions, however, you may find that the matching is less than optimal.

The other issue comes in the form of alignments.  When a new audio file is imported into the recording studio, or you record directly into the recording studio, we call the appropriate recognition engine to determine where the word boundaries are.  Different languages have very different word boundaries (French for instance), so if the wrong language's engine is used you may find that your word alignments are off.  Again, this only comes into play when using extractions, but it will likely reduce the quality of your extractions, as audio that belongs in an extraction may be left out and audio that does not may be included.

It is quite easy to change the prompt database at runtime by setting the PromptDatabase property of the Workflow.  Therefore I recommend that you create a prompt project for each language and set the .prompts file at runtime.  This will likely require less effort than assigning a tag to each transcription, and it may prevent difficult-to-debug issues down the road.
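The selection logic for this recommendation can be sketched as follows.  This is a language-agnostic illustration in Python, not Speech Server code: the file paths, language codes and the select_prompt_database helper are assumptions made up for the example.  In a real application, the value it returns is what you would assign to the Workflow's PromptDatabase property.

```python
# Hypothetical mapping from language code to the compiled .prompts file
# produced by that language's own prompt project.
PROMPT_FILES = {
    "en-US": "prompts/english.prompts",
    "es-US": "prompts/spanish.prompts",
}

DEFAULT_LANGUAGE = "en-US"


def select_prompt_database(language):
    """Return the .prompts file for the caller's language.

    Falls back to the default language when no prompt project
    exists for the requested one.
    """
    return PROMPT_FILES.get(language, PROMPT_FILES[DEFAULT_LANGUAGE])
```

Keeping one entry per language means each .prompts file is compiled with the correct language setting, so extractions and alignments are handled with the matching phonemes and recognition engine.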