Yesterday I discussed how to create extractions from transcriptions in order to reduce the amount of recording necessary. We ended with two transcriptions to replace the four original ones.
1) [You have purchased] [a clock]
2) [Would you like to purchase] [a radio?]
As mentioned, there are some issues with these transcriptions. The first is that the second transcription is a question while the first is a statement, which changes the intonation of both extractions within each transcription. Unfortunately there are no "tricks" you can perform on the wave file; you must record separate transcriptions for the question and the statement. Our transcriptions now look like this:
1) [You have purchased] [a clock]
2) You have purchased [a radio]
3) [Would you like to purchase] [a radio?]
4) Would you like to purchase [a clock?]
Now this doesn't look very helpful - we're back at the original four transcriptions! We can of course cut the recording time down a bit by removing the unbracketed phrases from transcriptions 2 and 4. In this case we cannot reduce the number of transcriptions much further, but if your application has multiple statement or question phrases you can still cut the transcription count considerably. Say we had the following two phrases:
1) [Would you like to purchase] [a clock]?
2) [Would you like to return] [a radio]?
In this case we only need to create two transcriptions, right? The answer is yes, but these are not exactly the transcriptions we will need. Say both of these phrases aloud. Notice how the "s" and "n" sounds linger enough to affect the "a" sound that follows? This occurs because the phonemes (or, in non-linguist terminology, the sounds) at the end of "purchase" and "return" are both soft. When these extractions are placed together to make "Would you like to purchase a radio?" it will sound disjointed. There are several ways we can fix this.
The easiest way is to replace the soft phoneme with a hard one - one with a more definite ending. The "k" sound at the end of "Patrick" is a good example. The following transcriptions and extractions will sound more natural:
1) [Would you like to purchase] Patrick [a clock]?
2) [Would you like to return] Patrick [a radio]?
By placing the word "Patrick" between the extractions we replace the soft stop with a hard one. This will sound much more natural, but we can go a step further if we want. Imagine we have the following transcriptions:
1) Would you like to purchase a clock?
2) Would you like to order a radio?
3) Would you like to pay for a watch?
So far we have seen that we can record these as nine separate transcriptions, or as the three above using the "Patrick" trick at a small sacrifice in quality. Using six transcriptions, we can achieve almost the same quality as with nine:
1) [Would you like to purchase] [a clock?]
2) Would you like to purchase [a radio?]
3) Would you like to purchase [a watch?]
4) [Would you like to order] [a clock?]
5) [Would you like to order] [a radio?]
6) [Would you like to pay for] [a watch?]
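To see the savings in numbers: recording every full combination takes nine transcriptions, while grouping the carrier phrases by their final phoneme brings it down to six. Here is a minimal sketch of that count - the phoneme labels are my own rough annotations for illustration, not output from any tool:

```python
# (carrier phrase, rough label for its final phoneme)
carriers = [
    ("Would you like to purchase", "s"),
    ("Would you like to order", "r"),
    ("Would you like to pay for", "r"),
]
items = ["a clock?", "a radio?", "a watch?"]

# Recording every full combination:
full = len(carriers) * len(items)  # 3 * 3 = 9

# Sharing item extractions between carriers that end in the same
# phoneme: each phoneme group needs every item recorded once, and the
# carriers themselves are captured inside those same recordings (this
# holds as long as no group has more carriers than items).
distinct_phonemes = {phoneme for _, phoneme in carriers}
shared = len(distinct_phonemes) * len(items)  # 2 * 3 = 6

print(full, shared)
```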
Why does this work? It works because the final phoneme of "Would you like to order" and "Would you like to pay for" is the same. Because the end phoneme is the same, we do not need to record an extra three transcriptions. But how will we know which extraction to use at run time? The answer lies in a feature of the Prompt Engine in Microsoft Speech Server: when searching for a match for a transcription, the Prompt Engine takes into account the phoneme immediately before and after the extraction. You can view these phonemes (and edit them if you wish) using the Prompt Editor in the Speech SDK.
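The Prompt Engine's internals are not public, but the idea of phoneme-aware matching can be sketched in a few lines. Everything below - the class, the phoneme labels, the file names - is invented for illustration, and I've simplified to the left-hand phoneme only:

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    text: str           # the words covered by the extraction
    left_phoneme: str   # phoneme immediately before it in the recording
    wav: str            # placeholder name for the source recording

def pick_extraction(candidates, text, left_phoneme):
    """Prefer a recording whose neighboring phoneme matches the context
    it will be spliced into, so the join sounds natural."""
    matches = [c for c in candidates
               if c.text == text and c.left_phoneme == left_phoneme]
    return matches[0] if matches else None

db = [
    Extraction("a radio", left_phoneme="s", wav="purchase_a_radio.wav"),
    Extraction("a radio", left_phoneme="r", wav="order_a_radio.wav"),
]

# Splicing after "Would you like to order", which ends in an "r" sound:
best = pick_extraction(db, "a radio", left_phoneme="r")
print(best.wav)
```

The point of the sketch is only that the same words can exist in the database more than once, distinguished by their surrounding phonemes, and the lookup picks the copy recorded in a matching context.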
The truth is, for many applications the "Patrick" method will work well and require fewer recordings. However, if you want to go the extra mile for quality prompts, the second method still lets you get by without recording every combination.