The Power of Speech in Your Windows Phone 8.1 Apps - Text-to-Speech

If you’ve read the first post in the series, you are well versed in the ways of using voice commands to launch your Windows Phone app. If you completed the mini-challenges that I provided, then you’re well on your way to becoming an expert.

Once a user launches an app using Voice Commands, they may expect, or at least appreciate, voice interaction that continues throughout the user experience. In this post, we will look at how to provide voice readouts for information within the app.

Text-to-speech can be implemented using a simple string, or an XML-based markup language known as Speech Synthesis Markup Language (SSML).

The Main Parts

Implementing text-to-speech, or speech synthesis, in your Windows Phone app involves the following three main components:

  1. SpeechSynthesizer – used to access the installed voices on the device, configure the speaking voice that will be used within the app, and generate the speech stream from text, as a SpeechSynthesisStream.
  2. SpeechSynthesisStream – stream that contains the speech that was generated from text.
  3. MediaElement – framework element that can play back audio or video. For text-to-speech, we use the MediaElement to play back the speech output generated by the SpeechSynthesizer. Note that in Windows Phone 8.0 a MediaElement wasn’t needed for speech readouts, because the synthesizer played the audio itself. In Windows Phone 8.1 the synthesizer only produces a stream, so the app has to play it.
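
As a rough sketch of the first component, the snippet below looks through the installed voices and picks one for the synthesizer. It relies on the static SpeechSynthesizer.AllVoices collection and the Voice property from Windows.Media.SpeechSynthesis. The VoicePicker class, the CreateSynthesizer method and the preference for a female en-US voice are illustrative assumptions, not part of the original sample.

```csharp
using System.Linq;
using Windows.Media.SpeechSynthesis;

public static class VoicePicker
{
    // Hypothetical helper: returns a synthesizer that prefers a female
    // en-US voice and otherwise keeps the device's default voice.
    public static SpeechSynthesizer CreateSynthesizer()
    {
        SpeechSynthesizer synthesizer = new SpeechSynthesizer();

        // AllVoices lists every voice installed on the device.
        VoiceInformation preferred = SpeechSynthesizer.AllVoices
            .FirstOrDefault(v => v.Language == "en-US" &&
                                 v.Gender == VoiceGender.Female);

        if (preferred != null)
        {
            synthesizer.Voice = preferred;
        }

        return synthesizer;
    }
}
```

If no matching voice is installed, the synthesizer simply keeps the voice the user has set on the device.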

Putting It Together

The simplest approach to implementing text-to-speech in your app is to use the default speaking voice that is set on the device and pass in a string of text to be read aloud. This takes a few lines of code in the code-behind file, and an additional line in the page markup, as shown below.

XAML

 <MediaElement x:Name="SpeechMedia" AutoPlay="False" />
  

Code Behind

using System.Threading.Tasks;
using Windows.Media.SpeechSynthesis;
…
public async Task SpeakAsync(string textToSpeech)
{
    // Generate the speech audio using the device's default voice.
    SpeechSynthesizer synthesizer = new SpeechSynthesizer();
    SpeechSynthesisStream synthesisStream =
        await synthesizer.SynthesizeTextToStreamAsync(textToSpeech);

    if (synthesisStream != null)
    {
        // Hand the stream to the MediaElement declared in the page markup.
        this.SpeechMedia.AutoPlay = true;
        this.SpeechMedia.SetSource(synthesisStream, synthesisStream.ContentType);
        this.SpeechMedia.Play();
    }
}
  

Finally, don’t forget to include the Microphone capability in your application manifest.

With this in place, you simply need to decide when to invoke the SpeakAsync method, passing in the string you wish to have read out to the user. How you choose to invoke the speech readout is completely up to you: on page navigation, on a button tap, and so on.

However, if you are planning to invoke text-to-speech automatically, without the user explicitly tapping a button, be sure this is only done if the user actually invoked the app using voice commands to begin with. If you’re not sure how to do this, revisit my first post on Voice Commands which explains how to determine if the app was launched by voice.
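
For reference, here is a minimal sketch of such a check. It assumes the app is activated with ActivationKind.VoiceCommand and that the recognition result carries a "commandMode" entry whose value is "voice" when the command was spoken. That key and the VoiceLaunch helper are assumptions made for illustration, so verify them against the first post and the platform documentation before relying on them.

```csharp
using System.Collections.Generic;
using System.Linq;
using Windows.ApplicationModel.Activation;
using Windows.Media.SpeechRecognition;

public static class VoiceLaunch
{
    // Hypothetical helper: returns true only when the activation came from
    // a voice command whose "commandMode" entry (an assumption) is "voice".
    public static bool WasSpoken(IActivatedEventArgs args)
    {
        if (args.Kind != ActivationKind.VoiceCommand)
        {
            return false;
        }

        SpeechRecognitionResult result =
            ((VoiceCommandActivatedEventArgs)args).Result;

        // TryGetValue keeps the check safe if the key is missing.
        IReadOnlyList<string> modes;
        return result.SemanticInterpretation.Properties
                   .TryGetValue("commandMode", out modes)
               && modes.FirstOrDefault() == "voice";
    }
}
```

One way to use it is to call VoiceLaunch.WasSpoken from the app's OnActivated override, keep the result, and only call SpeakAsync automatically when it is true.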

Why SSML?

There will likely be scenarios where you would prefer to have more control over the speech output, such as configuring voice, language, pronunciation, volume, rate, emphasis on words or phrases, and even inserting audio files in the speech playback. This is where SSML comes into play.

For example, if I developed an app that was a story book reader for children, I would want the speaking voice to change as different characters in the story were talking. Using SSML markup, I could easily accomplish this as shown below:

public async Task SpeakSsml()
{
    // Each <voice> element requests a different speaker for part of the story.
    string ssml =
        @"<speak version='1.0'
                 xmlns='http://www.w3.org/2001/10/synthesis'
                 xml:lang='en-US'>
            <voice gender='female' age='30'>
                The big bad wolf arrived at the little piggy’s home, knocked on the door and said
            </voice>
            <voice gender='male' age='65'>
                <prosody rate='slow'>
                    Little Pig, Little Pig, let me in.
                </prosody>
            </voice>
            <voice gender='male' age='10'>
                <prosody pitch='high' rate='-20%' volume='100'>
                    Not by the hair of my chinny chin <emphasis level='strong'>chin</emphasis>
                </prosody>
            </voice>
        </speak>";

    // Synthesize from SSML instead of plain text.
    SpeechSynthesizer synthesizer = new SpeechSynthesizer();
    SpeechSynthesisStream synthesisStream =
        await synthesizer.SynthesizeSsmlToStreamAsync(ssml);

    if (synthesisStream != null)
    {
        this.SpeechMedia.AutoPlay = true;
        this.SpeechMedia.SetSource(synthesisStream, synthesisStream.ContentType);
        this.SpeechMedia.Play();
    }
}
  

Notice that the implementation is identical to the code listed in the previous section, except that it calls the SynthesizeSsmlToStreamAsync method on the SpeechSynthesizer instance, passing in the SSML string.

This is just a basic example of what can be done using SSML markup and speech synthesis within your apps. To read about all of the available elements that can be used in your SSML string, check out the Speech Synthesis Markup Language Reference on MSDN.
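
When the text to be read comes from data rather than a hard-coded string, characters such as & or < would break hand-written markup. The sketch below shows one possible way to build a small SSML document with System.Xml.Linq so the text is escaped automatically. The SsmlBuilder class and its Build method are hypothetical names used only for this illustration.

```csharp
using System.Xml.Linq;

public static class SsmlBuilder
{
    // Namespace that SSML 1.0 documents are expected to declare.
    private static readonly XNamespace Ssml =
        "http://www.w3.org/2001/10/synthesis";

    // Hypothetical helper: wraps plain text in a <speak> root
    // containing a single <voice> element.
    public static string Build(string text, string gender, string language)
    {
        XElement speak = new XElement(Ssml + "speak",
            new XAttribute("version", "1.0"),
            new XAttribute(XNamespace.Xml + "lang", language),
            new XElement(Ssml + "voice",
                new XAttribute("gender", gender),
                text)); // XElement escapes special characters in the text

        return speak.ToString(SaveOptions.DisableFormatting);
    }
}
```

The returned string can then be passed to SynthesizeSsmlToStreamAsync in the same way as the hard-coded markup above. For example, SsmlBuilder.Build("Tom & Jerry", "female", "en-US") produces markup in which the ampersand is written as &amp;amp;.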

Next Steps

In this post, we covered how to enable your app to talk to the user through the Speech API in Windows Phone, with only a few lines of code. In the next post, we will discuss how to include speech recognition so that your user can talk back to your app, providing a truly engaging user experience.

Until then, take a few moments to include text-to-speech in the Windows Phone app you have been working on since the last post.