Speech 101 - Getting the Computer to Recognize “Hello World”

I recently started working on the Speech at Microsoft core runtime team. As part of the ramp-up process, I needed to write a simple C++ app using the Speech APIs (SAPI). Below you’ll find a short walkthrough of what it takes to write a native app that analyzes input wave files. This sample can be the basis for a simple speech recognition feature in your projects.

Sound Check

What you’ll need for this example is a single wave recording of your voice saying “Hello World”. I’ve made all of the samples available here. The two main things you’ll need are “helloworld.wav” and the corresponding grammar file, “helloworld.xml”. The source is also included as a reference.

In addition, you’ll need the Speech Development APIs which can be found in the Windows SDK. The SDK includes both the managed and native APIs.

Crash Course in Speech Recognition

Before I jump into the example, let’s look at what happens during speech recognition at a very basic level:

image

The SAPI recognition engine takes a loaded WAV file and, with the help of a custom grammar file, determines whether the input contains recognizable speech. If the engine detects a phrase within the WAV, it fires an event that carries the recognized text. Developers guide the SAPI recognition engine by supplying a custom grammar file. To learn more about grammars in complete detail, see W3C’s Speech Recognition Grammar Specification (SRGS).
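The contents of the grammar file are not shown in this post, so here is a minimal sketch of what a W3C SRGS grammar for a single phrase could look like, built as a string in C++ so it can be written out next to the WAV file. The helper name BuildPhraseGrammar, the rule id "greeting" and the "en-US" language tag are assumptions made for illustration; the helloworld.xml that ships with the samples may well look different.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of SAPI): builds a minimal W3C SRGS XML
// grammar whose single public rule matches exactly one phrase.
// Assumes `phrase` contains no XML special characters (<, >, &).
std::string BuildPhraseGrammar(const std::string& phrase)
{
    assert(phrase.find_first_of("<>&") == std::string::npos);

    std::string xml;
    xml += "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    xml += "<grammar version=\"1.0\" xml:lang=\"en-US\" root=\"greeting\"\n";
    xml += "         xmlns=\"http://www.w3.org/2001/06/grammar\">\n";
    xml += "  <rule id=\"greeting\" scope=\"public\">\n";
    xml += "    <item>" + phrase + "</item>\n";
    xml += "  </rule>\n";
    xml += "</grammar>\n";
    return xml;
}
```

Calling BuildPhraseGrammar("hello world") and saving the result to a file gives a grammar with one rule, which is a reasonable starting point before adding more rules.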

Hello Computer, Hello World

Once the Windows SDK has been installed, launch Visual Studio and create a C++ Win32 console application project. Copy the two helper files to the project root directory, add them to the project and set the property to “Copy on build”.

First code part - enable COM in the project’s stdafx.h:

#define _ATL_APARTMENT_THREADED
#include <atlbase.h>

extern CComModule _Module;
#include <atlcom.h>
  

Next, in the main cpp file, include sapi.h and sphelper.h and declare some basic speech objects:

#include <sapi.h>
#include <sphelper.h>

int _tmain(int argc, _TCHAR* argv[])
{
     //wide string literals are already null-terminated LPCWSTRs
     LPCWSTR MY_WAVE_AUDIO_FILENAME = L"helloworld.wav";
     LPCWSTR TEST_CFG = L"helloworld.xml";

     //result of each COM/SAPI call below
     HRESULT hr = S_OK;

     CComPtr<ISpStream> cpInputStream;
     CComPtr<ISpRecognizer> cpRecognizer;
     CComPtr<ISpRecoContext> cpRecoContext;
     CComPtr<ISpRecoGrammar> cpRecoGrammar;
     CComPtr<ISpRecoResult> cpResult;

If you haven’t already, call CoInitialize(NULL):

if(FAILED(::CoInitialize(NULL)))
{
     printf("CoInitialize failed... somehow...");
     return FALSE;
}

Now, it is time to initialize the input stream and bind it to a file:

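     CSpStreamFormat sInputFormat;

     //create the stream object before calling any methods on it
     hr = cpInputStream.CoCreateInstance(CLSID_SpStream);
     if(FAILED(hr))
     {
         printf("creating the input stream failed");
         return FALSE;
     }
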
     hr = sInputFormat.AssignFormat(SPSF_22kHz16BitStereo);
     if(FAILED(hr))
     {
         printf("input formatting failed");
         return FALSE;
     }

     //bind input stream to a file
    hr = cpInputStream->BindToFile(MY_WAVE_AUDIO_FILENAME,
                                     SPFM_OPEN_READONLY,
                                     &sInputFormat.FormatId(),
                                     sInputFormat.WaveFormatExPtr(),
                                     SPFEI_ALL_EVENTS);
     //hr check
    if(FAILED(hr))
     {
         printf("binding to file failed. check file value, etc... ");
         return FALSE;
     }
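Before pointing the recognizer at a recording, it can help to confirm what format the file actually has. The sketch below is a standalone check, independent of SAPI, that reads the "fmt " chunk of a RIFF/WAVE file already loaded into memory. The names WavFormat, ReadWavFormat and Is22kHz16BitStereo are hypothetical helpers, and the code assumes plain PCM data with at least the 16-byte basic format block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper type (not part of SAPI).
struct WavFormat
{
    std::uint16_t channels = 0;
    std::uint32_t sampleRate = 0;
    std::uint16_t bitsPerSample = 0;
};

// WAV headers store integers in little-endian byte order.
static std::uint16_t ReadU16(const unsigned char* p)
{
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

static std::uint32_t ReadU32(const unsigned char* p)
{
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

// Walks the RIFF chunks looking for "fmt ". Returns false if the data is
// not a RIFF/WAVE file or the format chunk is missing or too short.
bool ReadWavFormat(const std::vector<unsigned char>& bytes, WavFormat& out)
{
    if (bytes.size() < 12 ||
        std::memcmp(bytes.data(), "RIFF", 4) != 0 ||
        std::memcmp(bytes.data() + 8, "WAVE", 4) != 0)
    {
        return false;
    }

    std::size_t pos = 12;
    while (pos + 8 <= bytes.size())
    {
        const unsigned char* chunk = bytes.data() + pos;
        const std::uint32_t size = ReadU32(chunk + 4);
        if (std::memcmp(chunk, "fmt ", 4) == 0)
        {
            if (size < 16 || pos + 8 + 16 > bytes.size())
                return false;
            out.channels = ReadU16(chunk + 10);
            out.sampleRate = ReadU32(chunk + 12);
            out.bitsPerSample = ReadU16(chunk + 22);
            return true;
        }
        // Skip this chunk; chunk data is padded to an even length.
        pos += 8 + static_cast<std::size_t>(size) + (size & 1);
    }
    return false;
}

// True when the format matches 22.05 kHz, 16-bit, two-channel audio.
bool Is22kHz16BitStereo(const WavFormat& f)
{
    return f.sampleRate == 22050 && f.bitsPerSample == 16 && f.channels == 2;
}
```

If the check reports a different format, picking the matching SPSF_* constant (or re-recording the sample) is an easy thing to try first.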

Once the input stream is set, initialize the Recognizer and configure its input:

     //Initialize the recognition Engine
     hr = cpRecognizer.CoCreateInstance(CLSID_SpInprocRecognizer);
     if(FAILED(hr))
     {
         printf("Creating the recognizer failed. ");
         return FALSE;
     }

     //hook up wav input to the recognizer
    hr = cpRecognizer->SetInput(cpInputStream, TRUE);

     //hr check
    if(FAILED(hr))
     {
         printf("Linking WAV to recognizer failed. ");
         return FALSE;
     }

Finish up configuring the SR objects by creating the grammar, setting interest level and wiring up the SR Win32 event:

    //Create recognition context
    hr = cpRecognizer->CreateRecoContext(&cpRecoContext);

     hr = cpRecoContext->CreateGrammar(NULL, &cpRecoGrammar);
     hr = cpRecoGrammar->LoadDictation(NULL, SPLO_STATIC);

     //Check for things that are recognized and the end of the stream
    hr = cpRecoContext->SetInterest(SPFEI_ALL_SR_EVENTS | SPFEI(SPEI_END_SR_STREAM),
                                     SPFEI_ALL_SR_EVENTS | SPFEI(SPEI_END_SR_STREAM));

     //hook up the Win32 event
    hr = cpRecoContext->SetNotifyWin32Event();
     //hr check
    if(FAILED(hr))
     {
         printf("Wiring up win32 Event failed");
         return FALSE;
     }

     //Activate Dictation
    hr = cpRecoGrammar->SetDictationState(SPRS_ACTIVE);

     bool fEndStreamReached = FALSE;
     //set when a recognition event arrives
     bool fFoundSomething = FALSE;

 

Specifically, the interest level in this example is set to “All SR Events”. This means that an SR event will fire for everything that the recognizer thinks it can parse accurately.
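One way to picture an interest level is as a 64-bit mask with one bit per event type, which is why the values passed to SetInterest are combined with the | operator. The sketch below models that idea in standalone C++. The names DemoEvent, InterestBit and IsInterested, and the ordinal values, are made up for illustration and are not the definitions from sapi.h.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative event ordinals only; NOT the real values from sapi.h.
enum DemoEvent : unsigned
{
    kDemoEndStream   = 0,
    kDemoSoundStart  = 1,
    kDemoRecognition = 2
};

// Maps an event ordinal to its single bit in a 64-bit interest mask.
inline std::uint64_t InterestBit(DemoEvent e)
{
    assert(e < 64);
    return std::uint64_t{1} << e;
}

// True when the mask has the bit for event `e` set.
inline bool IsInterested(std::uint64_t mask, DemoEvent e)
{
    return (mask & InterestBit(e)) != 0;
}
```

With this model, a mask built as InterestBit(kDemoRecognition) | InterestBit(kDemoEndStream) reports interest in those two events and nothing else.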

Here’s the basic loop for handling the SR Event logic:

    //Do stuff, like analyze the wave or something...
    while(!fEndStreamReached && S_OK == cpRecoContext->WaitForNotifyEvent(10000))
     {
         CSpEvent spEvent;

         //extract queued events from reco context's event queue!
        while(!fEndStreamReached && S_OK == spEvent.GetFrom(cpRecoContext))
         {
             //Figure out the event type
            switch (spEvent.eEventId)
             {
                 //recognized something
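                 case SPEI_RECOGNITION:
                 {
                     //DO SOMETHING!
                     fFoundSomething = TRUE;
                     LPWSTR buffer = NULL;

                     cpResult = spEvent.RecoResult();
                     if(SUCCEEDED(cpResult->GetText(SP_GETWHOLEPHRASE, SP_GETWHOLEPHRASE, TRUE, &buffer, NULL)))
                     {
                         wprintf(L"Recognized: %s\n", buffer);
                         //GetText allocates the string; the caller must free it
                         ::CoTaskMemFree(buffer);
                     }

                     break;
                 }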
                 case SPEI_END_SR_STREAM:
                     fEndStreamReached = TRUE;
                     break;
             }

             //Clear the event
            spEvent.Clear();
         }
     }

If everything went well, the buffer in the loop will contain “Hello World” when the recognition event is fired. To double check, set a breakpoint on “fFoundSomething” and step through from that point to see if it was recognized correctly.
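The control flow of that loop (drain the queue, collect recognitions, stop at end of stream) can be tried out without SAPI at all. Below is a small platform-independent model of it; EventId, Event and DrainEvents are hypothetical stand-ins for the SAPI event types, not real SAPI names.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Hypothetical stand-ins for SAPI's event ids and events.
enum class EventId { SoundStart, Recognition, EndStream };

struct Event
{
    EventId id;
    std::string text; // recognized phrase, only meaningful for Recognition
};

// Drains queued events in order, collecting recognized phrases and
// stopping as soon as the end-of-stream event is seen. Events queued
// after the end-of-stream event are left in the queue.
std::vector<std::string> DrainEvents(std::deque<Event>& queue)
{
    std::vector<std::string> phrases;
    bool endReached = false;

    while (!endReached && !queue.empty())
    {
        Event e = queue.front();
        queue.pop_front();

        switch (e.id)
        {
        case EventId::Recognition:
            phrases.push_back(e.text);
            break;
        case EventId::EndStream:
            endReached = true;
            break;
        default:
            break; // ignore events this model does not handle
        }
    }
    return phrases;
}
```

Feeding it a queue that holds a sound-start event, a recognition of “Hello World” and an end-of-stream event yields a single phrase, which mirrors what the real loop should see for helloworld.wav.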

To do things right, here’s the clean up code:

    //CLEAN UP TIME
    hr = cpRecoGrammar->SetDictationState(SPRS_INACTIVE);
     hr = cpRecoGrammar->UnloadDictation();
     hr = cpInputStream->Close();

     return TRUE;

SAPI Experiments To Try

There are a bunch of things to try out with this example to get a better understanding of how the SR system works. Here’s a quick list:

  1. Record your own voice saying “Hello World” and see if the app correctly recognizes you.
  2. Add more grammar and recognition rules to expand what the engine can find.
  3. Comment out the grammar section and see the engine’s best guesses at different WAV inputs.

Resources and Thanks!

You can find the MSDN example, which most of this blog post is based on, here.

Special thanks go out to my new test colleagues, who have helped me learn and guided me through this example!