Grabbing the output of the Microsoft Speech API text-to-speech engine as audio data

A while ago I wrote a post, Implementing a “say” command using ISpVoice from the Microsoft Speech API, which showed how to use the Speech API to do text-to-speech but was limited to playing the generated audio on the default audio device.

Recently on the Windows Pro Audio forums, user falven asked a question about how to grab the output of the text-to-speech engine as a stream for further processing.

Here’s how to do it.

The key part is to use ISpStream::BindToFile to save the audio data to a .wav file, or ISpStream::SetBaseStream to save it to a given IStream. Then call ISpVoice::SetOutput with the ISpStream before calling ISpVoice::Speak.

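            // Create the SAPI stream object that will receive the synthesized audio.
            ISpStream *pSpStream = nullptr;
            hr = CoCreateInstance(
                CLSID_SpStream, nullptr, CLSCTX_ALL,
                __uuidof(ISpStream),
                reinterpret_cast<void **>(&pSpStream)
            );
            if (FAILED(hr)) {
                ERR(L"CoCreateInstance(ISpStream) failed: hr = 0x%08x", hr);
                return -__LINE__;
            }
            ReleaseOnExit rSpStream(pSpStream);

            // The arguments to BindToFile and SetBaseStream are missing from this
            // listing. file, formatId, and pWfx are placeholder names for the output
            // path, the SAPI format GUID, and the WAVEFORMATEX; SPFM_CREATE_ALWAYS is
            // likewise an assumed file mode.
            if (File == where) {
                // file: bind the stream to a .wav file
                hr = pSpStream->BindToFile(
                    file, SPFM_CREATE_ALWAYS, &formatId, pWfx, 0
                );
                if (FAILED(hr)) {
                    ERR(L"ISpStream::BindToFile failed: hr = 0x%08x", hr);
                    return -__LINE__;
                }
            } else {
                // stream: back the ISpStream with an in-memory IStream
                pStream = SHCreateMemStream(NULL, 0);
                if (nullptr == pStream) {
                    ERR(L"SHCreateMemStream failed");
                    return -__LINE__;
                }
                hr = pSpStream->SetBaseStream(pStream, formatId, pWfx);
                if (FAILED(hr)) {
                    ERR(L"ISpStream::SetBaseStream failed: hr = 0x%08x", hr);
                    return -__LINE__;
                }
            }

            // Route the voice's output to the stream; TRUE allows format changes.
            hr = pSpVoice->SetOutput(pSpStream, TRUE);
            if (FAILED(hr)) {
                ERR(L"ISpVoice::SetOutput failed: hr = 0x%08x", hr);
                return -__LINE__;
            }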

Updated source and binaries attached.


say "phrase" [--file <filename> | --stream]
runs phrase through text-to-speech engine
if --file is specified, writes to .wav file
if --stream is specified, captures to a stream
if neither is specified, plays to default output
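
The usage rules above can be sketched as a small argument parser. This is only an illustration, not the code from the attached source: the File value comes from the listing earlier in the post (File == where), while Where::Default, Where::Stream, the Options struct, and ParseArgs are names made up for this sketch. It also assumes the switches are spelled with two ASCII hyphens.

```cpp
#include <string>
#include <vector>

// Where the synthesized audio should go.
// Only "File" appears in the post; the other names are assumptions.
enum class Where { Default, File, Stream };

// Hypothetical container for the parsed command line.
struct Options {
    std::wstring phrase;           // text to speak
    Where where = Where::Default;  // output destination
    std::wstring file;             // only meaningful when where == Where::File
};

// Parses: say "phrase" [--file <filename> | --stream]
// args excludes the program name. Returns false on a usage error.
bool ParseArgs(const std::vector<std::wstring> &args, Options &out) {
    out = Options{};
    if (args.empty()) {
        return false; // the phrase is required
    }
    out.phrase = args[0];

    if (args.size() == 1) {
        return true; // neither switch: play to the default output
    }
    if (args.size() == 2 && args[1] == L"--stream") {
        out.where = Where::Stream;
        return true;
    }
    if (args.size() == 3 && args[1] == L"--file") {
        out.where = Where::File;
        out.file = args[2];
        return true;
    }
    return false; // unknown switch, missing filename, or extra arguments
}
```

Treating the two switches as mutually exclusive falls out of the argument counts: anything other than exactly one of the three accepted shapes is rejected.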

Here’s how to generate a .wav file (uh.wav attached)

>say.exe "uh" --file uh.wav
Stream is 1

And here’s how to generate an output stream. The app consumes the stream and prints the INT16 sample values to the console (uh.txt attached).

>say.exe "uh" --stream
Stream is 1
       0        0;        0        0;        0        0;        0        0
       0        0;        0        0;        0        0;        0        0

      86       86;    -1052    -1052;    -2839    -2839;    -3774    -3774
   -4199    -4199;    -4581    -4581;    -4284    -4284;    -3640    -3640
   -3100    -3100;    -2011    -2011;     -393     -393;      533      533
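
For anyone who wants to produce a similar dump from their own captured samples, here is a rough sketch of the formatting step. It is not the code from the attached source. It assumes the captured bytes have already been copied into a vector of interleaved 16-bit samples, and that the layout is two channels per frame and four frames per line, which is what the output above appears to show. FormatSamples is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Formats interleaved 16-bit PCM samples as text.
// Each sample is right-aligned in 8 columns; channels within a frame are
// separated by a space, frames by "; ", and a newline follows every
// framesPerLine frames (and the final frame).
std::string FormatSamples(
    const std::vector<std::int16_t> &samples,
    std::size_t channels = 2,
    std::size_t framesPerLine = 4
) {
    assert(channels > 0 && framesPerLine > 0);

    std::string out;
    // A trailing partial frame, if any, is ignored.
    const std::size_t frames = samples.size() / channels;

    for (std::size_t f = 0; f < frames; f++) {
        for (std::size_t c = 0; c < channels; c++) {
            char buf[16];
            std::snprintf(buf, sizeof(buf), "%8d",
                static_cast<int>(samples[f * channels + c]));
            out += buf;
            if (c + 1 < channels) {
                out += ' ';
            }
        }
        const bool endOfLine =
            ((f + 1) % framesPerLine == 0) || (f + 1 == frames);
        out += endOfLine ? "\n" : "; ";
    }
    return out;
}
```

Calling FormatSamples on the captured samples and writing the result to the console gives one line per four stereo frames. In the dump above the left and right values in each frame are identical.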

EDIT September 22 2015: moved source to github

Comments (3)

  1. Francisco Aguilera says:

    Great article!

  2. Francisco Aguilera says:

    If I had to be nitpicky, I would say, however, not to use "where" as a variable name as it is a keyword.