What’s new in Windows Speech Recognition?

Now that the Beta of Windows 7 is out, it’s time to talk about the improvements and new features in Windows Speech Recognition.

For Windows 7, we focused primarily on improving the user experience and removing the “rough spots” that we did not have time to fix in Vista.

First and foremost, we focused on performance. 

  • We rewrote the logic that builds the “say what you see” grammar to use the new native UI Automation API (instead of the MSAA IAccessible API).  This reduces the number of cross-process COM calls by an order of magnitude, and makes grammar generation about 5 to 6 times faster.
  • The document harvester also has substantial performance improvements.
  • Building the “start application” grammar is also much faster.

Second, we focused on usability.

  • Dictation into TSF-unaware applications (those that do not support the Text Services Framework) works much better than it did before.  Now, when you dictate into an unaware application, the dictation scratchpad appears.
    • You can use the scratchpad as a temporary document, and it works with voice, mouse, and keyboard: you can type, use the arrow keys for navigation, or use the mouse or voice commands to select and correct text before inserting the finished text into the unaware application.
    • If you don’t like the scratchpad, you can turn it off, and your dictations will be directly inserted into the unaware application.
  • Sleep mode works much better than it did before; false recognitions of “start listening” have been greatly reduced.
  • We simplified the transitions between OFF and SLEEP mode.  For security reasons, “stop listening” now switches to OFF by default, although users can change the default to SLEEP mode.  (We call this “voice activation” in the Control Panel and First Time User Experience.)
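
To make the mode transitions above concrete, here is a minimal Python sketch of how they might be modeled. It is only an illustration, not the WSR implementation: the `Recognizer` class, its `voice_activation` flag, and the `turn_on` method (standing in for turning the recognizer on by hand) are hypothetical names invented for this example.

```python
from enum import Enum


class Mode(Enum):
    OFF = "off"      # not listening at all
    SLEEP = "sleep"  # waiting only for "start listening"
    ON = "on"        # recognizing commands and dictation


class Recognizer:
    """Toy model of the listening modes (hypothetical, for illustration)."""

    def __init__(self, voice_activation=False):
        # Hypothetical flag standing in for the "voice activation" setting.
        # It is off by default, so "stop listening" leads to OFF.
        self.voice_activation = voice_activation
        self.mode = Mode.OFF

    def turn_on(self):
        # Assumption: the user turns the recognizer on by hand.
        self.mode = Mode.ON

    def hear(self, phrase):
        """Apply a spoken phrase and return the resulting mode."""
        if self.mode is Mode.ON and phrase == "stop listening":
            # Default is OFF; SLEEP only if voice activation was enabled.
            self.mode = Mode.SLEEP if self.voice_activation else Mode.OFF
        elif self.mode is Mode.SLEEP and phrase == "start listening":
            self.mode = Mode.ON
        # In OFF mode nothing is heard, so no phrase changes the mode.
        return self.mode
```

In this model, a recognizer created with the default settings ignores “start listening” after it has been told to stop, while one created with `voice_activation=True` goes to SLEEP and can be woken by voice.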

Third, we looked at accuracy.

  • The SR engine now uses the WASAPI audio stack, so we support array microphones and echo cancellation; this vastly improves WSR’s accuracy when used without a headset.
  • Document harvesting runs periodically, rather than just at startup; this lets the harvester pick up new documents as you create them, rather than waiting until you restart speech recognition.
  • You can upload your training data to Microsoft, so that we can improve the recognizers in the future.  (You have to initiate this, incidentally; we will not upload any data without your explicit consent.)
  • The Chinese recognizer has substantial accuracy improvements as well.
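
The difference between harvesting once at startup and harvesting periodically can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the actual harvester: the `Harvester` class and its `harvest` method are hypothetical, and the sketch only tracks file paths, not the text that a real harvester would extract from them.

```python
import os


class Harvester:
    """Toy incremental harvester (hypothetical, for illustration)."""

    def __init__(self):
        # Paths already picked up on an earlier pass.
        self.seen = set()

    def harvest(self, folder):
        """Return the files under `folder` not seen on any earlier pass."""
        new_paths = []
        for root, _dirs, files in os.walk(folder):
            for name in files:
                path = os.path.join(root, name)
                if path not in self.seen:
                    self.seen.add(path)
                    new_paths.append(path)
        return sorted(new_paths)
```

Calling `harvest` only once corresponds to the startup-only behavior; calling it again on a timer returns just the documents created since the previous pass.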

Lastly, we made a few tweaks to the recognizer.  In Vista, third-party applications couldn’t tell whether the shared recognizer was ON or SLEEPing.  For Windows 7, there are new APIs that expose SLEEP mode.