New Audio APIs for Vista

In an earlier post, I mentioned that we totally re-wrote the audio stack for Windows Vista.  Today I want to talk a bit about the APIs that came along with the new stack.

There are three major API components to the Vista audio architecture:

  • Multimedia Device API (MMDEVAPI) - an API for enumerating and managing audio endpoints.
  • Device Topology - an API for discovering the internals of your audio card's topology.
  • Windows Audio Session API (WASAPI) - the low-level API for rendering audio.

All the existing audio APIs have been re-plumbed to use these APIs internally; in Vista, all audio goes through these three APIs.  For the vast majority of existing audio applications, things should "just work"...

In general, we don't expect that anyone will move to these new APIs.  They're documented for completeness reasons, but the reality is that unless you're dealing with extremely low-latency audio (sub-20ms), or writing a control panel applet for a specific audio adapter, you're not likely to ever want to deal with them (the new APIs really are very low-level APIs - using the higher-level APIs is both easier and less error prone).

MMDEVAPI:

MMDEVAPI is the entrypoint API - it's a COM class that allows applications to enumerate endpoints and "activate" interfaces on them.  Endpoints fall into two general types: Capture and Render (you can think of Capture endpoints as things like microphones and line-in, and Render endpoints as things like speakers).  MMDEVAPI also allows the user to manage defaults for each of these types.  As I write this, there are actually three different sets of defaults supported in Vista: "Console", "Multimedia", and "Communications".  "Console" is used for general purpose audio, "Multimedia" is intended for audio playback applications (media players, etc.), and "Communications" is intended for voice communications (applications like Yahoo! Messenger, Microsoft Communicator, etc.).
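
To give a rough idea of what this looks like in code, here's a minimal sketch using the pre-release SDK names (mmdeviceapi.h, MMDeviceEnumerator, the eConsole/eMultimedia/eCommunications roles) - these may change before we ship, so treat it as illustrative rather than definitive.  It enumerates the active render endpoints and fetches the default endpoint for each of the three roles:

// Sketch only - assumes COM is initialized and omits error handling.
// Interface and constant names come from the pre-release mmdeviceapi.h.
#include <windows.h>
#include <mmdeviceapi.h>

void ShowRenderEndpoints()
{
    IMMDeviceEnumerator *enumerator = NULL;
    CoCreateInstance(__uuidof(MMDeviceEnumerator), NULL, CLSCTX_ALL,
                     __uuidof(IMMDeviceEnumerator), (void**)&enumerator);

    // All the active render endpoints (desktop speakers, SPDIF, USB headset, ...)
    IMMDeviceCollection *devices = NULL;
    enumerator->EnumAudioEndpoints(eRender, DEVICE_STATE_ACTIVE, &devices);

    UINT count = 0;
    devices->GetCount(&count);

    // The three per-role defaults described above.
    IMMDevice *console = NULL, *multimedia = NULL, *communications = NULL;
    enumerator->GetDefaultAudioEndpoint(eRender, eConsole,        &console);
    enumerator->GetDefaultAudioEndpoint(eRender, eMultimedia,     &multimedia);
    enumerator->GetDefaultAudioEndpoint(eRender, eCommunications, &communications);

    // ... use the endpoints, then Release() everything.
}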

Windows XP had two sets of defaults (the "default" default and the "communications" default); we're adding a third default type to enable multimedia playback.  Consider the following scenario: I have a Media Center computer.  The SPDIF output from the audio adapter is connected to my home AV receiver, I have a USB headset that I want to use for VOIP, and there are stereo speakers connected to the machine that I use for day-to-day operations.  We want to enable applications to make intelligent choices when they choose which audio device to use - the default in this scenario is to use the desktop speakers, but we want to allow Communicator (or Messenger, or whatever) to use the headset, and Media Center to use the external receiver.  We may end up changing these sets before Vista ships, but this gives a flavor of what we're thinking about.

MMDEVAPI supports an "activation" design pattern - essentially, instead of calling a class factory to create a generic object and then binding that object to another object, with activation you enumerate objects (endpoints in this case) and "activate" an interface on a particular object.  It's a really convenient pattern when you have a set of objects that may or may not support the same interfaces.
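
Here's a hedged sketch of the activation pattern (again using pre-release names like IMMDevice and IAudioClient, which are assumptions on my part until we ship) - instead of CoCreating an audio client directly, you pick an endpoint and ask it to activate the interface for you:

// Sketch of the activation pattern: ask the endpoint object for an
// interface rather than CoCreating one and binding it afterwards.
#include <windows.h>
#include <mmdeviceapi.h>
#include <audioclient.h>

IAudioClient *ActivateClientOnDefaultEndpoint()
{
    IMMDeviceEnumerator *enumerator = NULL;
    CoCreateInstance(__uuidof(MMDeviceEnumerator), NULL, CLSCTX_ALL,
                     __uuidof(IMMDeviceEnumerator), (void**)&enumerator);

    IMMDevice *endpoint = NULL;
    enumerator->GetDefaultAudioEndpoint(eRender, eConsole, &endpoint);

    // "Activate" an IAudioClient on this particular endpoint - the same
    // call could just as easily activate a different endpoint interface
    // (IDeviceTopology, for instance).
    IAudioClient *client = NULL;
    endpoint->Activate(__uuidof(IAudioClient), CLSCTX_ALL, NULL, (void**)&client);

    endpoint->Release();
    enumerator->Release();
    return client;
}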

Btw, you can access the category defaults using wave or mixer messages; this page from MSDN describes how to access them - the console default is accessed via DRVM_MAPPER_PREFERRED_GET and the communications default is accessed via DRVM_MAPPER_CONSOLEVOICECOM_GET.
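
For the wave case, the pattern looks roughly like the sketch below.  The DRVM_MAPPER_* constants aren't in the standard SDK headers - to my knowledge they live in the DDK's mmddk.h (or you can define them yourself per that MSDN page) - so consider the header an assumption:

// Sketch: ask the wave mapper which device the user prefers.
// Link against winmm.lib; error handling omitted.
#include <windows.h>
#include <mmsystem.h>
#include <mmddk.h>   // DRVM_MAPPER_PREFERRED_GET, DRVM_MAPPER_CONSOLEVOICECOM_GET

DWORD GetPreferredWaveOutDevice()
{
    DWORD deviceId = 0;
    DWORD flags = 0;

    // This retrieves the "console" default; use DRVM_MAPPER_CONSOLEVOICECOM_GET
    // for the communications default instead.
    waveOutMessage((HWAVEOUT)(UINT_PTR)WAVE_MAPPER, DRVM_MAPPER_PREFERRED_GET,
                   (DWORD_PTR)&deviceId, (DWORD_PTR)&flags);
    return deviceId;
}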

Device Topology:

Personally, I don't believe that anyone will ever use Device Topology, except for audio hardware vendors who are writing custom control panel extensions.  It exists for control panel type applications that need to be able to determine information about the actual hardware. 

Device Topology exposes collections of parts and the connections between those parts.  On any part, there are zero or more controls, which roughly correspond to the controls exposed by the audio driver.  One cool thing about device topologies is that topologies can connect to other topologies.  So in the future, it's possible that an application running on an RDP server may be able to enumerate and address the audio devices on the RDP client - instead of treating the client as an endpoint, the server might be able to enumerate the device topology on the RDP client and manipulate controls directly on the client.  Similarly, in the future, the hardware volume control for a SPDIF connector might manipulate the volume on an external AV receiver via an external control connection (1394 or S/LINK).
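
Just to give a flavor of the shape of the API, here's a rough sketch (pre-release devicetopology.h names such as IDeviceTopology, IConnector, and IPart - all subject to change before Vista ships) that activates the topology on an endpoint and follows its connector into the adapter's topology:

// Sketch of walking a device topology: hop across the endpoint's
// connector into the adapter's topology and read the name of the part
// on the other side.  Error handling omitted.
#include <windows.h>
#include <mmdeviceapi.h>
#include <devicetopology.h>

void DumpFirstConnectedPart(IMMDevice *endpoint)
{
    IDeviceTopology *topology = NULL;
    endpoint->Activate(__uuidof(IDeviceTopology), CLSCTX_ALL, NULL, (void**)&topology);

    // An endpoint's topology has a connector that leads into the audio
    // adapter's topology.
    IConnector *endpointConnector = NULL;
    topology->GetConnector(0, &endpointConnector);

    IConnector *adapterConnector = NULL;
    endpointConnector->GetConnectedTo(&adapterConnector);

    // A connector is also a part; parts carry the controls.
    IPart *part = NULL;
    adapterConnector->QueryInterface(__uuidof(IPart), (void**)&part);

    LPWSTR name = NULL;
    part->GetName(&name);
    // ... inspect the part's controls here, then CoTaskMemFree(name)
    //     and Release() the interfaces.
}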

One major change between XP and Vista is that Device Topology will never lie about the capabilities of the hardware - before Vista, if a piece of hardware didn't have a particular control, the system tried to be helpful and provide controls that it thought ought to be there (for instance, if a piece of hardware didn't have a volume control, the system helpfully added one).  For Vista, we're reporting exactly what the audio hardware reports, and nothing more.  This is a part of our philosophy of "don't mess with the user's audio streams if we don't have to" - emulating a hardware control when it's not necessary adds potentially unwanted DSP to the audio stream.

Again, the vast majority of applications shouldn't need to use these controls; for most applications, the functionality provided by the primary APIs (mixerLine, wave, DSound, etc.) is going to be more suitable for their needs.

WASAPI:

WASAPI is the "big kahuna" for the audio engine.  You activate WASAPI on an endpoint, and it provides functionality for rendering/capturing audio streams.  It also provides functions to manage the audio clock and manipulate the volume of the audio stream.
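
Both of those hang off the audio client via GetService.  Here's a minimal sketch, assuming you already have an initialized IAudioClient on an endpoint (initialization itself is sketched after the next paragraph) and assuming the pre-release interface names (IAudioClock, ISimpleAudioVolume) survive to ship:

// Sketch: per-stream clock and volume services off an initialized client.
#include <windows.h>
#include <audioclient.h>

void QueryStreamServices(IAudioClient *client)
{
    // The stream's clock: frequency and current position.
    IAudioClock *clock = NULL;
    client->GetService(__uuidof(IAudioClock), (void**)&clock);

    UINT64 frequency = 0, position = 0;
    clock->GetFrequency(&frequency);
    clock->GetPosition(&position, NULL);

    // The per-stream volume.
    ISimpleAudioVolume *volume = NULL;
    client->GetService(__uuidof(ISimpleAudioVolume), (void**)&volume);
    volume->SetMasterVolume(0.5f, NULL);   // half scale

    clock->Release();
    volume->Release();
}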

In general, WASAPI operates in two modes.  In "shared" mode, audio streams are rendered by the application and mixed by the global audio engine before they're rendered out to the audio device.  In "exclusive" mode, audio streams are rendered directly to the audio adapter, and no other application's audio will play.  Obviously the vast majority of applications will operate in shared mode - that's the default for the wave APIs and DSound.  One relatively common scenario that WILL use exclusive mode is rendering content that requires a codec that's present in the hardware but that Windows doesn't understand.  A simple example of this is compressed AC3 audio rendered over a SPDIF connection - if you attempt to render this content and Windows doesn't have a decoder for it, then DSound will automatically initialize WASAPI in exclusive mode and render the content directly to the hardware.
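
And here's a hedged sketch of what setting up a shared-mode render stream looks like with the pre-release audioclient.h (names and details may change before we ship): get the engine's mix format, initialize the client, grab the render service, and start the stream.

// Sketch of a shared-mode WASAPI render stream.  Error handling and the
// ongoing fill loop are omitted for brevity.
#include <windows.h>
#include <mmdeviceapi.h>
#include <audioclient.h>

void StartSharedModeStream(IMMDevice *endpoint)
{
    IAudioClient *client = NULL;
    endpoint->Activate(__uuidof(IAudioClient), CLSCTX_ALL, NULL, (void**)&client);

    // In shared mode you render in the engine's mix format; in exclusive
    // mode you'd pass AUDCLNT_SHAREMODE_EXCLUSIVE and a device format.
    WAVEFORMATEX *mixFormat = NULL;
    client->GetMixFormat(&mixFormat);

    REFERENCE_TIME bufferDuration = 10 * 1000 * 10;   // 10ms, in 100ns units
    client->Initialize(AUDCLNT_SHAREMODE_SHARED, 0, bufferDuration, 0,
                       mixFormat, NULL);

    IAudioRenderClient *renderClient = NULL;
    client->GetService(__uuidof(IAudioRenderClient), (void**)&renderClient);

    UINT32 bufferFrames = 0;
    client->GetBufferSize(&bufferFrames);

    // Fill the first buffer with silence and start the stream.
    BYTE *data = NULL;
    renderClient->GetBuffer(bufferFrames, &data);
    renderClient->ReleaseBuffer(bufferFrames, AUDCLNT_BUFFERFLAGS_SILENT);
    client->Start();

    // ... feed more data periodically, then Stop() and Release() everything.
}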

If your application is a pro audio application, or is interested in extremely low-latency audio, then you probably want to consider using WASAPI; otherwise it's better to stick with the existing APIs.

Tomorrow: Volume control (a subject that's near and dear to my heart) :)