New Audio APIs for Vista

In an earlier post, I mentioned that we totally re-wrote the audio stack for Windows Vista.  Today I want to talk a bit about the APIs that came along with the new stack.

There are three major API components to the Vista audio architecture:

  • Multimedia Device API (MMDEVAPI) - an API for enumerating and managing audio endpoints.
  • Device Topology - an API for discovering the internals of your audio card's topology.
  • Windows Audio Session API (WASAPI) - the low level API for rendering and capturing audio.

All the existing audio APIs have been re-plumbed to use these APIs internally; in Vista, all audio goes through these three APIs.  For the vast majority of the existing audio applications, things should "just work"...

In general, we don't expect that anyone will move to these new APIs.  They're documented for completeness, but the reality is that unless you're dealing with extremely low latency audio (sub 20ms), or writing a control panel applet for a specific audio adapter, you're not likely to ever want to deal with them.  The new APIs really are very low level APIs - using the higher level APIs is both easier and less error prone.


MMDEVAPI:

MMDEVAPI is the entrypoint API - it's a COM class that allows applications to enumerate endpoints and "activate" interfaces on them.  Endpoints fall into two general types: Capture and Render.  You can think of Capture endpoints as microphones and line in; Render endpoints are things like speakers.  MMDEVAPI also allows the user to manage defaults for each of the types.  As I write this, there are actually three different sets of defaults supported in Vista: "Console", "Multimedia", and "Communications".  "Console" is used for general purpose audio, "Multimedia" is intended for audio playback applications (media players, etc), and "Communications" is intended for voice communications (applications like Yahoo! Messenger, Microsoft Communicator, etc).

Windows XP had two sets of defaults (the "default" default and the "communications" default); we're adding a 3rd default type to enable multimedia playback.  Consider the following scenario.  I have a Media Center computer.  The SPDIF output from the audio adapter is connected to my home AV receiver, I have a USB headset that I want to use for VOIP, and there are stereo speakers connected to the machine that I use for day-to-day operations.  We want to enable applications to make intelligent choices when they choose which audio device to use - the default in this scenario is to use the desktop speakers, but we want to allow Communicator (or Messenger, or whatever) to use the headset, and Media Center to use the external receiver.  We may end up changing these sets before Vista ships, but this gives a flavor of what we're thinking about.
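
To make the scenario concrete, here is a small Python sketch of per-role defaults.  It only models the idea; it is not the MMDEVAPI interface.  The `Role` enum, the `pick_render_endpoint` helper, and the endpoint labels are all hypothetical names chosen for illustration.

```python
from enum import Enum


class Role(Enum):
    """The three default categories described above."""
    CONSOLE = "console"
    MULTIMEDIA = "multimedia"
    COMMUNICATIONS = "communications"


# Per-role default render endpoints for the Media Center machine in the
# scenario. The endpoint names are illustrative labels, not real device IDs.
DEFAULT_RENDER_ENDPOINT = {
    Role.CONSOLE: "Desktop speakers",
    Role.MULTIMEDIA: "SPDIF out (AV receiver)",
    Role.COMMUNICATIONS: "USB headset",
}


def pick_render_endpoint(role):
    """Return the endpoint an application should use for the given role."""
    return DEFAULT_RENDER_ENDPOINT[role]


# A media player asks for the multimedia default, a VOIP client for the
# communications default, and everything else for the console default.
media_center_output = pick_render_endpoint(Role.MULTIMEDIA)
voip_output = pick_render_endpoint(Role.COMMUNICATIONS)
system_sounds_output = pick_render_endpoint(Role.CONSOLE)
```

The point of the sketch is that an application states what kind of audio it is playing and lets the user's per-role choice decide which device that maps to.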

MMDEVAPI supports an "activation" design pattern.  Without activation, you would call a class factory to create a generic object and then bind that object to another object.  With activation, you enumerate objects (endpoints in this case) and "activate" an interface directly on the object you picked.  It's a really convenient pattern when you have a set of objects that may or may not have the same type.
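
The shape of the activation pattern can be sketched with a toy Python model.  This is not the real MMDEVAPI (which is a set of COM interfaces); the `Endpoint` class, its `activate` method, `enumerate_endpoints`, and the two interface classes are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass


class AudioClient:
    """Hypothetical stand-in for a streaming interface."""
    def __init__(self, endpoint_name):
        self.endpoint_name = endpoint_name


class TopologyView:
    """Hypothetical stand-in for a hardware-inspection interface."""
    def __init__(self, endpoint_name):
        self.endpoint_name = endpoint_name


@dataclass
class Endpoint:
    name: str
    flow: str  # "render" or "capture"
    # Interfaces this particular endpoint is able to hand out.
    supported: tuple = (AudioClient, TopologyView)

    def activate(self, interface):
        """Return an `interface` object that is already bound to this endpoint."""
        if interface not in self.supported:
            raise TypeError(f"{self.name} does not support {interface.__name__}")
        return interface(self.name)


def enumerate_endpoints(endpoints, flow):
    """Return the endpoints with the requested data flow."""
    return [e for e in endpoints if e.flow == flow]


ENDPOINTS = [
    Endpoint("Desktop speakers", "render"),
    Endpoint("SPDIF out", "render", supported=(AudioClient,)),
    Endpoint("USB headset microphone", "capture"),
]

# Enumerate first, then activate an interface on the endpoint you chose.
render = enumerate_endpoints(ENDPOINTS, "render")
client = render[0].activate(AudioClient)
```

Note that there is no separate "create, then bind" step: the object that comes back from `activate` already knows which endpoint it belongs to, and each endpoint decides for itself which interfaces it can hand out.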

Btw, you can access the category defaults using wave or mixer messages.  This page from MSDN describes how to access them - the console default is accessed via DRVM_MAPPER_PREFERRED_GET and the communications default is accessed via DRVM_MAPPER_CONSOLEVOICECOM_GET.

Device Topology:

Personally, I don't believe that anyone will ever use Device Topology, except for audio hardware vendors who are writing custom control panel extensions.  It exists for control panel type applications that need to be able to determine information about the actual hardware. 

Device Topology exposes collections of parts and the connections between those parts.  On any part, there are zero or more controls, which roughly correspond to the controls exposed by the audio driver.  One cool thing about device topologies is that topologies can connect to other topologies.  So in the future, it's possible that an application running on an RDP server may be able to enumerate and address the audio devices on the RDP client - instead of treating the client as an endpoint, the server might be able to enumerate the device topology on the RDP client and manipulate controls directly on the client.  Similarly, in the future, the hardware volume control for a SPDIF connector might manipulate the volume on an external AV receiver via an external control connection (1394 or S/LINK).

One major change between XP and Vista is that Device Topology will never lie about the capabilities of the hardware.  Before Vista, if a piece of hardware didn't have a particular control, the system tried to be helpful and provide controls that it thought ought to be there (for instance, if a piece of hardware didn't have a volume control, the system helpfully added one).  For Vista, we're reporting exactly what the audio hardware reports, and nothing more.  This is a part of our philosophy of "don't mess with the user's audio streams if we don't have to" - emulating a hardware control when it's not necessary adds potentially unwanted DSP to the audio stream.

Again, the vast majority of applications shouldn't need to use these controls.  For most applications, the functionality provided by the primary APIs (mixerLine, wave, DSound, etc) is going to be more suitable for their needs.


WASAPI:

WASAPI is the "big kahuna" for the audio engine.  You activate WASAPI on an endpoint, and it provides functionality for rendering/capturing audio streams.  It also provides functions to manage the audio clock and manipulate the volume of the audio stream.

In general, WASAPI operates in two modes.  In "shared" mode, audio streams are rendered by the application and mixed by the global audio engine before they're rendered out the audio device.  In "exclusive" mode, audio streams are rendered directly to the audio adapter, and no other application's audio will play.  Obviously the vast majority of applications will operate in shared mode; that's the default for the wave APIs and DSound.  One relatively common scenario that WILL use exclusive mode is rendering content that requires a codec that's present in the hardware that Windows doesn't understand.  A simple example of this is compressed AC3 audio rendered over a SPDIF connection: if you attempt to render this content and Windows doesn't have a decoder for it, then DSound will automatically initialize WASAPI in exclusive mode and will render the content directly to the hardware.

If your application is a pro audio application, or is interested in extremely low latency audio, then you probably want to consider using WASAPI; otherwise it's better to stick with the existing APIs.
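
To get a feel for what a 20ms latency budget means in practice, here is a bit of back-of-the-envelope arithmetic in Python.  The stream format (48 kHz, 16-bit, stereo) is my assumption for the example, and the helper names are made up; nothing here calls into WASAPI.

```python
def frames_for_latency(sample_rate_hz, latency_ms):
    """Number of audio frames that span `latency_ms` at `sample_rate_hz`."""
    return sample_rate_hz * latency_ms // 1000


def bytes_for_latency(sample_rate_hz, latency_ms, channels, bytes_per_sample):
    """Buffer size in bytes for interleaved PCM covering `latency_ms`."""
    frames = frames_for_latency(sample_rate_hz, latency_ms)
    return frames * channels * bytes_per_sample


# Assumed format: 48 kHz, 2 channels, 2 bytes (16 bits) per sample.
frames = frames_for_latency(48_000, 20)  # 960 frames
size = bytes_for_latency(48_000, 20, channels=2, bytes_per_sample=2)  # 3840 bytes
```

In other words, at that format the whole path from the application to the speaker has to make do with fewer than a thousand frames of buffering, which is why this is territory for the low level API.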

Tomorrow: Volume control (a subject that's near and dear to my heart) 🙂

Comments (22)

  1. Anonymous says:

    Boy I’m glad I’m a programmer. If I was a bulldozer, I’d have to live with WASABI in my ears for the next 5-10 years.

  2. Anonymous says:

    Do I understand correctly that DSound is now layered on top of this new API (WASAPI)?

  3. Anonymous says:

    "MMDEVAPI is the entrypoint API – it’s a COM class…"

    Still COM? Is there a specific reason you guys (and gals) made the API set a COM object? I had hoped that every new API set in Vista would be implemented as a .NET class. I thought this was the basic idea of WinFX.

  4. Anonymous says:

    This is cool. WASAPI sounds a lot promising.

    Will Device Topology APIs let developers do for example device aggregation? (when i have 2 audio devices with 2 input both, it sees them as one device with 4 inputs?)

  5. Anonymous says:

    As a music lover, audio enthusiast and all-round multimedia guy i’m simply loving these posts of yours. I simply can’t wait to take a look at these API’s and to experiment with the new audio engine. Seems like you guys really worked hard and seriously on this 🙂

    One thing i was wondering. Considering the endpoints you talked about, could you for example create a software endpoint that took 5.1 audio and encoded it to an AC3 stream and in turn redirected that after encoding to a SPDIF, in effect creating a software version of a Dolby Digital encoder?

    Just a thought that went through my head just now 😉

  6. Anonymous says:

    Well I’m a pro audio guy and this low latency stuff sounds like a winner. Keep up the good work.

  7. Dennis, EVERYTHING involving audio is layered over WASAPI.

    And the reason it’s all unmanaged is relatively simple: We don’t want every single application calling PlaySound to have to have the .Net framework injected into it.

    steamy, I don’t think so, but I’m not sure it matters – MMDEVAPI presents the inputs from all the different adapters as separate capture inputs – the endpoint abstraction allows you to divorce the capture/render device from the hardware.

    Stebet, that’s an interesting thought…

  8. Anonymous says:


    I’m doing research for an app that’s scriptable and able to mix multiple tracks together while controlling the volume envelope by some mathematical means. The resulting sound should be able to be written to a file.

    Which API do you think I should start looking at?

    I’ve found some open source command line tools that can do the mixing, but not the envelope control yet.

    Thanks for your input.

  9. Minh, I’d look at DirectSound, it should have the flexibility to do what you need. The mixing tracks is easy (you just add the samples and clip), the volume envelope makes things trickier.

    Essentially you write your streams in direct sound secondary buffers, you should be able to tap off the primary buffer before it gets rendered (I think). If not, then you can certainly do this with DirectShow.
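
    The "add the samples and clip" step, plus a per-track volume envelope, can be sketched in plain Python. This assumes 16-bit signed PCM samples held in lists; the `mix` and `clip16` helpers are hypothetical names for illustration and do not use DirectSound or any other audio API.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def clip16(value):
    """Clamp a mixed sample to the 16-bit signed range."""
    return max(INT16_MIN, min(INT16_MAX, value))


def mix(tracks, envelopes=None):
    """Mix tracks sample by sample, applying an optional gain envelope to each.

    `tracks` is a list of lists of 16-bit samples. `envelopes` is a list of
    functions, one per track, mapping a sample index to a gain factor.
    Shorter tracks are treated as silent once they run out.
    """
    if envelopes is None:
        envelopes = [lambda i: 1.0] * len(tracks)
    length = max((len(t) for t in tracks), default=0)
    mixed = []
    for i in range(length):
        total = 0.0
        for track, gain in zip(tracks, envelopes):
            if i < len(track):
                total += track[i] * gain(i)
        # Adding samples can overflow the 16-bit range, so clip the sum.
        mixed.append(clip16(round(total)))
    return mixed


loud = mix([[1000, 30000, -30000], [2000, 10000, -10000]])
faded = mix([[100, 100, 100]], envelopes=[lambda i: i / 2])
```

    Here `loud` shows the clipping (the second and third sums fall outside the 16-bit range and are clamped), and `faded` shows a linear fade-in applied through the envelope function.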

  10. Anonymous says:

    I just wanted to say that I appreciate that the effort that has been put into ensuring old wave APIs still work and rerouting them through new systems. It was nice when the multimedia APIs got rerouted through the kernel mixer and us luddites that still used waveIn/waveOut got the benefits of lower latency and sound device sharing.

  11. Anonymous says:

    Does this then support some obvious use cases for a family (shared) media computer? E.g. playing AC3 out of SPDIF to AV amp for the DVD being played while running the VOIP calls output to a USB handset?

  12. Anonymous says:

    you say you have (or plan) set default endpoints for different types. system/comms and multimedia. which is great, but how extensible will this be? for example my audio card has two separate outputs, one i use for headphones and one goes across the room under the carpet to the hifi. now (assuming drivers that show them as separate endpoints) how would it behave in these scenarios:

    1. playing music and a game, music to go over hifi (to entertain my guests) and game sound to go into headphones (for maximum frag count)? both would be considered MM

    2. playing said game but with VOIP built in (for the purposes of this example also assume this is branded MM). now i have three channels. but now two come from the same exe. suppose something funny happens and i say "hey guys listen to this!" could i press a button and have the VOIP stream suddenly out of the headphones and hifi, leaving the other two be?

    on the xbox you can usually choose on the fly to toggle the VOIP between the headset/speaker output, i’m assuming that this could be the case?

    basically a lazy game programmer could just bung everything into MM, but would a clever programmer have the ability to leave these sorts of things up to the user?

  13. Anonymous says:

    Does that mean, that all the "old" low-latency entrypoints (kernel-streaming either from user-space as well as kernel-space) will not exist anymore?

    How would a kernel-mode-driver that needs to stream audio to a soundcard work using Vista?

    Best regards,


  14. Anonymous says:

    When we plug our multichannel sound card into Vista Beta 1 we see a whole lot of render devices called "Speaker Out" and a whole lot of capture devices called "Line In". Is there any way under Vista to differentiate multiple outputs and inputs of the same type? I’m really pulling my hair out over this as it’s going to wreck our entire line of multichannel sound and radio capture cards.

  15. Anonymous says:

    In the world of digital audio recording, latency corresponds to the waiting time
