What is DMA (Part 4) - Common Buffer

The DMA API also allows you to create a section of kernel memory which you can share between your driver and your device. This memory is known as "common buffer", and has a variety of uses with modern PCI devices. You can allocate a piece of common buffer by calling the AllocateCommonBuffer function in your DMA_ADAPTER object. This function takes a length and returns the virtual and logical address of your new buffer.
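For example, the call might look something like this minimal sketch (it assumes you've already obtained a DMA_ADAPTER named DmaAdapter from IoGetDmaAdapter; the 64KB length and the cache setting are arbitrary choices for illustration):

    PHYSICAL_ADDRESS logicalAddress;
    PVOID virtualAddress;

    // Call through the adapter's operations table; returns the kernel virtual
    // address and fills in the device-visible (logical) address.
    virtualAddress = DmaAdapter->DmaOperations->AllocateCommonBuffer(
                         DmaAdapter,
                         65536,             // length in bytes
                         &logicalAddress,   // receives the logical address
                         TRUE);             // CacheEnabled

    if (virtualAddress == NULL) {
        // Common buffer can be a scarce resource - handle failure gracefully.
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    // The driver touches the buffer through virtualAddress; the device gets
    // programmed with logicalAddress.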

Common Buffer has four unique attributes that make it useful:

  1. The buffer is physically contiguous.
  2. The buffer is created in a physical address range that your device can access.
  3. Changes your driver makes to the common buffer are visible to your device and vice versa.
  4. You don't need an available map register to make use of it.

The first two attributes cannot be reproduced with any other WDM DDI. MmAllocateContiguousMemory is the closest competitor, but because it's not tied into the HAL it can't determine the correct range of physical addresses for your device. The second two are what make this really useful as a shared buffer.

The biggest downsides of common buffer are that it can't be allocated at DISPATCH_LEVEL, that it's hard to get because physical memory fragments quickly, and that it can be a scarce resource, so you don't want to allocate huge amounts of it. Because of the first two issues you'll probably want to allocate a slab of common buffer during device initialization and then sub-allocate blocks out of that for the various operations. This can be simple if you can break the common buffer into fixed-size blocks (you could then stick them on a lookaside list), or you may find yourself writing your own malloc/free functions.

Because of the last limitation you may find yourself required to scale down what your device can do so you're not allocating 1GB of common buffer. If you're splitting it up into command packets then the amount of common buffer you allocate will limit the number of requests you can send to the device at one time.
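If fixed-size blocks work for your device, the sub-allocator can be as simple as the sketch below (the names, the 512-byte block size, and the use of an SLIST as the free list are all assumptions for illustration, not a required design):

    #define COMMON_BLOCK_SIZE 512

    typedef struct _COMMON_BLOCK {
        SLIST_ENTRY      ListEntry;       // links free blocks together
        PHYSICAL_ADDRESS LogicalAddress;  // device-visible address of this block
    } COMMON_BLOCK, *PCOMMON_BLOCK;

    SLIST_HEADER FreeBlockList;

    // Carve the slab allocated at init time into fixed-size blocks.
    VOID InitializeCommonBlocks(PVOID SlabVa, PHYSICAL_ADDRESS SlabLa, ULONG SlabLength)
    {
        ULONG offset;

        InitializeSListHead(&FreeBlockList);

        for (offset = 0;
             offset + COMMON_BLOCK_SIZE <= SlabLength;
             offset += COMMON_BLOCK_SIZE) {

            PCOMMON_BLOCK block = (PCOMMON_BLOCK)((PUCHAR)SlabVa + offset);
            block->LogicalAddress.QuadPart = SlabLa.QuadPart + offset;
            InterlockedPushEntrySList(&FreeBlockList, &block->ListEntry);
        }
    }

    // Unlike AllocateCommonBuffer, these are fine at DISPATCH_LEVEL.
    PCOMMON_BLOCK AllocateCommonBlock(VOID)
    {
        return (PCOMMON_BLOCK)InterlockedPopEntrySList(&FreeBlockList);
    }

    VOID FreeCommonBlock(PCOMMON_BLOCK Block)
    {
        InterlockedPushEntrySList(&FreeBlockList, &Block->ListEntry);
    }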

Using Common Buffer to hold command packets

The fact that changes made by one side are visible to the other allows you to store commands in the shared section. Let's take as an example something that was common in storage adapters (several years ago, when I worked with them). Your driver writes a small command packet for the device (sketched after this list), which contains:

  • The parameters for the command (a SCSI command descriptor block)
  • A "driver context" parameter which the device uses to report completion (you might use the virtual address of the packet)
  • Space to save the result of the operation (number of bytes transferred, status, etc.)
  • Some scratch space for the device to use (to hold the state of the operation, pointer to the next operation, etc.)
  • The list of logical addresses & lengths which make up the data buffer (the "scatter-gather list"). 
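
Concretely, such a packet might look something like the struct below (a sketch only; the field names, sizes, and the 32-entry scatter-gather limit are assumptions, since the device's specification dictates the real layout):

    #define MAX_SG_ENTRIES 32

    typedef struct _SG_ENTRY {
        PHYSICAL_ADDRESS Address;    // logical address of one data fragment
        ULONG            Length;     // length of that fragment in bytes
    } SG_ENTRY;

    typedef struct _COMMAND_PACKET {
        UCHAR    Cdb[16];                    // SCSI command descriptor block
        ULONG64  DriverContext;              // e.g. the packet's virtual address
        ULONG    BytesTransferred;           // result, written by the device
        ULONG    Status;                     // result, written by the device
        UCHAR    DeviceScratch[64];          // scratch space for the device
        ULONG    SgCount;                    // number of valid entries below
        SG_ENTRY SgList[MAX_SG_ENTRIES];     // the data buffer's fragments
    } COMMAND_PACKET, *PCOMMAND_PACKET;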

While you're setting up the packet the device ignores it. When you're ready to start the operation you write the packet's logical address to a register on the device, which triggers the device to start processing the command. When the device is done it interrupts. Your driver reads the "driver context" value of the completed request from a register, reclaims the packet, and completes the original request.
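
In code the submit/complete cycle might look like the sketch below (the register offsets, the device extension fields, and the completion flow are hypothetical; only the register-access routines are standard WDK calls):

    // Kick off a command by writing the packet's logical address to the
    // device's (hypothetical) doorbell registers.
    VOID SubmitCommand(PDEVICE_EXTENSION Dx, PHYSICAL_ADDRESS PacketLa)
    {
        WRITE_REGISTER_ULONG((PULONG)(Dx->Registers + DOORBELL_LOW),  PacketLa.LowPart);
        WRITE_REGISTER_ULONG((PULONG)(Dx->Registers + DOORBELL_HIGH), (ULONG)PacketLa.HighPart);
    }

    // At interrupt (or DPC) time, read back the driver context of the packet
    // that completed and reclaim it.
    VOID HandleCompletion(PDEVICE_EXTENSION Dx)
    {
        ULONG64 context;
        PCOMMAND_PACKET packet;

        context  = READ_REGISTER_ULONG((PULONG)(Dx->Registers + COMPLETION_LOW));
        context |= (ULONG64)READ_REGISTER_ULONG((PULONG)(Dx->Registers + COMPLETION_HIGH)) << 32;

        packet = (PCOMMAND_PACKET)(ULONG_PTR)context;

        // packet->Status and packet->BytesTransferred were filled in by the
        // device; use them to complete the original request, then recycle
        // the packet (e.g. FreeCommonBlock).
    }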

In this example there are two things that I'd like to point out about the usefulness of common buffer. First - having the shared memory section set up makes it very simple and efficient to get these command packets and share them with the device. Doing this with non-paged pool would be much more complex, since you'd need to call the DMA DDIs first to get a logical address for the packet (which could copy it into a bounce buffer) and call them again when you were done to undo the translation.

Second - since you didn't have to call the DMA DDI to translate your buffer and get a logical address, you didn't have to worry about whether you could find a free map register. Remember that all of your attempts to translate buffers compete for the same pool of map registers. If you were to translate the data buffer and then the command buffer you could end up in a deadlock situation - the data buffer translation could suck up the available registers but you can't release them until the command buffer translation completes. There are some tricks you could use to fix this, but it's better to keep the command data in the common buffer.

Using Common Buffer to set up a continuous transfer

Let me first admit I've never done this before myself. It's more of an audio thing than a storage thing, but I think I can still explain the idea.

Say you have a device which processes data in a continuous stream. The best example of this might be a sound card which runs in a DMA loop, sucking up audio data and pushing it out to the speakers. Such a device might not interrupt when it's done with a particular "transfer" but instead interrupt every time it's done processing a particular amount of data. Rather than issuing individual commands with data buffers you would instead compose the data from various requests into a single stream of data for the device.

For such a device the traditional system of translating data buffers and programming scatter gather lists doesn't work very well. Once a buffer has been translated it can't be modified anymore (since some of the pages may have been copied into bounce buffers ... you can modify the buffer but the bounced pages won't be updated), so you can't do the composition.

Here is another place where common buffer can help you. You can set up your device to transfer from common buffer in a continuous loop and then copy the data to be processed into this buffer at the appropriate offset. Since you don't have to do any extra work to make the common buffer useful, you can simply write your data into the buffer and the device will pick it up as it sweeps through. Assuming the device sweeps through the buffer at a predictable rate, you should be able to figure out where to write the next bits of data as long as there's some mechanism to synchronize your clock with the device's clock once in a while.
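
A sketch of that copy follows (the ring size, the read-pointer register, and the device-extension fields are assumptions; the point is just the wrap-around arithmetic):

    #define RING_SIZE (64 * 1024)   // size of the common-buffer ring

    NTSTATUS CopyIntoRing(PDEVICE_EXTENSION Dx, PUCHAR Data, ULONG Length)
    {
        // Where the device is currently reading (hypothetical register).
        ULONG readOffset  = READ_REGISTER_ULONG((PULONG)(Dx->Registers + RING_READ_PTR));
        ULONG writeOffset = Dx->RingWriteOffset;
        ULONG firstChunk;

        // Don't overwrite data the device hasn't consumed yet.
        ULONG freeSpace = (readOffset - writeOffset - 1 + RING_SIZE) % RING_SIZE;
        if (Length > freeSpace) {
            return STATUS_DEVICE_BUSY;   // a real driver would wait or queue
        }

        // Copy, wrapping at the end of the ring.
        firstChunk = RING_SIZE - writeOffset;
        if (firstChunk > Length) {
            firstChunk = Length;
        }

        RtlCopyMemory((PUCHAR)Dx->RingVa + writeOffset, Data, firstChunk);
        if (firstChunk < Length) {
            RtlCopyMemory(Dx->RingVa, Data + firstChunk, Length - firstChunk);
        }

        Dx->RingWriteOffset = (writeOffset + Length) % RING_SIZE;
        return STATUS_SUCCESS;
    }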

Using Common Buffer to coalesce buffers

The DMA DDI gives you two options for doing bus-mastered transfers. If you say you support scatter-gather I/O in your device description (when you get the DMA_ADAPTER) then the DMA DDI will leave a request physically fragmented. If you don't, the DMA DDI will coalesce the entire thing into a single physically contiguous buffer for you, but it also serializes requests so that it doesn't need more than one buffer to do this.
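The setting in question is the ScatterGather field of the DEVICE_DESCRIPTION you pass to IoGetDmaAdapter. A minimal sketch (only the fields relevant to this discussion are shown, and the exact values depend on your device):

    DEVICE_DESCRIPTION deviceDescription;
    ULONG numberOfMapRegisters;
    PDMA_ADAPTER dmaAdapter;

    RtlZeroMemory(&deviceDescription, sizeof(deviceDescription));
    deviceDescription.Version           = DEVICE_DESCRIPTION_VERSION;
    deviceDescription.Master            = TRUE;    // bus-master device
    deviceDescription.ScatterGather     = TRUE;    // FALSE => the DDI coalesces (and serializes)
    deviceDescription.Dma32BitAddresses = TRUE;    // addressing width is device-specific
    deviceDescription.InterfaceType     = PCIBus;
    deviceDescription.MaximumLength     = 0x10000; // illustrative

    dmaAdapter = IoGetDmaAdapter(PhysicalDeviceObject,
                                 &deviceDescription,
                                 &numberOfMapRegisters);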

If you want something in the middle then you're going to have to handle it on your own. Say your device can only handle 5 fragments for a given DMA operation but you get a request with 6 fragments. Or say you require the fragments to be page aligned, but you're trying to support chained MDLs (which, in summary, means that you may get buffer fragments that aren't page aligned). Neither of these cases can be handled by the Windows DMA engine.

To handle this your driver can, once again, turn to common buffer. If you get a request that you can't handle normally, you can attempt to sub-allocate a single block out of your common buffer and then copy data from the original buffer into the block you just allocated. Now you can program the device with a single physical address and overcome the limitation.
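
A sketch of that copy, assuming a chained MDL named Mdl describing the original buffer and a block already sub-allocated out of the common buffer at BounceVa/BounceLa (hypothetical names):

    ULONG copied = 0;
    PMDL  mdl;

    // Walk the (possibly chained) MDL and copy each fragment into the
    // contiguous common-buffer block.
    for (mdl = Mdl; mdl != NULL; mdl = mdl->Next) {
        PVOID src = MmGetSystemAddressForMdlSafe(mdl, NormalPagePriority);
        ULONG len = MmGetMdlByteCount(mdl);

        if (src == NULL) {
            return STATUS_INSUFFICIENT_RESOURCES;
        }

        RtlCopyMemory((PUCHAR)BounceVa + copied, src, len);
        copied += len;
    }

    // The device can now be programmed with a single fragment:
    // logical address BounceLa, length 'copied' bytes.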

This appears to be a pretty common practice in the networking space, where chained MDLs can result in transfers that consist of several tiny fragments with the various headers attached to the network packet.