DrawInstancedPrimitives in XNA Game Studio 4.0

When writing about future features, there is a danger that things might change between the time I write about them and the time our product ships. For this reason, I have avoided going into too much detail about a couple of planned Game Studio 4.0 features that were not yet entirely implemented and were therefore still at risk.

I finished the new DrawInstancedPrimitives API on Windows a couple of weeks ago, and just now checked in the Xbox implementation, so I figure this is a good time to talk about it.

Mesh instancing is an important performance optimization for many games, but was not exactly consistent across platforms. Our instancing sample shows four approaches:

  • State batching: use Effect.CommitChanges to reduce unnecessary state setting
  • Shader instancing: works everywhere, but requires replicated vertex and index data, which wastes memory
  • VFetch instancing: wastes some memory, although less than shader instancing, but only works on Xbox
  • Hardware instancing: no wasted memory, but only works on Windows and when using shader model 3.0


Game Studio 4.0 still supports variants of all these techniques, but adds a new, easier to use and more portable version of the hardware instancing API.


State batching in Game Studio 4.0

Game Studio 4.0 merges Effect.Begin / End and Effect.CommitChanges into a single EffectPass.Apply API. This was not fully optimized in our CTP release, so games that used state batching saw a performance hit. In our final version, EffectPass.Apply is optimized to be smarter about how much device state needs to be updated when the same effect is applied many times in a row, so state batching has the same performance as in previous releases, and some less carefully optimized drawing code now runs faster.


Shader instancing in Game Studio 4.0

If you are targeting Windows or Xbox, you can implement shader instancing the same way as before. Because Windows Phone does not support programmable shaders, you cannot use the exact same technique on the phone, but the new SkinnedEffect class can provide the same result with similar performance. Replicate many copies of your vertex and index data, the same as for shader instancing in previous Game Studio versions, adding bone indices and weights vertex channels, with each vertex weighted 100% to a single bone. Pass your instance transforms to SkinnedEffect.SetBoneTransforms, and draw using SkinnedEffect with WeightsPerVertex = 1. Tada! The same result as shader instancing, but this way works on the phone.
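
To make the replication step a little more concrete, here is a minimal sketch of the bookkeeping involved. It is plain C# with no XNA types, and the class and method names (InstanceReplication, ReplicateIndices, BuildBoneIndices) are made up for illustration, not part of the framework. It assumes 16-bit indices and one bone per instance:

```csharp
using System;

static class InstanceReplication
{
    // Repeats the index list once per instance, offsetting each copy so it
    // refers to that instance's own copy of the model vertices.
    public static ushort[] ReplicateIndices(ushort[] indices, int vertexCount, int instanceCount)
    {
        // The largest index produced is vertexCount * instanceCount - 1,
        // which must still fit in 16 bits.
        if ((long)vertexCount * instanceCount > ushort.MaxValue + 1)
            throw new ArgumentException("Too many replicated vertices for 16-bit indices.");

        var result = new ushort[indices.Length * instanceCount];

        for (int instance = 0; instance < instanceCount; instance++)
        {
            int baseVertex = instance * vertexCount;

            for (int i = 0; i < indices.Length; i++)
                result[instance * indices.Length + i] = (ushort)(baseVertex + indices[i]);
        }

        return result;
    }

    // Produces one bone index per replicated vertex: every vertex belonging
    // to copy N of the model is bound to bone N (with a weight of 100%).
    public static byte[] BuildBoneIndices(int vertexCount, int instanceCount)
    {
        if (instanceCount > 256)
            throw new ArgumentException("A byte bone index can address at most 256 instances.");

        var result = new byte[vertexCount * instanceCount];

        for (int v = 0; v < result.Length; v++)
            result[v] = (byte)(v / vertexCount);

        return result;
    }
}
```

For a single triangle replicated twice, ReplicateIndices(new ushort[] { 0, 1, 2 }, 3, 2) yields 0, 1, 2, 3, 4, 5 and BuildBoneIndices(3, 2) yields 0, 0, 0, 1, 1, 1. You would write those bone indices into the bone index channel of the replicated vertices, with the matching weight set to 1, before handing the instance transforms to SkinnedEffect.SetBoneTransforms. The number of instances per draw is also capped by how many bones SkinnedEffect accepts, which is likely a tighter limit than the byte range checked here.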


VFetch instancing in Game Studio 4.0

If your game is exclusive to Xbox, you can still use VFetch instancing, but I don’t know why you would want to. The new DrawInstancedPrimitives API is generally a better choice.


Hardware instancing in Game Studio 4.0

Our old hardware instancing API, which was exclusive to Windows and shader model 3.0, has been replaced with a new DrawInstancedPrimitives API:

  • No need to replicate vertex or index data
  • Store model vertices in one vertex buffer
  • Store instance transforms in a second vertex buffer
  • Render any number of instances in a single draw call
  • When using shader model 3.0 on Windows, this maps directly to the native DX9 hardware instancing APIs
  • When using earlier shader models on Windows, we emulate this functionality, so everything still works the same, just not as fast
  • On Xbox, we use cunning magic (powered by unicorns) to make this work fast with the same behavior as Windows
  • This API requires custom shaders, so cannot be used on Windows Phone

Here is the instancing shader from one of my unit tests (simplified to remove irrelevant things like lighting computations):

    float4x4 WorldViewProj;

    // The per-instance world matrix arrives as four float4 texture coordinates.
    void InstancingVertexShader(inout float4 position : POSITION0,
                                in float4x4 world : TEXCOORD0)
    {
        position = mul(mul(position, transpose(world)), WorldViewProj);
    }

The test creates a simple cube model, and also a DynamicVertexBuffer using a custom vertex type which encodes a 4×4 matrix as a set of four float4 texture coordinates:

    VertexDeclaration instanceDecl = new VertexDeclaration
    (
        new VertexElement(0,  VertexElementFormat.Vector4, VertexElementUsage.TextureCoordinate, 0),
        new VertexElement(16, VertexElementFormat.Vector4, VertexElementUsage.TextureCoordinate, 1),
        new VertexElement(32, VertexElementFormat.Vector4, VertexElementUsage.TextureCoordinate, 2),
        new VertexElement(48, VertexElementFormat.Vector4, VertexElementUsage.TextureCoordinate, 3)
    );

The instanced drawing code is now pretty simple:

    instanceVertexBuffer.SetData(instanceTransformMatrices, 0, numInstances, SetDataOptions.Discard);
    graphicsDevice.SetVertexBuffers(modelVertexBuffer, new VertexBufferBinding(instanceVertexBuffer, 0, 1));
    graphicsDevice.Indices = indexBuffer;

    graphicsDevice.DrawInstancedPrimitives(PrimitiveType.TriangleList, 0, 0,
                                           modelVertexBuffer.VertexCount, 0,
                                           indexBuffer.IndexCount / 3,
                                           numInstances);

Comments (21)

  1. mike says:

    Very nice. I've always thought that instancing was much more of a pain than it needed to be. The new API looks just right to me.

  2. Renaud Bédard says:

    "When using earlier shader models on Windows, we emulate this functionality, so everything still works the same, just not as fast"

    I'd be curious to know what magic happens in that case. Does it just do as many draw calls as there are instances? Or something in-between that still gives reasonable performance?

    Also, on Xbox, does the Unicorn Processing ™ allow as many instances per pass as one wishes, or are we locked to shader constants somehow? I know it happens under the hood, but it matters when you try to give something like 1000 instances.

    In any case, the new API looks FANTASTIC. There's no way I'm porting Fez to 4.0, but I wish it was like that from the beginning. 😀

  3. ShawnHargreaves says:

    > I'd be curious to know what magic happens in that case. Does it just do as many draw calls as there are instances?

    Aye. Nothing clever there, but it can still be handy to use the same codepath even if you don't get the perf benefits on lower spec machines.

    > Also, on Xbox, does the Unicorn Processing ™ allow as many instances per pass as one wishes, or are we locked to shader constants somehow?

    There's no limit, I guess other than how big a vertex buffer you have memory for to hold the instance matrices.

  4. Alex says:

    The new API looks great, Shawn. My only concern so far is the changes that may affect deferred shading. But we'll see, maybe they are not that big of a deal.

  5. Bunkerbewohner says:

    That's great! The DrawInstancedPrimitives API really simplifies mesh instancing a lot. I will definitely make use of that. I don't even have to change much of my code to use it then. Thanks!

  6. David Black says:

    It would be interesting and useful to know more about the unicorns…

    i.e. I would imagine some sort of shader re-writing to emulate the instancing API with vfetch. But does this happen dynamically or statically? Is this the reason for the effect format change (or just additional validation info)? How does this impact size/performance? I can see potential problems from either shader bloat or dynamic generation when first using instancing with an effect…

  7. Tim James says:

    Great sounding API there! I'd love to know how you run your unit tests.

  8. Great, however do we have to use TEXCOORD0 through 3?

    My own instancing scheme uses POSITION1 through POSITION4. This would sit a lot better if you are mixing with existing vertex data.

  9. Actually, ignore me, I just looked up the VertexBufferBinding class in the MSDN. 🙂

    Thank you for abstracting stream frequency up like that. It was a right nightmare emulating it myself (on the 360 🙂

  10. "This API requires custom shaders, so cannot be used on Windows Phone"

    Huh. I was thinking that Windows Phone did support custom shaders but did not expose them to programmers, in which case this API should have been able to use a custom shader internally like with the skinned model, dual texture, etc. shaders. Was I mistaken?

  11. Martin Caine says:

    This looks pretty cool, just getting into instancing now and this looks like it'll simplify the whole thing a great deal. My only question is when can we expect to have this on the Xbox with XNA? :-p

  12. Nick says:

    One quick question

    Should it have a paired API that does not require an index buffer? Say:

    DrawInstancedPrimitives (acts like DrawPrimitives)

    DrawIndexedInstancedPrimitives (which is what is already provided)

    Thank you.

  13. ShawnHargreaves says:

    We do not support non-indexed instanced rendering. The number of situations where that would be useful is very low (and they can all be emulated using a 1 -> 1 index buffer), so it didn't seem worth putting in the time to implement that.

  14. spongman says:

    can we have a way to construct a VertexDeclaration without having to do all the Marshal.SizeOf() stuff for the VertexElement constructor?

    Surely, for most cases, the VertexElement should know how big its offset increment should be (based on its type). And the VertexDeclaration should be able to sum the offset increments itself. Obviously, there should be a way to customize the element offsets, but it seems that in probably 99% of the cases, these can be derived.

  15. StatusUnkown says:

    Hi Shawn,

    I've been spending all day getting instancing working. I would highly appreciate if the instanceFrequency property of VertexBufferBinding was better documented. I'm still not entirely sure how it maps to the DX9 stream frequency APIs (which somehow make more sense to me…). Certainly the docs as they stand are very misleading – "The number of instances to draw in each draw call" naturally had me set the instance count. It was only after carefully going over your sample code above that I tried setting it to 1.

    Anyway. I've got some basic instancing working, however looking at PIX, I'm seeing the following (on DX10 class hardware):

    965 <0x05110048> IDirect3DDevice9::SetStreamSource(1, 0x0022BCD8, 0, 0) 5137202823
    966 <0x05110048> IDirect3DDevice9::DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, 7, 0, 12) 5137219281
    967 <0x05110048> IDirect3DDevice9::SetStreamSource(1, 0x0022BCD8, 64, 0) 5137260143
    968 <0x05110048> IDirect3DDevice9::DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, 7, 0, 12) 5137274898
    969 <0x05110048> IDirect3DDevice9::SetStreamSource(1, 0x0022BCD8, 128, 0) 5137285681
    970 <0x05110048> IDirect3DDevice9::DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, 7, 0, 12) 5137315193
    971 <0x05110048> IDirect3DDevice9::SetStreamSource(1, 0x0022BCD8, 192, 0) 5137325408
    972 <0x05110048> IDirect3DDevice9::DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, 7, 0, 12) 5137340731
    973 <0x05110048> IDirect3DDevice9::SetStreamSource(1, 0x0022BCD8, 256, 0) 5137350946
    974 <0x05110048> IDirect3DDevice9::DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, 7, 0, 12) 5137359459
    975 <0x05110048> IDirect3DDevice9::SetStreamSource(1, 0x0022BCD8, 320, 0) 5137367972
    976 <0x05110048> IDirect3DDevice9::DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, 7, 0, 12) 5137376485
    977 <0x05110048> IDirect3DDevice9::SetStreamSource(1, 0x0022BCD8, 384, 0) 5137385565
    978 <0x05110048> IDirect3DDevice9::DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, 7, 0, 12) 5137393511
    979 <0x05110048> IDirect3DDevice9::SetStreamSource(1, 0x0022BCD8, 448, 0) 5137402591


    There are no calls to SetStreamSourceFreq. Clearly it's hitting fall-back code here. Is this a limitation of the beta, or am I doing something wrong?


  16. ShawnHargreaves says:

    StatusUnknown: what shader model are you using? True hardware instancing requires shader model 3.0.  If you are using 2.0 shaders, that will force the fallback emulation rendering path.

  17. StatusUnkown says:

    Quite right. I'm auto-generating instancing vertex shaders, so they were mostly vs_2_0 like their source techniques. Worked fine in 3.1. 🙂

    I've been very impressed with the improvements in XNA 4. The extra state validation alone has meant I've cut out a tonne of code

  18. Eric Cosky says:

    It would be great to see the example updated to include the best practices technique implemented for the Windows Phone.

    Thanks for the great info here.

  19. Eric Cosky says:

    Here's an example of the SkinnedEffect technique in case anyone else is looking, crumpledcode.blogspot.com/…/shader-instancing-on-wp7-using.html

  20. CircleZebra says:

    Ha!  With a little help from the forums, it turns out that if you want this to work for multiple MeshParts, you need to unset the vertex buffer for each Draw call. It seems setting the indexed vertex buffer is what resets the loop index back to zero or something.

    So if you have multiple mesh parts, add this first line:


    // Tell the GPU to read from both the model vertex buffer plus our instanceVertexBuffer.


  21. Hi Shawn,

    > If your game is exclusive to Xbox, you can still use VFetch instancing, but I don't know why you would want to. The new DrawInstancedPrimitives API is generally a better choice

    In a non-general case, I've been finding significant cpu usage for large numbers of instances, even when in static buffers (details here: forums.create.msdn.com/…/83535.aspx). Any chance of putting us out of our misery guessing what DrawInstancedPrimitives does on xbox that might cause this? 🙂 Would you recommend a custom vfetch implementation for large instance counts?