Santa’s production line

I oversimplified when I described the GPU as a single elf named George.

In fact, a modern graphics card has a complex pipeline with hundreds of elves working in parallel. Just as the CPU records drawing commands into a buffer, which the GPU then processes while the CPU is free to get on with other work, each of these internal GPU pipeline elves reads input data from a buffer, does some computation, then writes output data to another buffer that is consumed by a different elf further down the chain.

This lets us subdivide the concept of being "GPU bound" based on which particular elf is causing the bottleneck. In the same way that optimizing your CPU code makes no difference if you are GPU bound, successfully optimizing GPU rendering depends on knowing which part of the pipeline you are trying to speed up.

So what exactly does happen inside the GPU? The details vary from card to card, but these are the most important stages:

  1. The vertex fetch unit reads vertex data from memory
  2. The vertex shader processes this data
  3. The rasterizer works out which pixels are covered by each triangle
  4. The pixel shader calculates the color of each pixel
  5. The texture fetch unit looks up any textures that were requested by the pixel shader
  6. The depth/stencil unit reads, tests, and updates the depth buffer
  7. The framebuffer stores the final output color, and applies alpha blending

Any of these may be your performance bottleneck, and it is tremendously useful to find out which. For instance, if we learn our game is limited by vertex shader processing, we know to optimize that rather than wasting time trying to reduce the number of texture fetches. Or if we are limited by pixel shading, we could increase the number of triangles in our models without affecting the framerate!

So what factors affect the performance of each pipeline stage?

  1. vertex fetch
    • number of vertices
    • size of each vertex
    • whether vertices are well ordered for cache coherency
  2. vertex shader
    • number of vertices
    • length of vertex shader program
    • whether triangle indices are well ordered for cache coherency
  3. rasterizer
    • number of pixels rendered
    • number of interpolator values passed from vertex shader to pixel shader
  4. pixel shader
    • number of pixels rendered
    • length of pixel shader program
  5. texture fetch
    • number of pixels rendered
    • how many texture lookups per pixel
    • amount of texture data read from memory
      • mipmapped textures have way better cache coherency
      • DXT textures are smaller than uncompressed formats
    • type of filtering
      • anisotropic is the most expensive
      • trilinear is usually only a little slower than bilinear
      • bilinear and point sampling are often identical
  6. depth/stencil
    • number of pixels rendered
    • whether multisampling is used
    • read/write vs. read-only mode
  7. framebuffer
    • number of pixels rendered
    • whether multisampling is used
    • size of each framebuffer pixel (including MRT)
    • read/write (alpha blending) vs. write-only (opaque)

To identify the bottleneck, we need some way of altering just one of these contributing factors, without changing our CPU code in any significant way (if a change affected CPU performance as well as GPU performance, it could invalidate our results).

Try running your game in a tiny resolution, say 100x50. This makes no difference to the CPU, vertex fetch, or vertex shader performance. Does the framerate improve?
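As a sketch of this first test: in an XNA game, you can shrink the backbuffer before the graphics device is created (here `graphics` is assumed to be the usual `GraphicsDeviceManager` field from the project template).

```csharp
// Diagnostic sketch: render at a tiny resolution so that per-pixel
// work (rasterizer, pixel shader, texture fetch, depth, framebuffer)
// becomes nearly free, while CPU and per-vertex costs are unchanged.
graphics.PreferredBackBufferWidth = 100;
graphics.PreferredBackBufferHeight = 50;
graphics.ApplyChanges();
```

If the framerate is identical at 100x50, your bottleneck is clearly not in any of the per-pixel stages.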

If reducing the resolution does not affect performance (and assuming you are not CPU bound), your limiting factor must be vertex processing. You can speed up both vertex fetch and vertex shading by using fewer triangles in your models, or you could try to simplify the vertex shader. If you think vertex fetch might be the problem, run your models through a custom processor and use VertexChannelCollection.ConvertChannelContent to compress your vertex data into a PackedVector format. Normalized101010 is good for normals, and you can often get away with HalfVector2 for texture coordinates.
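A minimal sketch of such a custom processor (the class name `CompressedVertexProcessor` is made up for illustration; the override and `ConvertChannelContent` calls follow the standard XNA content pipeline API):

```csharp
// Build-time sketch: a custom model processor that packs normals into
// Normalized101010 and texture coordinates into HalfVector2,
// shrinking each vertex and easing vertex fetch bandwidth.
[ContentProcessor]
public class CompressedVertexProcessor : ModelProcessor
{
    protected override void ProcessVertexChannel(
        GeometryContent geometry, int vertexChannelIndex,
        ContentProcessorContext context)
    {
        base.ProcessVertexChannel(geometry, vertexChannelIndex, context);

        VertexChannelCollection channels = geometry.Vertices.Channels;
        string channelName = channels[vertexChannelIndex].Name;

        if (channelName == VertexChannelNames.Normal())
            channels.ConvertChannelContent<Normalized101010>(vertexChannelIndex);
        else if (channelName == VertexChannelNames.TextureCoordinate(0))
            channels.ConvertChannelContent<HalfVector2>(vertexChannelIndex);
    }
}
```

The precision loss is usually invisible for normals and texture coordinates, as discussed in the comments below.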

If reducing the resolution speeds things up, you must be limited by one of the pixel processing stages.

Try setting SamplerStates[n].MipMapLevelOfDetailBias to 4 or 5. If you do this right, and assuming you are using mipmaps (if not, add mipmaps straight away and watch your performance improve!), your textures will become blurry. If this boosts performance, you are limited by texture fetch bandwidth, in which case you can speed up your game by enabling DXT compression or using fewer/smaller textures.
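For example, a quick hack in your draw code (the loop bound of 16 is an assumption; bias only the samplers your shaders actually use):

```csharp
// Diagnostic sketch: bias every sampler toward smaller mipmap levels.
// The scene will look blurry; if it also runs faster, texture fetch
// bandwidth is your bottleneck.
for (int i = 0; i < 16; i++)
{
    GraphicsDevice.SamplerStates[i].MipMapLevelOfDetailBias = 4;
}
```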

Try changing all your pixel shaders so they just return a constant color. This will affect both pixel shader and texture fetch performance, but since we already tested texture fetching, we can deduce that if this boosts the framerate while the mipmap bias did not, the bottleneck must be pixel shader processing.
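A constant-color stub might look like this in HLSL (a sketch, not your real shader; swap it in for each pixel shader entry point):

```hlsl
// Diagnostic sketch: a do-nothing pixel shader. Rendering cost per
// pixel drops to almost zero ALU work and zero texture fetches.
float4 ConstantColorPS() : COLOR0
{
    return float4(1, 0, 1, 1);
}
```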

Still here? That means your bottleneck must be #3, #6, or #7.

Try enabling multisampling. If this makes no difference, you are limited by the rasterizer.
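In XNA this is a one-line change (again assuming `graphics` is your `GraphicsDeviceManager`):

```csharp
// Diagnostic sketch: request a multisampled backbuffer. Multisampling
// adds depth/stencil and framebuffer work but little rasterizer work,
// so an unchanged framerate points at the rasterizer.
graphics.PreferMultiSampling = true;
graphics.ApplyChanges();
```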

Try changing the framebuffer to a smaller pixel format such as SurfaceFormat.Bgr565. If this speeds things up, you are limited by framebuffer writes.
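For example:

```csharp
// Diagnostic sketch: a 16 bit backbuffer halves the bytes written
// per pixel compared to a 32 bit format, so a speedup here means
// framebuffer bandwidth is the limit.
graphics.PreferredBackBufferFormat = SurfaceFormat.Bgr565;
graphics.ApplyChanges();
```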

Otherwise, by process of elimination, it must be the depth/stencil.


I was going to write some suggestions about how to optimize for each possible bottleneck, but this post is long enough already. Please ask if you have questions about that...

Comments (5)

  1. CGomez says:

The hard work, advice, and knowledge that you share in these posts are nothing short of a public service to hobbyists. I know you and the other hard-working bloggers on the XNA team have work to do, but thank you for the time and effort to share your experience.

  2. Ultrahead says:

    Great post. Just a few minor things, if I may:

"The texture fetch unit looks up any textures that were requested by the pixel shader". From shader model 3 onward, this can also be used by the vertex shader (known as "Vertex Texture Fetch").

"Try running your game in a tiny resolution, say 100×50". But still in fullscreen, otherwise you’ll also be moving from "swapping" to "copying" the backbuffer (the former is faster, of course).

    My five, sir.

  3. fn2000 says:

What are the actual downsides of using vertex data compression? I assume we lose vertex data precision, but how does it really impact real-life applications?



  4. ShawnHargreaves says:

> What are the actual downsides of using vertex data compression? I assume we lose vertex data precision, but how does it really impact real-life applications?

    If your app is bound by vertex fetch memory bandwidth, reducing the size of the vertex data could give a big speed boost.

If your app is bound by some other part of the GPU, this will make no difference to performance (but it will still obviously save memory).

  5. ashish aggarwal says:

    This is insanely simplified and so easy to understand.
    Thank you Sir!
