Multiple Ways to Render Point Sprites in DX11

In Direct3D 11 one can render a point sprite in several different ways.

Presentations that explain how to port DX9 apps to DX10 and DX11 commonly say that point sprites are best done in the geometry shader (GS). In many cases this is not the fastest way of doing screen-aligned point sprites.

Below are the timings from my tests on an AMD 6-something in my desktop. Each point sprite comes in from a vertex buffer as a point, then it’s expanded into a triangle or a quad in the vertex shader (VS), in the GS, or using the tessellator.

1. GS and VS triangles and quads.
In this method the vertex shader passes the vertex data to the geometry shader, which expands it into either triangles or quads.

2. GS only triangles and quads.
In this method the vertex shader is empty and the vertex is loaded by the GS. Given a raw byte view of the vertex buffer and the index of the point, the GS first loads the vertex data manually, then proceeds like the geometry shader in method 1.

3. Tessellator triangles and quads.
The tessellator can be set up to generate triangles and quads. In this method, the vertex shader reads the vertex as usual, then passes the data to the tessellator, which expands the vertex into a point sprite.

4. Manual vertex load in the VS that generates triangles and quads.
By drawing 3 times as many vertices for triangles, and 6 times as many for quads, we can use SV_VertexID to work out the point index and the corner index. Then we can manually load the vertex from the raw byte view of the vertex buffer and move it to the corner given by the corner index. No GS is required in this case.

5. Using instancing to load vertices, VS triangles and quads.
Same as method 4, only instead of loading the vertex data manually, we rely on geometry instancing to do it for us.
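
To make the index arithmetic in methods 4 and 5 concrete, here is a minimal CPU-side sketch in C++ of what the vertex shader could compute from SV_VertexID. It is not the code used in the tests. All names (decodeVertexId, SpriteVertex, the corner tables) are made up for illustration, and the corner layouts are assumptions: the quad is two triangles covering the [-1, 1] square, and the single triangle is just one possible shape that fully covers that square.

```cpp
#include <cstdint>

// Hypothetical result of decoding one vertex ID.
struct SpriteVertex {
    std::uint32_t pointIndex;  // which point in the vertex buffer
    std::uint32_t corner;      // which corner of the expanded sprite
    std::uint32_t byteOffset;  // start of that point in the raw byte view
    float dx, dy;              // corner offset in units of the sprite half-size
};

// Assumed layout: two triangles (6 vertices) covering the [-1, 1] square.
constexpr float kQuadCorners[6][2] = {
    {-1.f, -1.f}, {-1.f, 1.f}, {1.f, -1.f},
    {1.f, -1.f},  {-1.f, 1.f}, {1.f, 1.f},
};

// Assumed layout: one triangle large enough to cover the [-1, 1] square.
constexpr float kTriCorners[3][2] = {
    {-1.f, -1.f}, {-1.f, 3.f}, {3.f, -1.f},
};

// vertsPerSprite must be 3 (triangles) or 6 (quads).
// strideBytes is the size of one point in the vertex buffer.
inline SpriteVertex decodeVertexId(std::uint32_t vertexId,
                                   std::uint32_t vertsPerSprite,
                                   std::uint32_t strideBytes) {
    SpriteVertex v{};
    v.pointIndex = vertexId / vertsPerSprite;   // method 4: point index
    v.corner     = vertexId % vertsPerSprite;   // method 4: corner index
    v.byteOffset = v.pointIndex * strideBytes;  // where a manual load would read
    if (vertsPerSprite == 6) {
        v.dx = kQuadCorners[v.corner][0];
        v.dy = kQuadCorners[v.corner][1];
    } else {
        v.dx = kTriCorners[v.corner][0];
        v.dy = kTriCorners[v.corner][1];
    }
    return v;
}
```

With instancing (method 5), the division and the manual load go away: if each instance is one point and each instance draws 3 or 6 vertices, SV_VertexID is already the corner index and the per-instance input data is the point.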

Results for 2.10 million particles in each run are below, timing in milliseconds along the Y axis, size in pixels along the X axis.

So, from the data above, using a GS isn’t the fastest way to render point sprites. In fact, a more or less competent tessellator in a GPU can be faster.

If you’re after rendering circles or stars or something like that, the tessellator can actually produce a perfect circle from a quad. I haven’t tested the performance of that method.

Comments (3)

  1. mihu says:

    So, why does relative performance depend on the size of point sprites? Could it be because of some non-obvious global scheduling issues?

  2. mihu says:

    Ok, now I looked closer at the graph and see that the biggest gap is between triangle- and quad-rendering methods groups. Other than that, differences are rather negligible.

  3. ivanne says:

    The bottleneck moves from the vertex stages to the pixel stage of the pipeline as the size of the triangles increases.

    Large triangles are less efficient than large quads because we need to kill more pixels, but small triangles are more efficient than small quads because we output more vertices for quads.

    Apart from the tri/quad difference, the difference in methods is very much measurable and noticeable. In my tests I can't see why anyone would want to use GS for the given scenario. Vertex Shader Instancing is as flexible and is consistently 10-15% faster than GS.
