The MotoGP particle system worked in a similar way to this XNA Framework sample, using a vertex shader to animate point sprites with near-zero CPU overhead. We spawned particles when the bikes drove off the road, during wet weather, and for victory celebration burnouts. Here are a couple of examples:
The problem was fill rate. Xbox-1 had a unified memory architecture, which meant the CPU, vertex data, texture fetches, framebuffer, and depth buffer were all competing for the same limited bus bandwidth. MotoGP ran in 640x480 at (mostly 🙂) 60 frames per second, and was one of the few Xbox-1 games that enabled 2x multisampling. This meant we were bottlenecked by memory bandwidth. Multisampling was only possible because we aggressively compressed our data and had little overdraw, but a dense cloud of alpha blended particles could bring us to our knees.
Our solution was to draw particles to a 320x240 rendertarget, using premultiplied alpha, then scale the result back up to cover the screen, relying on bilinear filtering to smooth over the missing details. Because we were drawing at half width, half height, and without multisampling, this only cost 1/8 the bandwidth it would have otherwise required.
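The 1/8 figure can be checked by counting the framebuffer samples touched by one full-screen layer of particles. This is a deliberately simplified model: it ignores texture fetches, compression, and anything else that affects real bandwidth, and the constant names are mine rather than anything from the game code.

```python
# Samples touched per full-screen layer of particles (simplified model).
FULL_W, FULL_H, MSAA_SAMPLES = 640, 480, 2   # main scene, 2x multisampled
HALF_W, HALF_H = 320, 240                    # particle rendertarget, no multisampling

full_res_samples = FULL_W * FULL_H * MSAA_SAMPLES
half_res_samples = HALF_W * HALF_H

# Half width * half height * half the samples per pixel = 1/8.
ratio = half_res_samples / full_res_samples
print(ratio)  # 0.125
```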
But what if a particle goes behind a solid part of the environment? The regular depth buffer will not work here, because it is sized 640x480 while we are drawing at 320x240.
The solution lay in the same unified memory architecture that caused our original bandwidth problems. Direct3D does not provide any way to resize depth buffers, but because we had low level access to a fixed hardware platform, we were able to hack the graphics driver, creating a 32 bit texture descriptor that pointed at the same physical memory as our 32 bit depth buffer. The texture bit layout was different to that of the depth buffer, but as long as we were careful not to perform any operations that would alter these bit patterns (such as texture filtering or pixel shader manipulations) we could now use the texturing hardware to manipulate the contents of our depth buffer.
Final rendering sequence:
- Draw the main scene at 640x480
- Use texture/depth trickery to scale the 640x480 depth buffer down to 320x240
- Draw particles to a 320x240 rendertarget, testing against the smaller depth buffer
- Scale the 320x240 particle image back up to 640x480, blending over the top of the main scene
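To make the sequence concrete, here is a tiny CPU-side sketch of the same four steps on a 4x4 "screen". It is an illustration only, not the Xbox code: the function names are hypothetical, colors are single grayscale floats, the depth downscale simply point samples every second value (an assumption, chosen because filtering would alter the depth values), and the upscale uses nearest-neighbour sampling where the game relied on bilinear filtering.

```python
def downscale_depth(depth, factor=2):
    # Point sampling only: copy every `factor`-th value unchanged, never average.
    return [row[::factor] for row in depth[::factor]]

def draw_particle(target, depth_small, x, y, z, intensity, alpha):
    # target[y][x] holds (premultiplied_color, coverage), cleared to (0.0, 0.0).
    if z >= depth_small[y][x]:
        return  # behind solid geometry: depth test fails
    color, coverage = target[y][x]
    src = intensity * alpha  # premultiply the source color
    target[y][x] = (src + color * (1 - alpha), alpha + coverage * (1 - alpha))

def composite(scene, target, factor=2):
    # Nearest-neighbour upscale, then premultiplied "over":
    # out = particle + scene * (1 - coverage)
    out = []
    for y, row in enumerate(scene):
        out_row = []
        for x, dst in enumerate(row):
            color, coverage = target[y // factor][x // factor]
            out_row.append(color + dst * (1 - coverage))
        out.append(out_row)
    return out

# Step 1: "main scene" at full resolution. Depth 1.0 is far away;
# a wall at depth 0.2 fills the right half of the screen.
scene = [[0.5] * 4 for _ in range(4)]
depth = [[1.0, 1.0, 0.2, 0.2] for _ in range(4)]

# Step 2: scale the depth buffer down to half resolution.
depth_small = downscale_depth(depth)

# Step 3: draw particles at depth 0.5 into the cleared low-res target.
target = [[(0.0, 0.0)] * 2 for _ in range(2)]
for x in range(2):
    draw_particle(target, depth_small, x, 0, z=0.5, intensity=1.0, alpha=0.5)

# Step 4: scale back up and blend over the main scene.
result = composite(scene, target)
for row in result:
    print(row)
```

The particle on the left passes the depth test and brightens a 2x2 block of the final image, while the one on the right is rejected by the wall. That blocky 2x2 footprint at the occlusion boundary is the same effect that shows up as jaggies where solid objects pass in front of particles.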
This trickery was sometimes visible in the final game. Particles normally have soft edges, so you didn't notice their low resolution, but when they intersected the environment that created a hard edge where the jaggies could be pretty obvious. It looked fine as long as particles were in front of the environment (the most common case) but not so good when solid objects went in front of particles. Check out the aliasing where the smoke is occluded by the left arm of this rider:
But hey. The occasional artifact was a small price to pay for dense clouds of smoke at 60 frames a second!
It is interesting to consider the performance implications of this technique. Without it, the cost of drawing N particles would be proportional to:
N*640*480*2
(the *2 is because of 2x multisampling)
With our optimization, the cost becomes:
ScaleDownDepthBuffer + ClearParticleRenderTarget + N*320*240 + BlendWithMainScene
The BlendWithMainScene part was pretty much free, as we were able to combine it with an existing postprocess operation. And clearing the particle rendertarget was very cheap. But scaling down the depth buffer was quite expensive, because it had to read the entire 640x480 multisampled depth buffer in a way that was not particularly cache coherent.
Looking at these equations, you can see that when the number of particles is small, this 'optimization' is actually going to make things slower! Sure, we sped up the 'N' part of the equation, but when N is small, that is outweighed by the cost of scaling down the depth buffer. We are only saving time when N gets large enough to dwarf the fixed setup cost.
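A small cost model shows where the crossover sits. The fixed setup cost below is a made-up number (one pass over the multisampled depth buffer plus one clear of the small target, in the same "samples touched" units as the equations). The real cost of the depth downscale was never reduced to a single figure like this, so treat the output as an illustration of the shape of the tradeoff, not a measurement.

```python
FULL_COST_PER_PARTICLE = 640 * 480 * 2   # full resolution, 2x multisampled
HALF_COST_PER_PARTICLE = 320 * 240       # half resolution, no multisampling

# Hypothetical fixed cost: read the 640x480x2 depth buffer + clear 320x240.
ASSUMED_FIXED_COST = 640 * 480 * 2 + 320 * 240

def full_res_cost(n):
    return n * FULL_COST_PER_PARTICLE

def half_res_cost(n, fixed_cost=ASSUMED_FIXED_COST):
    return fixed_cost + n * HALF_COST_PER_PARTICLE

def break_even(fixed_cost=ASSUMED_FIXED_COST):
    # Solve n * FULL = fixed + n * HALF for n.
    return fixed_cost / (FULL_COST_PER_PARTICLE - HALF_COST_PER_PARTICLE)

print("break-even N:", break_even())
for n in (0, 1, 2, 10, 50):
    print(n, full_res_cost(n), half_res_cost(n))
```

Below the break-even point the low-res path is slower, and above it the saving grows linearly with N.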
Maybe we should have been clever and only used the 320x240 rendertarget when there were many visible particles?
We wanted to run at 60 frames per second all the time, even when there were many particles on screen. To maintain a steady framerate, we had to optimize for that worst case. It would be a waste of time to optimize for situations where there were no particles, because once we made our worst case run fast enough, these easier situations would automatically be fine. The goal is to never drop frames, because that is objectionable to the player. There are no bonus points for running faster than the monitor refresh rate!
Moral of this story: optimize for your worst case, not the average. A technique that performs consistently ok is better than one that performs superbly most of the time, but then occasionally spikes and drops frames.
In math terms: prefer techniques with a low standard deviation over ones that have a lower mean but higher variance.
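Here is that comparison with some invented frame times (these are not measurements from MotoGP). The "spiky" technique has the better average, but it is the only one that blows the 60 Hz frame budget.

```python
import statistics

FRAME_BUDGET_MS = 1000 / 60  # roughly 16.67 ms per frame at 60 Hz

# Hypothetical frame times in milliseconds, invented for illustration.
steady = [15.0, 15.5, 15.0, 15.5, 15.0, 15.5]
spiky = [8.0, 8.0, 8.0, 8.0, 8.0, 22.0]

def dropped_frames(times):
    # A frame is dropped when it takes longer than one refresh interval.
    return sum(1 for t in times if t > FRAME_BUDGET_MS)

for name, times in (("steady", steady), ("spiky", spiky)):
    print(
        name,
        "mean:", statistics.mean(times),
        "stdev:", statistics.pstdev(times),
        "worst:", max(times),
        "dropped:", dropped_frames(times),
    )
```

Strictly speaking, what the player notices is the worst case crossing the budget. Low variance is a useful proxy for that, but it is the maximum frame time, not the standard deviation itself, that decides whether a frame gets dropped.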