A story about premature optimization

Performance never ceases to be a fascinating topic. This forum thread inspired Eli to write about foreach and the garbage collector, and also got me thinking about an ancient debate between the merits of multiplication and shifting.

Join me as we travel backward into the depths of time...

In the good old days of 320x200 VGA mode, graphics code often looked like this:

    void PutPixel(int x, int y, unsigned char color)
        videoMemory[x + y * 320] = color;

This sort of thing tended to happen a lot, easily a hundred thousand times per frame, so performance was important. Back in the early 90s, compilers were stupid and computers were slow. Specifically, integer multiplication was slow, which caused problems for programs that did a lot of multiplying of y coordinates by screen widths.

If your screen width happened to be a power of two, the multiply could be replaced by a left shift operation, which did the same thing an order of magnitude more quickly. But the standard VGA resolution was 320 pixels wide, not a power of two at all.

Some smart people noticed that 320 is the sum of 256 and 64, both of which are powers of two. Using algebra, you can rewrite the multiplication:

    offset = y * 320;
    offset = (y * 256) + (y * 64);
    offset = (y << 8) + (y << 6);

Hurrah! This version ran noticeably faster on the 286 and 386 processors of the day, because not only was shifting faster than multiplication, but it could also take advantage of the crazy LEA instruction provided by the x86 architecture.

So what is wrong with doing this? It turns out there are several reasons why replacing multiplies with shifts is not always a smart plan.

Simplicity. If you saw some code that evaluated (y << 8) + (y << 6), would you immediately be able to guess what it did? No, I wouldn't either, despite being familiar with this particular trick. Programs are already hard enough to understand without things like this to get in the way!

Maintainability. What happens a year from now when you want to update your drawing code to support the shiny new high resolution 640x480 mode? You could normally just search and replace "320" with "640", or perhaps change your constants to a variable so the same program can support multiple resolutions, but if you used this crazy shifting trick, it will be much harder to find every place where your code needs to be updated.

Compilers improve. Not long after people started using the shifting trick, the gcc compiler built this into its x86 code generator. Any time you multiplied a variable by a constant, the compiler would automatically figure out how to decompose that multiply into a series of shifts and additions. It would then count how many clock cycles that code would take, and compare this with how many clock cycles would be needed for a regular multiply (which can vary depending on the values in question and the exact processor it is running on, but the compiler was smart and knew all the rules). It would then pick whatever implementation was faster, always letting you write the simplest possible code to get the fastest possible result. The funny thing was, people went on manually converting their multiplies into shifts for years after the compiler was able to do it for them!

Hardware improves. On a modern CPU, a multiply instruction will often actually be faster than two shifts followed by an add. This happens for several reasons:

  • The multiplication unit has become smarter and faster.
  • With modern out-of-order execution, even if the multiply is slow this can usually be done in parallel with other work.
  • Again due to out-of-order execution, the shift-and-add pattern is a bad thing because it creates a data hazard between multiple instructions. Out-of-order hardware prefers a single slower instruction to several dependent ones, because it can be analyzed and pipelined more easily.
  • Modern hardware is often actually limited by cache and memory bandwidth rather than computational power, and the single multiply instruction takes up less space in the instruction cache.

If people decided to use the shifting trick, they ended up with confusing code that was hard to maintain. Although this did make their games run faster at the time, consider what happened for the people who ignored this optimization and just went on using a regular multiply:

  • At first their games ran slower than was theoretically possible. For some games that was a problem, but for many it was just fine.
  • They had a much easier time extending their games to run in SVGA resolutions.
  • Pretty soon the compiler got smarter, and their games became faster as a result of this (in many cases even faster than ones people had optimized by hand, because compilers are more patient and have better attention to detail than most humans).
  • As hardware changed the compiler got even smarter, and switched its code generation techniques to do the right thing for the new machines. The programmer never needed to know or care that this was going on.

So what relevance does this ancient story have for the modern XNA developer?

First off, if you are still manually converting your multiplies into shifts, please stop: that might have been a good idea in 1990, but makes no sense at all in 2007!

More importantly, next time you are debating the merits of using for vs. foreach, or the implications of boxing enum values, or the need to use the ref math overloads, stop to ask yourself, how much does this really matter to me? Will my program run visibly better as a result of this change? Am I optimizing for something that will always be true, or just for a quirk of one particular piece of hardware? Is this something that future improvements in compilers or framework code might be able to take care of for me? If I don't do this now, is it something I could go back and add later if it someday turns out to be important?

You can't predict the future, but you won't go too far wrong if you follow one simple rule: "say what you mean in the most direct way possible". Any time you find yourself expressing something in a less direct way because that will run faster, be aware you will need to revisit this optimization every couple of years to make sure it is still valid. It is amazing how often the simple and direct version will have improved to the point where it is as fast if not faster than your more complex optimized code.

Comments (6)

  1. JoelMartinez says:

    I have to wholeheartedly agree.  Confusing code is not worth the perf improvements (in most cases).  I think what needs to happen is that .NET developers in general need to become more acquainted with profiling techniques.

    If we are sure that we can find and optimize perf problems, then we’re likely to not bother with the premature stuff.

  2. Machaira says:

    Gee, this topic seems familiar, like we just talked about it a couple of days ago. 🙂

    It’s incredible how wrong some people are when they suggest optimizations. Thanks for attempting to clear the air. I’m sure people will just continue to believe they know that what they’re doing is good. I just hope I never have to maintain the atrocious code they produce.

  3. korona says:

    Aahh .. this takes me back to when I used to write software 3D rendering code for mode 13h.

    I actually also used the "awesome optimization trick" of using lookup tables for mostly everything. Notably trig functions and, yes, VGA memory offsets. I’d have something like

    int offsetTable[200];

    And use it as such:

    videomemory[x * offsetTable[y]] = color;

    It was LIGHTNING FAST!! 😉

  4. Mykres Space says:

    Ok here we go with this weeks update… It has been a relatively quite week in the world of XNA; this must

  5. jasond says:

    I agree with everything this post says.

    However, it implies that — for my game being developed under XNA Game Studio, which uses the horrid floating point support within the Compact .NET Framework on the Xbox 360 — I should NOT optimize using fixed point, and should continue to use the floating point support, assuming that this problem with be fixed at some later date.

    However, when floating point code runs literally 60 times as slow (That’s a 50 MHz machine vs. the 3 GHz monster we’re expecting), it severely inhibits what can be achieved.  And the consensus on the XNA Forums is that the Compact .NET Framework is NOT going to be fixed any day soon.

    So, where does that leave us?

    Only one option:

    We have to optimize.  🙁

  6. vicky says:

    I agree with you but a common misconception is that optimized code is necessarily more complicated, and that therefore optimization always represents a trade-off. However, in practice, better factored code often runs faster and uses less memory as well. In this regard, optimization is closely related to re factoring, since in both cases we are paying into the code so that we may draw back out again later if we need to.

Skip to main content