Some Thoughts on Program Efficiency

Program efficiency at the programmer level is something of a complicated issue – in part because it is contextual. That is, it is hard to say something that holds true in all cases. [This is why having a good profiler is essential.] For example, the same implementation may be adequate, if embarrassing (should anyone actually look), under some circumstances and the cause of a crippling bottleneck under others. Consider the following constructor definition:

foobar::foobar( const bar b, const foo f )


      m_b = b;

      m_f = f;


This is a terribly naïve implementation:

·         The class parameters b and f are passed by value rather than by reference. This results in unnecessary temporaries being created and destructed.

·         The member class objects m_b and m_f are assigned to rather than initialized, resulting in their unnecessary default construction.

·         The constructor itself is a good candidate for inlining.

·         (And yes, the naming conventions leave something to be desired J)

An absolutely better implementation is the following:

inline foobar::

foobar( const bar &b, const foo &f )

    : m_b( b ), m_f( f )


but … well, does it matter in terms of the project? For argument’s sake, let’s say a summer intern wrote this.The bottom line: It runs correctly and there is no security breach. Let’s fantasize and say the project is under severe schedule pressure. So, on the one hand, we just feel an immense relief that the foobar component is functional and checked in. (Hey, it demos!)

If bar and foo are large, complex classes with costly copy semantics, and if there are lots of foobar objects generated in our application, then this is likely to be a significant performance gain. On the other hand, if foobar class objects are only rarely created, then the improvements to the code are not significant, and the code is no big deal. That is, the code is better but, overall, nothing has really changed. At what point do we stop making code better?

So, my point is: making rules in itself isn’t enough. Knowing where to rigorously apply those rules matters as well. As conventional wisdom has it: profile!

Sometimes, there is no real software solution, and anything coded at the application level is mostly just trying not to get in the way. For example, I was technical lead on the ToonShooter project at DreamWorks Feature Animation. It intended to replace a very old Objective-C Macintosh-based [at least 4 years old!] pencil-test playback system for animators with a spanking brand new C++ implementation under Linux. The goal was two-fold: (1) to deliver an actual system either for Spirit [already in production] or, if that proved unfeasible, for Sinbad, at that point in artistic development, and (2) to componentized the class hierarchies for subsequent reuse in other applications – playback, compositing, file-format, math and geometry, etc.  

At the heart of the system is a playback engine that must display the images at a minimum rate of 24 frames per second. If it can’t do that, then there is no system. The actual display of the images was done using an OpenGL command, glDrawPixels. Under SGI, it just hummed. Our initial test under Linux with a promised high-performance X Server and OpenGL implementation was a mind-boggling 2 seconds per frame. Turns out the OpenGL implementation was not making use of the underlying hardware, but doing hardware emulation in software.That’s like trying to play point guard without being able to dribble with your left hand.

This was before we even wrote a line of code. Although our project was to be implemented in C++, its success ultimately rested on finding direct hardware support of image blitting [that was not glamorous enough to get proper attention back then]. Until that was solved, our application couldn’t actually be deployed – at least not under Linux. Solving the display issue was a management issue. The point was, it blindsided everyone; we had jumped before confirming that the pool contained water.  

[Having access to the Linux OS source code was a mixed bag. When Andrew discovered a flaw in the clock in his work on synchronization, no one seemed ready to delve in and try their hand at fixing it. On the other hand, Andrew studied the sound drivers and found a number of bugs, so the source code in this case did benefit from his superior expertise, and we were able to quickly overcome a faulty implementation. We originally hoped for a digital image solution for bringing the artists’ drawing to the computer, but there was no implementation of a sufficiently fast protocol [that is probably the wrong word [this was my first task as a technical lead in which many facets of the project were beyond my technical competence, and I owe Ed Leonard a great deal of gratitude for his overcoming my reluctance to take on such a position] and we could not get any commitment as to when I think it is called firewire would come to the OS. Since no one in-house was in a position to do the implementation itself, we ended up with a analog video solution.]  

[HP rode in on a white horse, as it happened, up ending various other vendors challenged to come in and provide a solution [unfortunately, Microsoft was not an active player back then]. It felt truly heroic at the time, and HP have justly flourished within that domain space since. We delivered a componentized system in time for it to be successfully used by Spirit, saving them millions of dollars in animators’ time. It was probably my most satisfying experience as a developer. Unfortunately, the movie did not do well at the box office, and the company has since stopped doing traditional 2D animation so I guess the ToonShooter’s shelf life is over.]

Compositing proved a fertile ground for optimization, but a somewhat humbling one, as it turned out. In ToonShooter, once the pencil drawings are captured and the sound file imported, the animator plays back the scene, checking image flow and correct lip-synching. Levels are turned on and off as the different elements are added to or removed from viewing. Compositing is done on the fly as it delivers frames to the playback engine.

Even when we got the playback at speed, we had to insure that the compositing was able to deliver the image. Obviously, in part the success of this depends on the number of levels requiring compositing. For example, one level worked without a hitch [J]. Actually, we supported – I think we tried up to 3 levels without dropping frames in the prototype – hey, that’s as far as we pushed it! [We only had to demo it at that point!]. But we knew this was something critical we need to get a handle on before it became an issue. [Some animators we talked to in effects were talking of 30-50 levels – I mean, we’re talking glints of sunshine the size of a thumbnail, or the march of wavy lines across water, but still. It would have fallen on my watch if we couldn’t handle it.

For a grey-scale rough animation pencil drawing captured at 8 bits, compositing can be reduced to choosing the minimum of two values. That is, for each pixel, which is an unsigned char, retain the lower value (black is 0, white is 255). However, for a 640×480 image, that’s still 3K * n-1 levels traversals comparing pixel values. 16 levels would probably be stretching the just in-time delivery of frames. We kept poking at the code to see how we might further tweak it.

What I had forgotten is that sometimes the answer to optimizing code is finding a way to actually not execute it.

Part of the problem was my unfamiliarity with the domain. When our current head of technology looked at the code, he had an obvious answer: Rather than doing the arithmetic, do an index into a table of the 255 values. At first I didn’t get it, because I didn’t realize there is a fixed set of constant values that each pixel could represent. Once I got it, the solution was obvious.  (It always is. I’m just not used to being the one with my mouth hanging open.) In fact, it’s a canonical solution anecdotally related in Jon Bentley’s excellent text, Writing Efficient Programs, published by Prentice-Hall.

There is actually a trick to doing this in C++. What you don’t want to do is build up the table in a constructor – even an inline constructor. Why? Well, a constructor is a run-time entity. The look-up table has to be independent of any class object, and so it is made a class static member. This is, under any kind of constructor scheme, the table gets `statically initialized’ prior to the beginning of main(). So even if you hard-coded each value, setting up the table is a run-time activity.  It is significantly more efficient to code a program to write out the table as a built-in array of constant expressions into a program text file and then compile that as part of your build.

// Fill the min table

      int min_lut[size][size];

      for ( int i = 0; i < size; i++ ) {

            for ( int j = i; j < size; j++ )

                    min_lut[i][j] = min_lut[j][i] = i;



      // Write out the tables

      ofstream text_file( “” );

      text_file << “// This file is created by …”;

      text_file << “#include <Compositor.h>\n\n”;


      // this is a static member of Compositor …

      text_file << “unsigned char Compositor::minLUT[256][256] = {\n” );


      for ( int i = 0; i < size; i++ ) {

            text_file << “{ ” ;

            for ( int j = 0; j < size; j++ ) {

                    text_file << min_lut[i][j] << “, ” ;

      // etc.


So. We can speed up/optimize image compositing up to a point. After that point, it levels out. Once we reach that point, the only workable solution is to avoid doing any compositing at all – or at least reduce it to a subset of the required compositing. [Actually, an even better solution is to move the operations to hardware with cool and nifty video cards, etc.]

The first step in achieving this is to cache composited images. Before we invoke the compositor, we check that the resulting image is not already available in the cache. Failing an exact match, we look for a best match – perhaps 2 or 3 of the 4 required levels have already been composited, and so we only composite the missing level(s). To achieve this in a reasonable fashion, we need to be able to efficiently tag the different composited images (by frame by active levels).

For this caching strategy to be effective, of course, the mechanism must be (considerably) cheaper than the process of compositing itself. The bitset and map classes make image tagging and retrieval reasonably simple. [For some reason, the gcc Linux compiler at that time didn’t support bitsets – apparently there was some contention about whether they warranted inclusion in the ISO standard under development. Back then, you could not use Linux for serious development without using gcc so I ended up having to incorporate the free SGI implementation of bitsets. ]

When an image is about to be composited, a bit is turned on for each level that is active. That value can then be turned into an unsigned long, which can be used as a unique key in a map:

  map<unsigned long, Image*>;

  int main()


    bitset< 16 > mbit;

    CManager pcomp;

    Image *pim = 10000;

    unsigned long search_key;


    for ( int frame = 1; frame < 12; ++frame ){


        for ( int ix = 0; ix < 16; ++ix ) {

            if ( ix % 2 ) {

               mbit.set( ix );

              unsigned long key = mbit.to_ulong();

                pcomp.add_to_cache( frame, key, pim );





    // …



CManager::add_to_cache() frame: 11 key: 2     addr: 0x2710

CManager::add_to_cache() frame: 11 key: 10    addr: 0x2710

CManager::add_to_cache() frame: 11 key: 42    addr: 0x2710

CManager::add_to_cache() frame: 11 key: 170   addr: 0x2710

CManager::add_to_cache() frame: 11 key: 682   addr: 0x2710

CManager::add_to_cache() frame: 11 key: 2730  addr: 0x2710

CManager::add_to_cache() frame: 11 key: 10922 addr: 0x2710

CManager::add_to_cache() frame: 11 key: 43690 addr: 0x2710


frame # 11

        0000000000000010 :: 0x2710

        0000000000001010 :: 0x2710

        0000000000101010 :: 0x2710

        0000000010101010 :: 0x2710

        0000001010101010 :: 0x2710

        0000101010101010 :: 0x2710

        0010101010101010 :: 0x2710

        1010101010101010 :: 0x2710

I never actually deployed this. I had simply been asked to come up with a possible strategy for speeding up compositing on a software level just in case [I was moving on after the initial deployment of the system]. As it happened, moving the compositing to hardware was the optimal solution.


The two most effective optimization techniques, I suspect, are procrastination and cheating. For example, Martin Davis, in his text The Universal Computer, mentions a Pascal [!] program to compute Leibniz’s series for pi/4. Running on a 486 33MHz PC, summing 1 million terms took 50 seconds, 10 million took 8 minutes. Two years later, the same program was run on a Pentium 200 MHz machine, and “the times were reduced to 4 seconds and 40 seconds respectively.”


A compelling example of cheating is the original Doom game engine. Rather than calculate the hallway lighting for various hallways in real time, they simply pre-generated bitmaps. Level of detail is always a wildly-used cheat: things that are beyond a certain level of nearness do not need to be rendered with the same level of detail. My favorite example is detecting when one object intercepts with another [I can’t thing of the formal description of it off hand]. In a game like Mario Kart, there is none – the key element of the game is speed, and the audience [mostly children] find the occlusion [if that is the right word] of a truck or tree with Mario’s kart quite amusing, so why bother eating up cycles preventing it? During the making of the Hunchback of Notre Dame, Kiran Joshi developed crowd software that represented different individuals in cycles of movement in order to fill up the streets of Paris. Edge detection was necessary to the degree that having the hand of one entity piercing the face or body of a neighbor would spoil the illusion of reality necessary for the cartoon. This was managed by simple bounding boxes that marked a coarse-rectangle separating one character from another. It required considerably more work than did Mario Kart, but was not a significant amount of work overall. Not so in the case of the lovely Fantasia Segment, Toy Soldier, in which Dave Tonnesen’s work with simulating the physics of the movement of cloth was coupled with the separately generated movement of the ballerina. The required verisimilitude was labor intensive, but anything short of that would have been insufficient. Without knowing up front what the performance characteristics of a piece of software are required to be, it is very difficult to quantify that you have met them, and to know where to allocate precious programming resources.

Comments (10)

  1. coder says:

    Assuming x86, a little loop written in MMX assembly would probably beat the speed of your lookup table algorithm by 2-3 times 🙂

  2. Samuel Druker says:

    The formal name is "collision detection", which happens before "collision response". The former is computation geometry and usually involves things like convex hulls and interpolation. The latter is physics (well, kinetics) and is resolved with some sort of rigid body simulation and impulse-momentum calculation.

    Full on 3D 6DOF collision detection is wicked expensive no matter how you cut it. It is an area rife with opportunities for optimization by cheating.

  3. Catatonic says:

    Great article! I wonder what a possible "just in case" scenario would be. OpenGL hardware ought to do this stuff without sweating. Was hardware support in Linux a bit too shaky to be relied on back then?

  4. Annita Krapp says:

    Yes but by the time you dick about in MMX assembly you would have cost us more.

  5. asdf says:

    Ouch, a LUT to compute mins. You could of gotten mins really cheap using:

    inline unsigned min_pix(int a, int b)


    return a+(((b-a)>>(CHAR_BIT*sizeof(int)-1))&(b-a));


    Comes to around 3 cycles per pixel on an x86.