Measure Twice, Optimize Once

When I move offices, it takes 16 moving boxes to hold my junk (I've got a lot of it).  One reason is all the books I've collected over the years.

Many of them were read once and discarded; others are treasures I come back to time and time again.  One that I revisit regularly (at least once a year) is Jon L. Bentley's "Writing Efficient Programs", a book that is sadly out of print.

In that book, I have a bookmark (actually it's a ripped-off piece of tissue) that's been there for at least 10 years now, because I continually refer back to the story it marks.  As Jon tells it:

Victor Vyssotsky enhanced a FORTRAN compiler in the early 1960s under the design constraint that compilation time could not be noticeably slower. A particular routine in his program was executed rarely (he estimated during design that it would be called in about one percent of the compilations, and just once in each of these) but was very slow, so Vyssotsky spent a week squeezing every last unneeded cycle out of the routine. The modified compiler was fast enough. After two years of extensive use the compiler reported an internal error during compilation of a program. When Vyssotsky inspected the code he found that the error occurred in the prologue of the "critical" routine, and that the routine had contained this bug for its entire production life. This implied that the routine had never been called during more than 100,000 compilations, so the week Vyssotsky put into prematurely optimizing it was completely wasted.

I don’t know how many times I’ve opened the book up and referred to this story when talking to co-workers about performance.

As an example, when we were merging the Exchange Server and MCIS products, we held a series of architectural reviews in which each of the two groups described its internal architecture to the other.  During the one covering the Exchange 5.5 POP3 server, I explained how it rendered the user's email message to a temporary temporary file, then used the TransmitFile API to send it to the remote client.
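For those who haven't seen the pattern, here's a minimal sketch of that design in Win32 C.  It's a hypothetical illustration, not the actual Exchange code (the function name and file name are invented).  The "temporary temporary" trick is to create the file with FILE_ATTRIBUTE_TEMPORARY, which hints to the cache manager that it should keep the contents in memory rather than write them to disk, plus FILE_FLAG_DELETE_ON_CLOSE so the file cleans itself up:

    #include <winsock2.h>
    #include <windows.h>
    #include <mswsock.h>

    /*
     * Hypothetical sketch: render a message into a "temporary temporary"
     * file and hand it to TransmitFile.  Assumes clientSocket is already
     * connected; real code would use GetTempFileName rather than a
     * hardcoded file name.
     */
    BOOL SendRenderedMessage(SOCKET clientSocket,
                             const char *renderedMessage,
                             DWORD messageLength)
    {
        HANDLE hFile = CreateFileA("message.tmp",
                                   GENERIC_READ | GENERIC_WRITE,
                                   0,              /* no sharing */
                                   NULL,
                                   CREATE_ALWAYS,
                                   FILE_ATTRIBUTE_TEMPORARY |
                                       FILE_FLAG_DELETE_ON_CLOSE,
                                   NULL);
        if (hFile == INVALID_HANDLE_VALUE)
            return FALSE;

        DWORD bytesWritten;
        BOOL ok = WriteFile(hFile, renderedMessage, messageLength,
                            &bytesWritten, NULL);
        if (ok)
        {
            /*
             * Rewind, then let the kernel push the file contents (still
             * sitting in the filesystem cache; with the temporary
             * attribute they are typically never written to disk)
             * straight out the socket.  A length of 0 means "send the
             * whole file".
             */
            SetFilePointer(hFile, 0, NULL, FILE_BEGIN);
            ok = TransmitFile(clientSocket, hFile, 0, 0, NULL, NULL, 0);
        }

        CloseHandle(hFile);  /* DELETE_ON_CLOSE removes the file here */
        return ok;
    }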

The dev lead for MCIS expressed disbelief at the design: "What on earth are you guys doing?  You're making dozens of system calls to write the file to the disk, just to be able to call TransmitFile!  There's no way that's going to be fast enough for production systems."  Instead, he suggested (rather strongly) that we render the message into user-mode memory, then convert the server to use non-blocking sockets and keep the kernel socket buffer as full as possible (without blocking the user-mode threads).
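Roughly, his proposal would have looked something like the following (again a hypothetical sketch, not the code we actually ended up writing): put the socket into non-blocking mode and keep handing data to the kernel until the socket buffer fills.

    #include <winsock2.h>

    /*
     * Hypothetical sketch of the suggested design: the rendered message
     * lives in user-mode memory, and a non-blocking send loop keeps the
     * kernel socket buffer as full as possible.  A production server
     * would return to its event loop on WSAEWOULDBLOCK instead of
     * parking the thread in select().
     */
    BOOL SendFromMemory(SOCKET clientSocket, const char *buffer, int length)
    {
        u_long nonBlocking = 1;
        if (ioctlsocket(clientSocket, FIONBIO, &nonBlocking) != 0)
            return FALSE;

        int offset = 0;
        while (offset < length)
        {
            int sent = send(clientSocket, buffer + offset,
                            length - offset, 0);
            if (sent == SOCKET_ERROR)
            {
                if (WSAGetLastError() != WSAEWOULDBLOCK)
                    return FALSE;

                /* Kernel socket buffer is full; wait for it to drain. */
                fd_set writable;
                FD_ZERO(&writable);
                FD_SET(clientSocket, &writable);
                if (select(0, NULL, &writable, NULL, NULL) == SOCKET_ERROR)
                    return FALSE;
                continue;
            }
            offset += sent;
        }
        return TRUE;
    }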

To humor him, we made the changes he suggested, then ran some of our performance tests on the new bits.  And what do you know: it didn't make a bit of difference in our actual throughput (it may actually have been slightly slower; I'm not 100% sure at this point).

So of course we backed out all those changes and went back to our old mechanism.

The fundamental mistake the MCIS dev lead made was assuming that the number of system calls an application makes is an accurate predictor of its throughput.  He misunderstood a couple of things.  First, he was operating under the mistaken belief that system calls (and thus ring transitions) on NT take a significant amount of time.  Second, he failed to realize that there was significant synergy between temporary temporary files and the TransmitFile API – because the file data was never written to disk, TransmitFile was able to read the data it sent on the socket directly from the filesystem cache.

Bottom line: Understand where your bottlenecks are, and understand the underlying performance characteristics of your system, BEFORE you start to optimize.

The carpenters have it right: Measure Twice, Cut Once.  In our case, it’s more like: Measure Twice, Optimize Once; Wash, Rinse, Repeat.