Repro, Man

Written communication is amazingly difficult. Even when you’re aware of
some of the pitfalls, as I was when I wrote my last post about the disk full
save error in Word, it’s still all too easy to say something that other people
will take to mean something other than what you had intended. Some of the
comments to that post, both here and in other blogs, clearly indicate that I’d
fallen into one of those pitfalls. And it’s a simple pitfall. I’d used a word
that has a particular meaning within this context to some readers while it has
a completely different meaning in other contexts.

I’d said that we didn’t know that the real problem with the disk is full
error involved hitting the OS’ open file limit. The problem is that the word
“know” can have several shades of meaning such that not knowing something can
span an entire range from having no clue whatsoever to having a very strong
suspicion while still being unable to conclusively prove something as a matter
of fact. Comments I’ve read, both here and elsewhere, reflect this range of
meaning in a rather interesting way. The more someone appeared willing to
attribute cluelessness to my remark, the less that person actually knew or
understood of the process of software development and fixing software bugs.

I point this out because there are those who have an extremely judgmental
attitude that goes beyond simply being rude. There is a tendency among some to
respond to this kind of ambiguity by assuming the worst, or, more accurately,
by choosing a meaning that confirms their own prejudice while giving little or
no consideration to other possible meanings. These are people who will take a
remark out of context, and turn it into a personal attack. To the people who
do this, and, frankly, I think you know who you are, if what you write
personally attacks me, then your words serve no useful purpose other than the
possible gratification of your own egos.

Anyway, I digress. Sorry for the ambiguity. Please allow me to dispel it
by telling the story of another problem that we ran into with
early versions of Jaguar. Jaguar beta testers were running into strange
crashes in Word. We collected a large number of crash logs, all of which
involved a protected memory violation in either GetHandleSize or memmove. The
former is an operating system call; the latter is a compiler intrinsic that
moves data from one location in memory to another. Both of them involve
loading an address and dereferencing the address as a pointer.

Now, there are two things an experienced developer can immediately tell you
about this problem. The first is that some piece of code, somewhere, has
overwritten some data that didn’t belong to it yet wasn’t in a block of
protected memory—in essence, a buffer overrun. The second is that the
actual bug could be anywhere, and not at all related to any of the code that
happened to be executing at the time of the crash. These are nasty bugs,
because figuring out what’s wrong requires a significant amount of sleuthing.

So, we looked for more information about the scenarios that were causing the
crash both by asking users what they’d been doing before the crash and by
having testers try to reproduce the problem based on what users were telling
us. It took a while, but we found two common elements to the problem. The first
is that, at some point, someone clicked on the font drop-down in the formatting
palette (or the font menu itself), and discovered some blank entries on the
menu/drop-down. The other is that the crash most often happened when they did
something that added a file to the most-recently-used (MRU) list on the File menu.

That information was enough for me to take a look at the order in which
various chunks of memory were being allocated, and, sure enough, Word’s
internal data structure for the font menu was being allocated immediately
before the handle (a “handle” being a doubly-indirect reference to a piece of
data in memory allowing the data to be moved around without requiring all
references to be updated after the move) for the data structure that contains
the files on the MRU list. This was a major clue. If the “GetHandleSize”
crash involved the MRU handle, then it’s likely that some piece of code for
maintaining the font menu had a buffer overrun.
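To make the idea of a handle concrete, here is a minimal sketch in Python. It is a toy model, not the Mac OS Memory Manager: the ToyHeap class and its method names are hypothetical, invented only for illustration. Callers hold an index into a table of master pointers, so a block can move without any caller's reference going stale.

```python
class ToyHeap:
    """Toy relocatable heap (hypothetical model, not a real OS API).

    A handle is an index into a master pointer table, so callers never
    hold a block's address directly.
    """

    def __init__(self, size: int) -> None:
        self.memory = bytearray(size)
        self.master = []   # one [offset, length] entry per handle
        self.top = 0       # next free offset for new blocks

    def new_handle(self, data: bytes) -> int:
        """Allocate a block right after the previous one; return its handle."""
        if self.top + len(data) > len(self.memory):
            raise MemoryError("toy heap is full")
        offset = self.top
        self.memory[offset:offset + len(data)] = data
        self.top += len(data)
        self.master.append([offset, len(data)])
        return len(self.master) - 1

    def size(self, handle: int) -> int:
        """Look up the block's length through the master pointer."""
        return self.master[handle][1]

    def read(self, handle: int) -> bytes:
        offset, length = self.master[handle]
        return bytes(self.memory[offset:offset + length])

    def move(self, handle: int, new_offset: int) -> None:
        """Relocate a block; only the master pointer needs updating."""
        offset, length = self.master[handle]
        if new_offset < 0 or new_offset + length > len(self.memory):
            raise ValueError("destination is outside the heap")
        self.memory[new_offset:new_offset + length] = self.memory[offset:offset + length]
        self.master[handle][0] = new_offset
```

Allocating two blocks back to back places the second immediately after the first, which is the kind of adjacency described above. After a call to move, read still returns the same data through the same handle. The toy does not track free space, so it is only good for illustrating the indirection.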

So, I poked around in the font menu code, and found what would very likely
be the cause of the problem. Apple added some APIs that would allow
applications to ascertain whether or not the contents of the font menu had
changed since the last time the application had asked for the font menu. For
Word X, we added support for this so that we could update the font menu.
Unfortunately, there was another, rather ancient, piece of code that assumed
the contents of the font menu would never change. This piece of code had what
would be a buffer overrun should the contents of the font menu ever grow as a
result of calling this new API.
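The shape of this kind of bug can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Word's code: the arena layout, the 8-byte capacity, and every function name below are hypothetical. A block is sized once, a neighbouring block sits right after it, and a later write that trusts the old size spills into the neighbour's size field.

```python
import struct

FONT_OFFSET = 0
FONT_CAPACITY = 8                          # bytes reserved when the menu was first built
MRU_OFFSET = FONT_OFFSET + FONT_CAPACITY   # neighbour allocated right after it


def make_arena() -> bytearray:
    """Build a toy arena whose neighbouring block starts with a size field of 4."""
    arena = bytearray(32)
    struct.pack_into("<I", arena, MRU_OFFSET, 4)
    return arena


def neighbour_size(arena: bytearray) -> int:
    """Read the neighbour's size field, as an unrelated caller would later."""
    return struct.unpack_from("<I", arena, MRU_OFFSET)[0]


def store_fonts_unchecked(arena: bytearray, data: bytes) -> None:
    """Mimics old code that assumes the data never outgrows the block."""
    arena[FONT_OFFSET:FONT_OFFSET + len(data)] = data


def store_fonts_checked(arena: bytearray, data: bytes) -> None:
    """Same write, but refuses data that no longer fits."""
    if len(data) > FONT_CAPACITY:
        raise ValueError("font data no longer fits; reallocate first")
    arena[FONT_OFFSET:FONT_OFFSET + len(data)] = data
```

Writing 8 bytes with store_fonts_unchecked leaves the neighbour's size field intact. Writing 10 bytes silently corrupts it, and nothing goes wrong until some unrelated code reads that field later, which is why the crash site says so little about where the bug lives.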

OK, so I wrote a fix for this, but we still had a problem. I’d come up with
this fix not by actually reproducing the problem and watching the buffer
overrun happen under a debugger, but by deducing where the buffer overrun
occurred. What we lacked was a consistently reproducible set of steps that
showed how the crash was happening for actual users. Without that set of steps
under which we could consistently reproduce the crash, we had no way to prove
that this fix resolved the problem that users were seeing. We strongly
suspected that we had a fix, but we didn’t know for sure.

The problem was complicated by the fact that no one, neither the testers who
could get the crash to happen occasionally nor the users who had reported the
crash and had given us information about what they were doing, had actually
done something that would have caused the contents of the font menu to change.
No one had added fonts to either the system font folder or their user library
folder. Something else was afoot, and we needed some help from Apple to figure
it out.

So, one of the Word program managers sent a piece of e-mail off to one of
our contacts at Apple asking about what circumstances in Jaguar would cause
this new API to report that the contents of the font menu had changed—the
point being that we wanted to be able to consistently reproduce the scenario.
Unfortunately, this program manager had worded the e-mail in such a way as to imply that we
didn’t expect the contents of the font menu to change after calling this API
whose specific purpose is to tell the app that the font menu had changed. The
Apple contact, quite understandably, asked why on earth we would not expect the
contents of the font menu to change after calling this API.

At that point, I replied by saying that we certainly expected the contents of the font menu to change, but that
there was an ancient piece of code in Word that didn’t expect the contents of
the font menu to change. I said, “It sometimes helps to remember that we hired
interns this past summer who were younger than Word’s code base,” to which the
Apple rep replied, “Gosh! That must be some kind of milestone for an app!”

Well, we got the help we needed, were able to come up with a consistent repro scenario, and were able to prove that the fix I’d
implemented did, indeed, resolve the problem that users had been seeing.
Thanks to some very hard work by a lot of people, both in and outside of
Microsoft, we were able to have a service release of Word ready the day Jaguar
shipped, and very few users ever actually ran into this particular problem.

The disk is full on save problem, however, was plagued by our inability to
come up with consistent repro scenarios. While we’d suspected the OS open file
limit was the root cause for some time, the question that always dogged us was
why, in those rare instances where we did come up with a repro scenario, we
could not reproduce the problem while running under the debugger. It
wasn’t until we learned that the debugger bumped the OS open file limit for the
debugged process that we knew the answer to that question, and could proceed.
Moreover, as we’d fix one scenario in one particular way, finding other repro
scenarios got increasingly difficult.
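For readers who want to see the limit in question, here is a small sketch of how a process on a Unix-like system can ask for its own open file limit today. It uses Python's standard resource module, which is not available on Windows. The helper names are hypothetical, and this is not how the Word team investigated the problem.

```python
import resource  # Unix-only standard library module


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit: int) -> str:
    """Render a limit, treating RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


def report() -> str:
    """Build a one-line summary suitable for logging."""
    soft, hard = open_file_limits()
    return f"open files: soft={describe(soft)} hard={describe(hard)}"
```

Logging report() from inside the process, once when launched normally and once when launched under a debugger, is the sort of comparison that would expose a difference like the one described above.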

The point of all of this is that fixing a bug requires two things. The
first, of course, is some way to diagnose the actual problem and come up with a
fix. But the second, and not so obvious, is the need for a consistent repro
scenario that enables us to prove that the fix actually resolves the problem.
Without both, we’re really only guessing, and even though it might be a very
well educated guess, it’s still only a guess.

We resolved the font menu buffer overrun in September of 2002. Since then,
the age of Word’s code base has surpassed the legal drinking age in all 50
states. With a code base that old, you simply can’t afford to release a fix
based on even an educated guess. The risk that you’ve introduced a problem
worse than the one you’re trying to fix is simply too high. At that point, the
only alternative is to roll the fix into the next major release, which gives us
enough time to pound on the fix in a wide variety of scenarios such that we
have sufficient confidence in the efficacy of the fix.

Now, why tell these stories? Why even bother to air out our dirty laundry
like this? Certainly not because I’m a masochist. Rather, I’m a pragmatist.
Word has a number of quirks, funky behaviors, and downright nasty bugs (of the
latter I’m certain, even though I can’t prove it). Fixing them requires a
cooperative effort between users and those of us who work on the product. We
need as much information as you can give, but I want people to understand the
difference between actual information and speculation about the causes of the
problem.

I can pretty much guarantee that you won’t be able to out-speculate those of
us who work on the product. For any given problem that you’ve encountered,
there’s a pretty good chance that we have suspects that never occurred to you.
Sgt. Joe Friday’s most famous line is apropos. Speculation, especially
speculation that is highly judgmental, is utterly useless to me, and a waste of
both your time and mine. I gotta have a repro, man.

And for all of those people blogging about Word, don’t think we don’t read
what you’ve had to say. In fact, right now, there’s a team of developers,
testers and users who are scouring the Internet and various newsgroups for any
reports of problems with Office 2004. They’re known as the MacSWAT team. They
include all of the Mac Office MVPs—people who really have both my
admiration and undying gratitude.

 

Rick