Repro, Man





Written communication is amazingly difficult. Even when you’re aware of
some of the pitfalls, as I was when I wrote my last post about the disk full
save error in Word, it’s still all too easy to say something that other people
will take to mean something other than what you had intended. Some of the
comments to that post, both here and in other blogs, clearly indicate that I’d
fallen into one of those pitfalls. And it’s a simple pitfall. I’d used a word
that has a particular meaning within this context to some readers while it has
a completely different meaning in other contexts.

I’d said that we didn’t know that the real problem with the disk is full
error involved hitting the OS’ open file limit. The problem is that the word
“know” can have several shades of meaning such that not knowing something can
span an entire range from having no clue whatsoever to having a very strong
suspicion while still being unable to conclusively prove something as a matter
of fact. Comments I’ve read, both here and elsewhere, reflect this range of
meaning in a rather interesting way. The more someone appeared willing to
attribute cluelessness to my remark, the less that person actually knew or
understood of the process of software development and fixing software bugs.

I point this out, because there are those who have an extremely judgmental
attitude that goes beyond simply being rude. There is a tendency among some to
respond to this kind of ambiguity by assuming the worst, or, more accurately,
by choosing a meaning that confirms their own prejudice while giving little or
no consideration to other possible meanings. These are people who will take a
remark out of context, and turn it into a personal attack. To the people who
do this, and, frankly, I think you know who you are, if what you write
personally attacks me, then your words serve no useful purpose other than the
possible gratification of your own egos.

Anyway, I digress. I apologize for the ambiguity. Please allow me to dispel
it by telling the story of another problem that we ran into with
early versions of Jaguar. Jaguar beta testers were running into strange
crashes in Word. We collected a large number of crash logs, all of which
involved a protected memory violation in either GetHandleSize or memmove. The
former is an operating system call; the latter is a compiler intrinsic that
moves data from one location in memory to another. Both of them involve
loading an address and dereferencing the address as a pointer.

Now, there are two things an experienced developer can immediately tell you
about this problem. The first is that some piece of code, somewhere, has
overwritten some data that didn’t belong to it yet wasn’t in a block of
protected memory—in essence, a buffer overrun. The second is that the
actual bug could be anywhere, and not at all related to any of the code that
happened to be executing at the time of the crash. These are nasty bugs,
because figuring out what’s wrong requires a significant amount of sleuthing.

So, we looked for more information about the scenarios that were causing the
crash both by asking users what they’d been doing before the crash and by
having testers try to reproduce the problem based on what users were telling
us. It took a while, but we found two common elements to the problem. The
first is that, at some point, someone clicked on the font drop-down in the
formatting palette (or the font menu itself), and discovered some blank entries
on the menu/drop-down. The other is that the crash most often happened when they
did something that added a file to the most-recently-used list on the File menu.

That information was enough for me to take a look at the order in which
various chunks of memory were being allocated, and, sure enough, Word’s
internal data structure for the font menu was being allocated immediately
before the handle (a “handle” being a doubly-indirect reference to a piece of
data in memory allowing the data to be moved around without requiring all
references to be updated after the move) for the data structure that contains
the files on the MRU list. This was a major clue. If the “GetHandleSize”
crash involved the MRU handle, then it’s likely that some piece of code for
maintaining the font menu had a buffer overrun.
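
To make the idea of a handle concrete, here is a minimal sketch in Python. It is a toy model, not the Mac OS Memory Manager: the class `HandleTable` and its methods are hypothetical names chosen for illustration, and `get_handle_size` only borrows its name from the real GetHandleSize call. The point it shows is that callers hold an index into a table of master pointers, so a block can move without any caller having to update its reference.

```python
class HandleTable:
    """Toy relocatable heap: a handle is an index into a master pointer table."""

    def __init__(self, size):
        self.heap = bytearray(size)   # the raw memory
        self.master = []              # master pointers: [offset, length] per block
        self.top = 0                  # next free offset in the heap

    def _reserve(self, length):
        """Claim `length` bytes at the top of the heap and return their offset."""
        if self.top + length > len(self.heap):
            raise MemoryError("toy heap is full")
        offset = self.top
        self.top += length
        return offset

    def new_handle(self, data):
        """Allocate a block, copy data into it, and return its handle."""
        offset = self._reserve(len(data))
        self.heap[offset:offset + len(data)] = data
        self.master.append([offset, len(data)])
        return len(self.master) - 1

    def get_handle_size(self, handle):
        """First indirection only: read the length from the master pointer."""
        return self.master[handle][1]

    def deref(self, handle):
        """Both indirections: handle -> master pointer -> data."""
        offset, length = self.master[handle]
        return bytes(self.heap[offset:offset + length])

    def move(self, handle):
        """Relocate a block; only the master pointer is updated."""
        offset, length = self.master[handle]
        new_offset = self._reserve(length)
        self.heap[new_offset:new_offset + length] = self.heap[offset:offset + length]
        self.master[handle][0] = new_offset
```

In this model, a caller that created a handle with `new_handle(b"report.doc")` can call `move` on it and still get the same bytes back from `deref`. It also suggests why a stray write that lands on a master pointer is so damaging: every later lookup through that handle goes through the corrupted entry.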

So, I poked around in the font menu code, and found what would very likely
be the cause of the problem. Apple added some APIs that would allow
applications to ascertain whether or not the contents of the font menu would
have changed since the last time the application had asked for the font menu. For
Word X, we added support for this so that we could update the font menu.
Unfortunately, there was another, rather ancient, piece of code that assumed
the contents of the font menu would never change. This piece of code had what
would be a buffer overrun should the contents of the font menu ever grow as a
result of calling this new API.
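
The kind of bug described above can be sketched in a few lines of Python. This is a simulation under stated assumptions, not Word's code: the layout (a four-slot font table followed immediately by an eight-byte MRU record), the constants, and the function names are all hypothetical. The checked variant is only one possible fix; the actual fix that went into Word isn't described here.

```python
import struct

FONT_SLOTS = 4   # capacity the old code assumed would never change (hypothetical)
SLOT_SIZE = 4    # bytes per font entry, here a 32-bit font ID (hypothetical)


def make_arena():
    """Lay out a font table followed immediately by an MRU record."""
    mru_offset = FONT_SLOTS * SLOT_SIZE
    arena = bytearray(mru_offset + 8)
    arena[mru_offset:mru_offset + 8] = b"MRU-DATA"
    return arena, mru_offset


def fill_fonts_unchecked(arena, font_ids):
    """The 'ancient' behavior: write every entry and trust the count."""
    for i, font_id in enumerate(font_ids):
        struct.pack_into("<I", arena, i * SLOT_SIZE, font_id)


def fill_fonts_checked(arena, font_ids):
    """One possible fix: refuse to write past the table's capacity."""
    if len(font_ids) > FONT_SLOTS:
        raise ValueError("font list grew past the table's capacity")
    fill_fonts_unchecked(arena, font_ids)
```

With four font IDs the MRU record is untouched. With five, `fill_fonts_unchecked` silently overwrites the first four bytes of the neighboring record, and nothing goes wrong until some unrelated code reads that record later, which is why the crash shows up far from the bug.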

OK, so I wrote a fix for this, but we still had a problem. I’d come up with
this fix not by actually reproducing the problem and watching the buffer
overrun happen under a debugger, but by deducing where the buffer overrun
occurred. What we lacked was a consistently reproducible set of steps that
showed how the crash was happening for actual users. Without that set of steps
under which we could consistently reproduce the crash, we had no way to prove
that this fix resolved the problem that users were seeing. We strongly
suspected that we had a fix, but we didn’t know for sure.

The problem was complicated by the fact that no one, neither the testers who
could get the crash to happen occasionally nor the users who had reported the
crash and had given us information about what they were doing, had actually
done something that would have caused the contents of the font menu to change.
No one had added fonts to either the system font folder or their user library
folder. Something else was afoot, and we needed some help from Apple to figure
it out.

So, one of the Word program managers sent a piece of e-mail off to one of
our contacts at Apple asking about what circumstances in Jaguar would cause
this new API to report that the contents of the font menu had changed—the
point being that we wanted to be able to consistently reproduce the scenario.
Unfortunately, this PGM had worded the e-mail in such a way as to imply that we
didn’t expect the contents of the font menu to change after calling this API
whose specific purpose is to tell the app that the font menu had changed. The
Apple contact, quite understandably, asked why, on earth, would we not expect the contents of the font menu to change after
calling this API?

At that point, I replied by saying that we, certainly, expected the contents of the font menu to change, but that
there was an ancient piece of code in Word that didn’t expect the contents of
the font menu to change. I said, “It sometimes helps to remember that we hired
interns this past summer who were younger than Word’s code base,” to which the
Apple rep replied, “Gosh! That must be some kind of milestone for an app!”

Well, we got the help we needed, were able to come up with a consistent repro scenario, and were able to prove that the fix I’d
implemented did, indeed, resolve the problem that users had been seeing.
Thanks to some very hard work by a lot of people, both in and outside of
Microsoft, we were able to have a service release of Word ready the day Jaguar
shipped, and very few users ever actually ran into this particular problem.

The disk is full on save problem, however, was plagued by our inability to
come up with consistent repro scenarios, and while we’d suspected the OS file
limit was the root cause for some time, the question that always plagued us was
understanding why we could not reproduce the problem, in those rare instances
where we came up with a repro scenario, while running under the debugger. It
wasn’t until we learned that the debugger bumped the OS open file limit for the
debugged process that we knew the answer to that question, and could proceed.
Moreover, as we’d fix one scenario in one particular way, finding other repro
scenarios got increasingly difficult.
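
The effect of a raised open file limit on a repro can be sketched with Python's standard `resource` module on a Unix-like system. This is an illustration only, not what Word or the debugger actually did: the function name `count_openable_files` and the limits 64 and 128 used below are arbitrary, and the sketch assumes the process's hard limit is at least as large as the soft limit passed in.

```python
import errno
import os
import resource


def count_openable_files(soft_limit):
    """Open files until the per-process limit is hit; return how many succeeded."""
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Assumption: soft_limit does not exceed the hard limit.
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
    fds = []
    try:
        while True:
            try:
                fds.append(os.open(os.devnull, os.O_RDONLY))
            except OSError as exc:
                if exc.errno != errno.EMFILE:
                    raise
                # EMFILE: this process has hit its open file limit.
                return len(fds)
    finally:
        # Close everything we opened and restore the original soft limit.
        for fd in fds:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))
```

Calling `count_openable_files(64)` and then `count_openable_files(128)` runs the same workload twice. The second call gets further before failing, so a scenario that opens enough files to fail under the lower limit can sail through under the higher one. That is the shape of a bug that reproduces for users but not under a debugger that has bumped the limit.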

The point of all of this is that fixing a bug requires two things. The
first, of course, is some way to diagnose the actual problem and come up with a
fix. But the second, and not so obvious, is the need for a consistent repro
scenario that enables us to prove that the fix actually resolves the problem.
Without both, we’re really only guessing, and even though it might be a very
well educated guess, it’s still only a guess.

We resolved the font menu buffer overrun in September of 2002. Since then,
the age of Word’s code base has surpassed the legal drinking age in all 50
states. With a code base that old, you simply can’t afford to release a fix
based on even an educated guess. The risk that you’ve introduced a problem
worse than the one you’re trying to fix is simply too high. At that point, the
only alternative is to roll the fix into the next major release, which gives us
enough time to pound on the fix in a wide variety of scenarios such that we
have sufficient confidence in the efficacy of the fix.

Now, why tell these stories? Why even bother to air out our dirty laundry
like this? Certainly not because I’m a masochist. Rather I’m a pragmatist.
Word has a number of quirks, funky behaviors, and downright nasty bugs (of the
latter I’m certain, even though I can’t prove it). Fixing them requires a
cooperative effort between users and those of us who work on the product. We
need as much information as you can give, but I want people to understand the
difference between actual information and speculation about the causes of the
problem.

I can pretty much guarantee that you won’t be able to out-speculate those of
us who work on the product. For any given problem that you’ve encountered,
there’s a pretty good chance that we have suspects that never occurred to you.
Sgt. Joe Friday’s most famous line is apropos. Speculation, especially
speculation that is highly judgmental, is utterly useless to me, and a waste of
both your time and mine. I gotta have a repro, man.

And for all of those people blogging about Word, don’t think we don’t read
what you’ve had to say. In fact, right now, there’s a team of developers,
testers and users who are scouring the Internet and various newsgroups for any
reports of problems with Office 2004. They’re known as the MacSWAT team. They
include all of the Mac Office MVPs—people who really have both my
admiration and undying gratitude.

 

Rick

Comments (24)

  1. Etienne Travailles says:

    Rick: if the Word code base is so complex that one has to be very careful not to break things when making attempted bugfixes, why did the MacBU alter Word in the 2004 release so that it breaks important add-ons, like EndNote and MathType? A lot of us depend on using EndNote and MathType for production of technical documents for publication with Word. (I can’t upgrade to Word 2004 until the third-parties that make add-ons provide their compatible versions, and it seems that ISI ResearchSoft is going to require a new version, for which I will have to pay.)

  2. (If I can jump in here, Rick…)

    Actually, the change in Office 2004 that makes it incompatible with EndNote is not in Word at all. We had to make a change to another shared library to solve an incompatibility with Panther that would have prevented Office from working on that OS at all. That change was not made lightly — we knew up front that it would mean that EndNote and MathType would fail to work. It was a matter of making tradeoffs — release a product that works with EN/MT but doesn’t work on Apple’s new OS, or release a product that works on Apple’s OS but requires 3rd party vendors to make a change.

    The former basically puts Office dead in the water from the get-go, whereas the latter can be remedied and eventually everybody is happy.

    Schwieb

    MacBU Development Lead

  3. JYC says:

    Is the Word code base really that old?

    This MVP page by John McGhie claims that "Word on the Macintosh is basically Word for Windows re-compiled to run on the Mac," which would mean that it probably doesn’t date back to 1984.

    http://word.mvps.org/FAQs/WordMac/Differences.htm

  4. Rajesh says:

    Nice blog. I fully agree that some bugs take a bit of ingenuity to solve. And I think a person who has worked on that software for a longer time can solve such bugs more readily than a person with less experience (though it may vary from person to person).

    By the way about the disk full error, I just remembered about piece table that there cannot be more than 16 pieces in it. After which the pieces get joined into the normal stream and the counting begins again. And this is when only fast save is enabled?

    Are you talking about the same piece table, or do you create a different one at run time?

  5. Etienne Travailles says:

    Erik: Thanks for your comments. It must be pretty complicated to keep track of the cascade of effects that a single change in the system makes. It would be interesting to know what the prime mover was that caused the change in the library that you mentioned. You see, Word v.X as patched with the latest updates runs _just fine_ under Panther, and the current versions of EndNote and MathType work great with Word v.X. So it wasn’t Panther _per se_ that caused the change that you folks had to make that broke the add-ons. 🙂

  6. Marc Bizer says:

    Rick: your comments illustrate for me the problems of maintaining an extremely old code base. At what point (if ever) does a software team decide that it is no longer worthwhile to do so and rewrite the program from scratch? Has this ever been done at Microsoft?

  7. Scott says:

    Marc,

    I’d say that Windows NT is an example of MS starting over from scratch. Larry Osterman (http://weblogs.asp.net/exchange/articles/85057.aspx) could probably clarify or qualify that statement though.

  8. Etienne Travailles says:

    From what you said here and elsewhere in your blog, you guys at the MacBU would be a lot better off if you would just port Word to the Cocoa API; just start over and use the knowledge and technologies/features that you have developed over the years and make a modern app that you can just drop in place. You’d save a ton of money in the long run, and you’d get an app that would pretty much autoport itself as Apple updates the foundation. You’d also get a chance to rid yourself once and for all of all of the interlocking legacy code that is causing you so much trouble.

    Further, if you lean on Apple to release its Yellow Box API, you would also get a Windows app for free. You could get a single code base, a modern architecture, and freedom from the old stuff. There’s no need to hurry with this since you can keep shippin’ the current products on the Mac and Windows sides. Even if this process takes until 2008 or 2010, you’d win big in the long run and so would your users. The current situation is not good; eventually you will probably find the bugfix process that you describe above to be just too expensive.

  9. Rick Schaut says:

    JYC and Marc, yes the Word code base is really that old, though I’m not sure if there’s any left that’s actually from the initial go-around. During each product cycle, we go through a process known as "refactoring," where we rewrite/clean up portions of the existing code base. In fact, we just refactored Word’s line layout code for Word 2004.

    Etienne, Erik would have a more definitive answer regarding the EndNote/MathType issue, but my recollection as to the initial change is that it involved some security issues. I don’t know the full details as I wasn’t involved in the investigation that pinned the issue down.

    As for the complete rewrite in Cocoa, I’m afraid you’ve lost me entirely. Why would that resolve the problem? The underlying issue stems from the range and/or the complexity of the problems that Word attempts to solve for users. Expressing the solutions in a completely new language doesn’t change the nature of the underlying problem. How does one come to believe that the result won’t introduce a bevy of other problems that need to be solved?

  10. Etienne Travailles says:

    Cocoa is Apple’s object-based API that began life in NeXT’s NeXTStep and OpenStep environment. It uses Objective-C or Java as the main two languages. It has excellent facilities for user interface and typography, just to name two things that I know about. It is possible to get a user interface going in very short work without writing any code at all. I recall Steve Jobs demonstrating this while he was working with NeXT; he put together an app in real time with a modern bells and whistles user interface in front of a live audience.

    By redoing Word from the ground up in Cocoa, you will have the chance to undo the restrictions you have in the current Word code base that arose long ago–the machines and APIs that were targeted then are pretty far removed from those of the present day, under WinXP and Mac OS X, so it is arguable (at least) that some economies will be obtained in writing _directly_ to present day targets. Sure, some new problems will occur, but at least you on the MacBU will all be there when the problems arise rather than having to use telepathy and time-travel techniques to fix problems that were built into the code base back in 1985. It sure sounds like you are getting tied up in knots in the code you have now.

  11. Rick Schaut says:

    Etienne, I know what Cocoa is (c.f. http://weblogs.asp.net/rick_schaut/archive/2004/02/10/70789.aspx), and the primary reason Mac Office X had so few new features over and above Mac Office 2001 is because we completely refactored every piece of UI code in the entire suite to use some of the new facilities available under Mac OS X. We incorporated more of these new facilities in Office 2004, and we’re looking into doing still more work for the next major release.

    So, it looks to me like we’ve already achieved exactly what you would expect a port to Cocoa to achieve, have done so in far less time, and have not had to cope with the unknown problems that would come from a full port to Cocoa.

    Sure, there’s some cruft in Word’s code base, and that cruft presents a unique set of challenges. Porting everything to Cocoa might well resolve those particular challenges, but it brings an entirely new set of challenges to the fore. What’s not at all clear is how this swap from a known set of challenges to an unknown set of challenges would result in a net positive benefit.

  12. Evan Gross says:

    Hi Rick,

    Here’s one for your MacSWAT team (you might want to take a look as well):

    Word 2004’s text input handler(s) will hang (or otherwise behave badly) if handed > 128 characters. Not sure whether this is your kEventTextInputUpdateActiveInputArea handler, your kEventTextInputUnicodeForKeyEvent handler (presuming Word uses CarbonEvents), or both.

    Word v.X had a similar limit of 128 characters in its UAIA handler, but it would just discard the excess, truncating the input (if I remember correctly). But it didn’t crash.

    Word 2004 gets into some infinite loop, and hangs sucking 100% of the CPU. Fully reproducible with Spell Catcher X and TypeIt4Me, two input methods I know of that can send a UAIA event with (at least) 128 characters. In fact, I tried modifying Spell Catcher so it didn’t send a UAIA, but kEventTextInputUnicodeText instead (TSM will create and send kEventTextInputUnicodeForKeyEvent to the app when an input method sends this event), and the same thing happens in both cases.

    If Word 2004 does indeed use kEventTextInputUnicodeForKeyEvent, I hope it’s not actually enforcing, restricting or assuming anything about the amount of text in the kEventParamTextInputSendText parameter (that would be a "bad thing").

    I can certainly work-around this, but it would be nice if Word *really* handled Unicode text input at least as well as any Cocoa or MLTE app does. Spell Catcher is currently making the (incorrect, in Word 2004’s case) assumption that any app that creates a Unicode TSM Document can handle a large amount of text (well, I never send more than 8K – not a problem for NSTextView and MLTE).

    If you need anything from me to help get this fixed, just ask. Meanwhile, I have to rev. my product to deal with this. And I’m hearing from customers that MS support is telling them this is actually my bug.

    Evan Gross

    (Spell Catcher X author)

  13. I’ve done my fair share of dissing the bad software that is known as ‘Finder.app’ in OSX here. And sadly it remains a hugely buggy and unpolished bit of…

  14. peder says:

    Och ty benkartas