Anatomy of a Software Bug





Chris Mason is the person who hired me to work at Microsoft. By the time he
hired me, he’d already spent a great deal of time looking into the issue of
general software quality, and had written a memo (known as the “Zero Defects”
memo) that underlies many of our software practices today. The ideas have been
refined since then, but they haven’t changed much in terms of the basic
concepts.

One of my favorite Chris Mason quotes comes from that memo, “Since human
beings themselves are not fully debugged yet, there will be bugs in your code
no matter what you do.” We work to minimize the bugs in the software we ship,
but they’ll always be there.

The problem stems from the overall complexity of the software. In this
context, “complexity” doesn’t refer to the code itself. Rather, we’re talking
about the sheer volume of things the user can do. In Word, for example, we
have:

  • More than 850 command functions (e.g. Bold and Italic are the
    same command function)
  • More than 1600 distinct commands (e.g. Bold and Italic are
    distinct commands)
  • At any given time roughly 50% of these commands are enabled
    (conservative estimate)
  • With just 3 steps, the number of possible code execution paths
    exceeds 500 million
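The last figure follows from the ones before it. As a rough check, and assuming (as a simplification) that about half of the 1600 distinct commands are available at every step regardless of what came before, the arithmetic works out like this:

```python
# Figures taken from the list above; the 50% enabled rate is the post's own estimate.
DISTINCT_COMMANDS = 1600
ENABLED_FRACTION = 0.5
STEPS = 3

# Roughly 800 commands can be chosen at each step.
enabled = int(DISTINCT_COMMANDS * ENABLED_FRACTION)

# Each step multiplies the number of possible sequences by the enabled count.
paths = enabled ** STEPS

print(f"{enabled} enabled commands, {paths:,} three-step sequences")
```

That is 800 × 800 × 800 = 512,000,000, which is where "exceeds 500 million" comes from.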

Now, there’s a philosophical issue about the desirability of increasingly
complex software, but I’m not going to discuss it here. For all practical
purposes, I don’t think there’s much benefit to getting into a discussion about
it. It may be an interesting question on some level, but it’s one we’ll never
fully resolve. And I’m just not all that interested in getting bogged down in
an endless debate without the possibility of resolution.

I mention the issue of complexity because it leads to subtle interactions
that can be difficult to track down. To illustrate the point even further, I
thought I’d discuss the anatomy of one of the more famous bugs we’ve had in
Word: the “Disk is full” on save error. Before I do, however, I should point
out that Pierre Igot, after some prodding on my part, did provide us with a
sample document that helped us to track down one of the more subtle
interactions involved in this particular problem. For that, Pierre, I thank you
and so do Word users everywhere.

The story of this problem begins with a basic design decision made when
Richard Brodie was Word’s primary software architect. Brodie came to Microsoft
along with Charles Simonyi after working at Xerox PARC, where he’d worked on
Bravo—their version of the GUI word processor. A number of the ideas used
in Word came from that early effort. Brodie joined Microsoft in 1981, began
work on Word in the summer of 1982, and finished version 1.0 in October of 1983.
You can read about much of the story in Microsoft First Generation by Cheryl Tsang.

Brodie figured out that a document is really just a collection of pieces of
text, and that it didn’t really matter where each piece of text is physically
located within the document’s file. For that matter, you could have one piece
of text that came from one file and another piece of text that came from
another file. We refer to this collection of pieces of text as the “piece
table.” This design has a number of benefits. For example, if you copy text
from one document to another, you don’t have to actually copy the text from one
file to another—at least not right away. All you really need to do is
copy the appropriate entries in the piece table in the source document to the
piece table in the destination document. Of course, you do need to copy the
physical text and formatting from one file to the other when you save the
destination document, but delaying that physical copy until save time meant
that the actual copy/paste could be done very quickly.
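To make the idea concrete, here is a minimal sketch of a piece table. It is not Word’s actual data structure: the `Piece` and `Document` names, the dictionary standing in for files on disk, and the sample text are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Piece:
    source: str   # name of the file (or buffer) that physically holds the text
    start: int    # offset of the piece within that source
    length: int   # number of characters in the piece


@dataclass
class Document:
    pieces: list = field(default_factory=list)

    def text(self, files):
        """Materialize the document by reading each piece from its source."""
        return "".join(
            files[p.source][p.start:p.start + p.length] for p in self.pieces
        )


# Hypothetical backing stores standing in for files on disk.
files = {
    "a.doc": "The quick brown fox",
    "b.doc": "jumps over the lazy dog",
}

doc_a = Document([Piece("a.doc", 0, 19)])
doc_b = Document([Piece("b.doc", 0, 23)])

# "Paste" the first ten characters of document A in front of document B.
# Only a piece-table entry is copied; no text moves between files.
doc_b.pieces.insert(0, Piece("a.doc", 0, 10))

print(doc_b.text(files))
```

Note that `files["b.doc"]` is untouched by the paste. The physical copy only has to happen when the destination document is saved.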

This design also made implementing undo rather simple. In fact, according to
Brodie, implementing undo was the primary reason to use this design. With this
design, all you have to do is create an internal undo document. When the user
deletes some text from the current document, for example, you copy the deleted
entries from the piece table to the undo document and save some information
about where those piece-table entries had been located in the original document.
To undo the delete, you just copy the piece-table entries from the undo
document back to the original document.
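A sketch of that undo mechanism, under the same caveats: pieces are plain `(source, start, length)` tuples, and the function names are made up for the example rather than taken from Word.

```python
# Pieces are (source, start, length) tuples; all names here are illustrative.
main_doc = [("file", 0, 5), ("file", 5, 6), ("scratch", 0, 4)]
undo_doc = []   # holds (position, removed_pieces) records


def delete_pieces(doc, undo, index, count):
    """Remove `count` pieces starting at `index`, remembering them for undo."""
    removed = doc[index:index + count]
    del doc[index:index + count]
    undo.append((index, removed))


def undo_last(doc, undo):
    """Put the most recently deleted pieces back where they came from."""
    index, removed = undo.pop()
    doc[index:index] = removed


before = list(main_doc)
delete_pieces(main_doc, undo_doc, 1, 1)   # the user deletes the middle piece
after_delete = list(main_doc)
undo_last(main_doc, undo_doc)             # the user chooses Undo

print(after_delete)
print(main_doc == before)
```

No text is copied in either direction. The delete and the undo only move piece-table entries, which still point at wherever the text physically lives.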

This design does have one problem: where do you put the text that the user
types into the document if it doesn’t go into the file that’s behind the
document? To solve that, Brodie added something called a “scratch” file, and
the scratch file remains a core part of Word’s design to this day. On the Mac,
Word creates this file in your TemporaryItems folder. On Mac OS X, this folder
is located at /private/tmp/<UID>/TemporaryItems, where “UID” is your user
ID number (for most people, that’s 501, but it can be a different number
altogether depending on how your user account was created). If you start up Word,
open a terminal window, and get a listing of your TemporaryItems folder, you’ll
see a file named something like “Word Work File S_.” There may be a number
after the “_” character. That’s Word’s scratch file (the “S” standing for
scratch). You might also see one or more files named “Word Work File D_” with
some number after the “_” character. This is a back-up copy of a document file
(the “D” standing for “Document”).
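If you’d rather script that listing, something like the following would do it. This is only a sketch: the folder path is the one given above and may not exist on other OS versions, and the helper names are invented for the example.

```python
import os


def word_work_files(folder):
    """Return names in `folder` that look like Word work files, sorted."""
    return sorted(
        name for name in os.listdir(folder) if name.startswith("Word Work File")
    )


def classify(name):
    """Map the letter after 'Word Work File ' to the role described above."""
    kind = name[len("Word Work File "):][:1]
    return {"S": "scratch", "D": "document backup"}.get(kind, "unknown")


if __name__ == "__main__":
    # Path as given in the post; treat it as an assumption on any other system.
    folder = f"/private/tmp/{os.getuid()}/TemporaryItems"
    if os.path.isdir(folder):
        for name in word_work_files(folder):
            print(name, "->", classify(name))
    else:
        print("No TemporaryItems folder at", folder)
```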

At this point, we need to fast-forward the story by a decade to the next major
feature that brought this problem to the fore: multiple undo. For ten years,
Word had undo, but it was just a single-level undo. For Word 6, we added the
ability to go back and undo every change you’d made to the document since you
first started editing it. And, with Word’s document/file architecture, this
wasn’t all that difficult to do: just make the undo document contain multiple
records with one record for each change to the main document. It’s a very cool
feature, and most of us couldn’t think of how we’d survive without it. But it
leads to a problem.

It’s not uncommon for users to make a few edits to a document, save the
document, make a few more edits, save the document again, make a few more
changes, and continue this process of edit/save for hours on end. Each time you
delete text, however, the actual text itself exists in the last-saved file for
the document you’re editing, and, with multiple levels of undo, the undo
records for text deletions still point back to the last-saved version of the
document’s file before you deleted the text. The next time you save, Word can’t
close the last-saved version of the file, because the undo document still
contains a reference to it. So, if you keep editing and saving, you’ll
eventually hit an open file limit. At least this was true of Word 6. It’s
changed quite a bit since then.
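The effect is easy to model. The sketch below is a toy simulation, not Word’s code: it assumes one deletion per edit/save cycle, one new file version per save, and a made-up open file limit of 20.

```python
def simulate(cycles, open_file_limit):
    """Edit and save repeatedly; return the save number that fails, or None."""
    current = 0        # version number of the last-saved file
    undo_refs = []     # file versions still referenced by undo records
    for save_number in range(1, cycles + 1):
        # A deletion leaves an undo record pointing at the last-saved file.
        undo_refs.append(current)
        # Saving needs the new version open, plus every version undo still pins.
        open_files = set(undo_refs) | {current + 1}
        if len(open_files) > open_file_limit:
            return save_number   # the new file cannot be opened
        current += 1
    return None


print(simulate(cycles=100, open_file_limit=20))   # prints 20
print(simulate(cycles=10, open_file_limit=20))    # prints None
```

The number of pinned files grows by one with every cycle, so the failure arrives after a fixed number of saves. How long that takes in wall-clock time depends entirely on how often the user saves, which matches the "for a while" reports.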

Arguably, this is something we should have figured out before we shipped Word
6, but, as Chris Mason pointed out, we humans haven’t been fully debugged yet.
Moreover, it’s easy to say that one should have thought of a particular
interaction in a complex piece of software, but that’s way easier said than
done. When you’re implementing any given feature, you’re totally focused on
the basic problems involved in the feature itself. To put this into
perspective, the person who implemented multiple undo in Word is one of the
best developers who has ever worked on Word, and has, since, been recognized as
a Microsoft Distinguished Engineer.

The reality is that we hadn’t realized we’d created this situation when we
added multiple levels of undo. Moreover, this problem has several different
variations on the basic theme. At this point, the story involves our efforts to
understand the nature and scope of the problem, and to come up with the “best”
way to fix it. Because of the variations, however, the problem has been like an
onion. We’d peel away one layer of the onion, only to find some other variation
that we hadn’t, for various reasons, figured out before.

As we weave our way through the rest of the story, there are some important
points to keep in mind. The first is that I can’t fix what I can’t see, and, where
software bugs are concerned, “seeing” means being able to watch the program
execute, via some debugging tool, at the key point in the execution of the code
where the problem occurs. In order to do this, I have to have a precise set of
steps that consistently reproduces the problem. This is not all that different from
the problem a mechanic faces when trying to figure out the cause of that
mysterious engine noise that only occurs after you’ve been driving the car
around town for a few hours.

The second is that this particular problem is a developer’s worst nightmare.
The fundamental cause is a basic design decision that you made more than a
decade ago, and the only way to really fix it for certain is to rewrite the
entire application from the ground up. Since that’s simply not an option for a
product that you’ve shipped several times, you’re left with trying to make the
problem difficult for most users to run into while trying to also minimize the
negative effects if the user should ever run into the problem. This approach
can, unfortunately, lead you to believe that you’ve come up with an “optimal”
fix only to discover later that there’s another facet you haven’t taken into
consideration (because you didn’t even know it existed until you peeled away
the previous layer of the onion).

The third point to keep in mind is that we in Mac BU have relatively limited
resources. When there’s a problem that’s fundamental to Word itself, we tend to
let our Win Word siblings focus on that problem. Our efforts tend to have little
chance of adding to their efforts, and this frees us up to focus on problems
specific to Mac users. In general, this is the most efficient way to handle
problems that our users are having, but there can be instances where there’s a
Mac-specific dimension to a problem. As we’ll see soon enough, this particular
problem had a Mac-specific dimension that complicated our efforts to fix it,
and it took us a while to find that Mac-specific dimension.

Lastly, the fact that Mac Word’s code base has been forked from Win Word’s
means that the Win Word people can make a change in the code for one reason,
and that change can have other side-effects that we won’t see in the Mac
version until we run into some very specific circumstances that show us the
different behaviors caused by this change. In this particular case, Win Word
added two lines of code in a routine that would seemingly be completely
unrelated to this problem, but also made this problem much more difficult for
users to run into in Win Word than it was in Mac Word. This one is the last
piece (maybe I should say latest piece)
in the puzzle that we discovered only a few months ago.

Whew! That’s a lot to keep in the back of our heads, but, nonetheless, let’s
rewind back about ten years. Word 6.0 has just shipped on Windows, and we’re
pretty happy with people’s reactions to the product. It doesn’t take long,
though, for us to figure out that there’s a fly in the ointment. Reports start
trickling in about people editing their documents “for a while” at which point
they try to save their document and they get a “Disk is full” error. We’d ask
people what they were doing, and the response was always some form of vague
notion that they’d just been editing their document “for a while.” The precise
measurement of “for a while” varied from user to user. For some folks, it was a
little over an hour. For others, it was several hours. Reproducing the problem
appeared to be highly dependent upon the user’s work habits.

After several months of trying to figure out the problem, someone in testing
wrote a macro that inserted a large amount of text into the document and then,
in a loop, replaced successive words within the document, saving it after each
replacement. Run this macro for a while, and you get a “Disk is full” error on one
of the saves, at which point you can no longer save your document. Cool! We
now have steps that reproduce the problem.

So, this document got handed off to a developer, who then fired up Word
under the debugger, opened the document and ran the macro. The problem “reproduced,”
but, for reasons that weren’t apparent at the time, the error that the
developer ran into was subtly different from the error that the tester ran into.
The developer thought about the problem he was seeing, and came up with one of
those “optimal” fixes I mentioned above. It was the “right” fix in terms of the
problem the developer saw, but it wasn’t the “right” fix for the problem that
the tester saw.

What was this subtle difference between what the developer saw and what the
tester saw? As I mentioned above, the basic theme of the problem is to hit an open file limit. In this case, there are two limits:
Word’s internal open file limit and the OS’ open file limit. It turns out that
the debugger bumps the OS’ open file limit from what it would normally be when
you run Word outside the debugger. When the tester ran the macro, Word hit the
OS’ open file limit. When the developer ran the macro, with Word running under
the debugger, Word ran into its own, internal, file limit.
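The difference comes down to which of the two limits is lower at the time. A small sketch, with entirely hypothetical numbers (none of these are Word’s or any OS’s real limits):

```python
def first_limit_hit(files_needed, internal_limit, os_limit):
    """Report which limit blocks the request first, or None if neither does."""
    if files_needed <= min(internal_limit, os_limit):
        return None
    # The file count grows one at a time, so the lower limit is reached first.
    return "internal" if internal_limit <= os_limit else "os"


INTERNAL_LIMIT = 300        # hypothetical internal limit
OS_LIMIT_NORMAL = 256       # hypothetical OS limit outside the debugger
OS_LIMIT_DEBUGGER = 1024    # hypothetical OS limit as raised by the debugger

print(first_limit_hit(400, INTERNAL_LIMIT, OS_LIMIT_NORMAL))    # tester's run
print(first_limit_hit(400, INTERNAL_LIMIT, OS_LIMIT_DEBUGGER))  # developer's run
```

Same macro, same document, but the tester’s run ends at the OS limit and the developer’s run ends at the internal one.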

After a few iterations of the tester saying, “Sorry, but the bug’s not fixed
yet,” and the developer saying, “What are you talking about? I don’t see the
problem!” they both figured out that they were seeing different errors. Crap!
The problem only reproduces when you’re not
running under the debugger, which removes the one case where the developer can
actually see what’s going on. At this point, we have yet to figure out that the
problem involves hitting the OS’ open file limit. Even so, the
developer isn’t completely in the dark, and comes up with a fix for the tester’s
problem.

As I pointed out, the problem involves the undo document having a reference
to the previously-saved version of the document’s file. The developer’s
original fix was to add some code, in the case where Word hit its internal open
file limit, that would basically remove everything from the undo document (what
we refer to as “nuking the undo stack”). Nuking the undo stack allows the save
to proceed, because Word can now close the open files that were referenced by
the undo document. However, since the tester was seeing a different error, the
developer’s fix didn’t handle that case.

Nonetheless, the developer took a different approach. Knowing that the undo
document was very likely to be involved, one could walk through the undo
document, and copy the text for any pieces that pointed to the previously-saved
version of the document’s file to the scratch file. He coded up the solution,
and handed a buddy-build off to the tester. The tester ran the macro, and the
problem was fixed. The first layer of the onion had been peeled away, but the
fix still wasn’t an “optimal” fix. As it stood, the chances that a user would
run into the problem had been greatly reduced, but we still hadn’t dealt with
the “minimize the damage if they do hit it” side of the issue. That’s because,
at this point, we had yet to understand that the problem outside the debugger
had to do with the OS’ open file limit. Because this problem wouldn’t reproduce
under the debugger, the developer had no way of knowing exactly where the
failure was occurring. Without knowing that, the developer didn’t know where to
add the code that would “nuke the undo stack.”

To give you a sense of the time frame, this fix was ported from Win Word to
Mac Word during the Office 2001 development cycle, and was back-ported into
Word 98 for a service release that was done not too long after that. It’s also
at this point where the Win Word and Mac Word stories diverge. There are two
reasons for this. The first is that this was the point in time where Win Word
got that two-line code change that I mentioned above. The second is that the
open file limit under Mac OS is different than it is under Windows. I might be
mistaken on this point, but I think the open file limit under Mac OS X is
different from the limit under Mac OS 9 as well.

At this point, we still didn’t know that the basic problem involved hitting
the OS’ open file limit. After a while, though, we did know that Mac Word users
were seeing this problem way more often than Win Word users. In fact, the
difference was enough for Mac Word testers to start investigating the problem
directly. One of the things we did know is that the problem involved file
references in the undo document. So, we came up with a variation of the
original fix.

In order to understand this, we have to understand a basic principle of
fixes. You make the simplest code change required to fix the problem. This
reduces the chances that the fix will cause some other problem that is,
potentially, worse than the one you’re trying to fix. When you’re mucking about
with the locations where data is stored in files, the potential for
catastrophic problems resulting from your fix is high. In that sense, the
original fix for this problem was limited to copying what might be known as “simple”
pieces. A “simple” piece has only text. A “complex” piece might have a graphic,
or it might involve a field in the document, both of which are likely to have
data in the file in addition to the text itself.
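Here is a rough sketch of what "copy the simple pieces to the scratch file" amounts to. The `Piece` class, the `simple` flag, and the dictionary of file contents are assumptions for the example. The real fix operated on Word’s own piece table and files.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Piece:
    source: str          # file that physically holds the piece's data
    start: int
    length: int
    simple: bool = True  # True when the piece is text only (no graphic or field)


def release_old_file(undo_pieces, old_file, files, scratch_name="scratch"):
    """Copy simple pieces that reference `old_file` into the scratch file.

    Returns the rewritten piece list and whether `old_file` can now be closed.
    """
    rewritten = []
    for piece in undo_pieces:
        if piece.source == old_file and piece.simple:
            text = files[old_file][piece.start:piece.start + piece.length]
            offset = len(files[scratch_name])
            files[scratch_name] += text   # append the text to the scratch file
            piece = replace(piece, source=scratch_name, start=offset)
        rewritten.append(piece)
    # Complex pieces are left alone, so they can still pin the old file.
    can_close = all(p.source != old_file for p in rewritten)
    return rewritten, can_close


files = {"v1.doc": "Hello, world", "scratch": ""}
undo = [Piece("v1.doc", 0, 5), Piece("v1.doc", 7, 5)]
undo, can_close = release_old_file(undo, "v1.doc", files)
print(can_close, repr(files["scratch"]))
```

If even one piece in the undo document is "complex," `can_close` comes back `False` and the old file stays open. That is the gap the simplest version of the fix leaves behind.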

With this in mind, for Mac Word X, we modified the notion of what would be a
“simple” piece of text for the sake of deciding whether or not to copy a piece
from the previously-saved document’s file over to the scratch file. To view
this in a slightly different way, we made the code that copies undo document
referents more aggressive. This resolved another test case that the Mac Word
testers had developed, again using a slightly different macro that would
eventually cause the “Disk is full” error to occur. This fix didn’t actually
make it into the shipping release of Mac Office X, but it was included in a
subsequent SR (I don’t recall specifically which one).

At this point, we still don’t know that
the problem involves the OS’ open file limit. That discovery didn’t happen
until this past summer when, through the very persistent efforts of Mac Word’s
current lead tester, we were able to use some tools on Mac OS X to figure out
exactly what was happening. While we were able to verify this, we still didn’t
know the exact location where Word was failing to open a file due to having hit
the OS’ open file limit. Again, we still can’t get this to reproduce under the
debugger, and there are a couple of places in the save process where it can
fail because the OS won’t let Word open the file. So, rather than scatter fixes
all over the place, we went with the sure fix: lower Word’s internal open file
limit so we hit it before we hit the OS’ open file limit. This allows the code
that nukes the undo stack to kick in, and then the save succeeds.
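In sketch form, the change looks something like this. The limits, the function name, and the "one pinned file per undo record" model are all assumptions for illustration, not Word’s actual values or code.

```python
def attempt_save(undo_stack, internal_limit, os_limit):
    """Return (undo_stack_after, outcome) for one save attempt.

    Model: one file is open for the current document, one more is needed for
    the new version, and each undo record pins one older file.
    """
    needed = len(undo_stack) + 2
    if needed <= min(internal_limit, os_limit):
        return undo_stack, "saved"
    if internal_limit <= os_limit:
        # Internal limit is reached first: drop every undo record so the files
        # they pin can be closed, then let the save proceed.
        return [], "saved after nuking the undo stack"
    # The OS refuses to open the new file before the internal check can fire.
    return undo_stack, "Disk is full"


pinned = ["undo record"] * 260

# Internal limit above the OS limit: the save fails.
print(attempt_save(pinned, internal_limit=300, os_limit=256)[1])

# Internal limit lowered below the OS limit: the fallback kicks in.
print(attempt_save(pinned, internal_limit=200, os_limit=256)[1])
```

The user loses their undo history in the second case, but they keep their document, which is the better trade.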

This brings us to late February/early March of this year, and the discussion
I’d had with Pierre. While we still can’t reproduce the actual file open
failure under the debugger, we now have enough information about what causes
the problem to be able to predict when the failure will eventually occur. From
that, we knew enough about the bug for me to believe that Pierre shouldn’t
still be hitting that “Disk is full” save error in the version of Word he was
using. Yet, he was still running into the problem.

That was the bad news: there was something about this problem that we still
didn’t fully understand. However, armed with a sample document and the ability
to predict when the error will occur, I could do something we’d never been able
to do before: set up Word under the debugger, perform some steps in Word to see
if those steps caused the predictive condition to occur, and set breakpoints
that would tell me exactly why we weren’t able to copy pieces from the undo
document over to the scratch file.

This is where I discovered those two lines of code that had been added to
Win Word so long ago, yet hadn’t been added to Mac Word. When Word lays out a
page in page layout view and a header or footer is visible, it updates any
fields in the header or footer. If you have, say, a page field in the visible
header/footer, Word will update that field. This is particularly necessary when
you have the footer of one page and the header of the following page both
visible in the document window. Word has to lay out two pages in the same
update, so it updates fields for the first page footer, lays out that page,
then updates the fields for the next page header and lays out that page.

Now, why would this result in a field being copied over to the undo
document? Well, Word has something called “auto undo tracking.” Basically,
when you’re typing, Word automatically tracks the changes you’re making until
you do something that causes Word to close out the “typing” undo record. You
can see this when you click on the “Undo” dropdown on the standard toolbar. You’ll
see “Typing <text you typed>” at various locations in the dropdown
interspersed with other actions you’ve taken.

The two lines of code that were added to Win Word paused automatic undo
tracking while updating these fields in the header or footer during page
layout, then un-paused automatic undo tracking once the field update was
finished. Ugh! How, on earth, were we to ever figure out that these two lines
of code were the primary reason Win Word users weren’t seeing this problem
nearly as often as Mac Word users? In any event, if you’ve stayed with me long
enough, here’s a tip you can use until we release an SR of Word X (or earlier)
with this fix. If you have a document that has headers and footers with page
fields in them, do your editing in Normal view, and you’ll likely never hit the
“Disk is full” save error.
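For the curious, the shape of that two-line change is a pause before the field update and a resume after it. The sketch below is an analogy only: `UndoTracker` and its methods are invented for the example and are not Word’s interfaces.

```python
from contextlib import contextmanager


class UndoTracker:
    """Toy stand-in for automatic undo tracking; not Word's real interface."""

    def __init__(self):
        self.paused = False
        self.records = []

    def record(self, change):
        # Changes made while tracking is paused never reach the undo document.
        if not self.paused:
            self.records.append(change)

    @contextmanager
    def paused_tracking(self):
        previous = self.paused
        self.paused = True            # analogous to the "pause" line
        try:
            yield
        finally:
            self.paused = previous    # analogous to the "un-pause" line


def update_header_footer_fields(tracker, fields):
    """Refresh fields during page layout without creating undo records."""
    with tracker.paused_tracking():
        for name in fields:
            tracker.record(f"update field {name}")   # ignored while paused


tracker = UndoTracker()
tracker.record("Typing hello")
update_header_footer_fields(tracker, ["PAGE"])
tracker.record("Typing world")
print(tracker.records)
```

Without the pause, every layout pass with a visible header or footer would add a field piece to the undo document, and field pieces are exactly the "complex" kind that the copy-to-scratch fix leaves alone.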

Right about now, you’re probably asking, “Why did it take so long to figure
out what was up with this?” Well, you might as well ask why police departments
continue to have a large number of unsolved crimes on the books. The issue is the
same: the investigation stalls for the lack of any further leads to follow. For
the same reason that the police can’t just go out and start arresting anyone
who might be a suspect, we can’t go scattering potential fixes throughout the
code. Until we figure out what the precise nature of the problem is, we need
leads that we can follow. The mere fact that you’re running into a particular
problem isn’t a lead that I can follow. Specific details about potential
suspects, however, are leads I can follow. When it comes to software problems, leads
I can follow consist of information that helps me to reproduce the problems
consistently.

And, always remember that I can’t fix what I can’t see. I have to be able to
reproduce the problem while being able to run some kind of diagnostic tool. The
key to fixing a bug is predictability. Without predictability, I can’t fix it,
because without predictability I have no way to understand how the complex
interactions in modern software cause the specific problem to occur.

 

Rick

Comments (90)

  1. So what exactly is the Mac’s open file limit?

    Windows NT’s is in the hundreds of thousands of handles per process (depending on the amount of physical RAM available), I’m surprised you guys didn’t notice that the file handle count for Word was getting that high.

    Even Win9x had an open file handle limit in the tens of thousands IIRC.

  2. TWR IV says:

    Thanks for the interesting post. I remember the disk full error with great dismay although happily I haven’t seen it in years.

    If you want lists of reproducible Word 2004 bugs I’d suggest you put out a blog notice. We’ve already noticed a few around here, although nothing serious so far.

  3. matthew says:

    In DOS, I used to set files=20 to save my conventional RAM. I fiddle with this, moving it up and down, as some programs didn’t work if it was set too low.

  4. Ryan Gregg says:

    Great post! It’s really nice being able to read information about products I use every day and what it took to develop them and resolve their issues.

  5. B.Y. says:

    I remember reading in a book (by Steve Maguire, I think) saying that Mac and Win Word codebases have been merged. Are they split again ? I can see the reason for GUI code branching, but not internal stuff like undo levels.

  6. David Buxton says:

    Wasn’t this bug in Mac Word related to the classic Mac file open limit of 384 open handles? A fairly well known limit to the classic Mac system. Mac Word 6 and later had a habit of using file handles for fast saves without closing them properly, and at some point a user would hit the max open file limit, hence these problems?

  7. Eric Albert says:

    A couple questions from above are answered by Apple’s <a href="http://developer.apple.com/technotes/tn/tn1184.html">Technote 1184</a>, "FCBs Now and Forever".

    The quick summary: Mac OS versions prior to 9.0 were limited to 348 open files or, more correctly, open forks. The limit was actually far lower in very early system software releases, but that’s another story. Mac OS 9.0 increased the open file limit to 8169.

    Mac OS X, being a Unix-like system, has completely different open file limits.

  8. great article, reminds me of the details, and explanations of those that you’ll find at http://www.folklore.org/index.py

  9. Rick Schaut says:

    Being a BSD derivative, the open file handle limit on OS X is 256. It can be modified, but that’s through a native BSD call. The technote is <a href="http://developer.apple.com/qa/qa2001/qa1292.html">here</a>.

  10. Rick Schaut says:

    B.Y. the Mac/Win code bases were the same as of Word 6.0. We forked the code bases as of Office 98.

    In theory, you’d think the internals should remain within the same code base, but, in practice, it becomes a source, and quality, control nightmare. Unless you have everyone doing quality checks on both platforms for every code change, you end up with Win developers breaking the Mac product and vice versa.

  11. aellath says:

    i *knew* there was a reason i assiduously avoided Office! AppleWorks has rarely (in a quick scan of my memory just now, no events occur) screwed up on me.

  12. Chris says:

    Hmm… seems to me that it would just be easier (and better allocate resources on the machine) if Word would just do a better job of cleaning itself up every now and then to not have so many open files.

  13. Oh my goodness. 256 handles/process? That’s obscenely painful. I’m not surprised this showed up on Macs only.

  14. Eric Hildum says:

    Thanks for the explanation of the Cut and Paste process. Now I understand a particularly nasty bug affecting the US and Japanese versions of Windows Word that I encountered while working in Japan. Summary of problem: receive a document from the US (made using US Word), cut and paste into Japanese document (using Japanese Word). Upon autosave or save, Japanese document is destroyed in memory and on disk. The delayed copying of the text would explain the behavior perfectly. Apparently, there was a bug in the text copy code executed when documents were saved.

    By the way I did try to report the bug via our $500,000+/year global support contract with Microsoft, and was told directly by our Microsoft support representative, and I quote, "I wouldn’t know how to file a bug report for that." Never was able to get it addressed, even though I had two good sample documents for reproduction of the problem.

  15. Fascinating! Thanks, Rick.

  16. John A. says:

    My God! You mean to say that you couldn’t pin this bug for years because you couldn’t get to it with the debugger? What about debugging through code manipulation? This isn’t Schroedinger’s cat.

  17. Kiliman says:

    One of the things that really bugs me is when you get misleading or unhelpful error messages (like "Unexpected Error").

    I’m assuming that since Windows and OS X are completely different platforms, that the "Disk is full" error is coming from Word and not the OS.

    Wouldn’t it have been possible to instrument Word so it would display the actual error code returned from the OS? I imagine for Windows the underlying error code would be ERR_TOO_MANY_FILES_OPEN.

    Kiliman

  18. skeptic says:

    The part about this story that bothers me is this line: "At this point, we still don’t know that the problem involves the OS’ open file limit. That discovery didn’t happen until this past summer." Well, users had posited this years ago as the explanation of the problem. So why didn’t this knowledge filter up to the MacBU? Because there’s no way to report it? Because Microsoft is in denial? Also, why doesn’t Microsoft just stop spewing these temporary files all over my hard disk? (Remember when they used to be *visible*???) Isn’t there anyone at Microsoft who is ready to admit that this architecture sucks?

  19. Rick Schaut says:

    John, I did leave out a few details in the story. In answer to your question, yes. I wrote almost as much debug-only code trying to track this down as I’ve written shipping code for some features.

    Kiliman, that sounds easy on the surface, but there are several places where a failure in an OS call could result in that error message. Knowing where to instrument is almost as difficult as tracking down the bug itself.

    Remember, also, that the key point in this, the discovery about the open undo tracking while header/footer fields were being updated, would never have been caught by an instrumented version. It’s also a scenario that can’t be duplicated by running a macro to simulate user behavior. The key to getting that down was having the bevy of diagnostic tools available on Mac OS X.

    Skeptic, there’s a difference between "knowing" the source of a problem and being able to prove it. It’s like knowing that someone has committed a crime, but being unable to prove that fact in a court of law. What we couldn’t do was prove that this was the source of the problem. A sure way to introduce other bugs that are potentially worse than the one you’re trying to fix is to make some change to the code when you can’t prove that the change actually fixes the problem.

  20. MacJack44 says:

    This saga is truly informative (if exhausting), and supports my growing conviction that both OS development (by any company) and software development (by any company) has reached an overload limit.

    Notice that in both MS and Apple cases, advancement of the OS has resulted in endless updates, vulnerability discoveries and too much time and money spent by consumer/users just to "keep up." The same applies in "simple" word-processing apps. AppleWorks ported to OS X is buggier, less useful and more annoying than it ever was under OS 9 (for example). Third party T/Es like Nisus Writer have taken ferrrevvverrr to reach full featured useability and suffer from the same kind of "generational bugs" as OS X and AppleWorks.

    MS Word and MS Office are a perfect example of trying to "do all, be all" to prospective customer / users. But this is consumer choice, the line of thinking goes, "Better buy the whole hog, might need it some day…" Or the ever-popular: "Gotta have the latest version" muck. When in fact, I’d bet there’s a large percentage of users who use only a fraction of MS Office features. And, to "just write a novel" you need only use RichText format.

    This nonsense causes real problems: when publishers, for example, demand electronic submissions be in Word format. Why? There’s no legitimate reason, other than a kind of mindset prejudice.

    Simply put from the consumer / user standpoint: We have committed the sin of expecting too much from our computers. The "computer companies" are only trying to satisfy "Demand" (with a capital "D"). Their efforts (as exemplified by Rick’s story) have been heroic, yet there’s no end in sight for this kind of problem — which ultimately falls on the shoulders of end users.

    I’m not against "newest and best" but have disciplined myself to use and want "just what I need" to communicate, to enjoy myself and to stay informed. I think it’s time that everybody just slowed down a bit.

  21. Ryan Clark says:

    Wow. As a Mac user, and someone who’s been using Word for quite a while, it was fascinating to read about what had been causing the dreaded "disk is full" bug. I used to work as a consultant in our university computer lab and it was incredibly frustrating when people using Word 2001 on the Macs would run into this problem.

  22. John Fisher says:

    Very well-told story.

    I am completely in sympathy with the problem. It’s not MS’s fault that Word is too large – nobody wants Works, though it’s usually free; it’s not the developer’s fault that Word changes too much and too often. The problems are endemic to large, complex software with millions of users. Other Windows software I use is equally buggy, and frequently has less usable design.

    However, I also think there is a near-total lack of two aspects of good engineering practice at MS: 1) they have never understood trace, logging, and error messages 2) they do (did) not implement code dumps correctly.

    As one commenter pointed out, if they had simply passed through the actual OS error it would have helped. Better yet the code that failed should have logged a failure indicating what code failed and why.

    If they were able to dump their code correctly, they would be able to run Word in the debugger, dump it, and sift through the output to find which code failed, and what the values were at the time it failed. My (NT4 era) experience was that the debugger was buggy and symbols did not always align.

    Having a power-of-two number for a value would have been a strong hint here.

    MS paid support is infamous. We had the same dead-end experience with fundamental problems with NT4 Wolfpack failover (in the end it never worked) at a company in which MS had some ownership.

    Lastly, there should never be ‘unlimited’ anything. Unlimited Undo is a marketing standard, not good engineering. Developers should always place arbitrary limits on repetitive actions to prevent unknowable results like this. If there had been a limit, and a log entry had then said, "UnDoer reached u_limit" all would have been easily fixed. Their undo function is so complex, that it may not qualify as ‘repetitive.’ If so, this might be a sign that it is inherently too complex.
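
The bounded-undo idea in this comment could look roughly like the sketch below. It is only an illustration of "place a limit and log when it is reached", not a description of how Word's undo works; the class name `BoundedUndoStack`, the `limit` parameter, the `dropped` counter, and the log message are all hypothetical.

```python
import logging
from collections import deque

log = logging.getLogger("undo")


class BoundedUndoStack:
    """Undo stack with a fixed capacity (hypothetical illustration)."""

    def __init__(self, limit):
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self._limit = limit
        self._actions = deque()
        self.dropped = 0  # how many old actions were discarded so far

    def push(self, action):
        # When the stack is full, discard the oldest action and say so in
        # the log instead of growing without bound.
        if len(self._actions) >= self._limit:
            self._actions.popleft()
            self.dropped += 1
            log.warning("undo stack reached limit of %d; oldest action discarded",
                        self._limit)
        self._actions.append(action)

    def pop(self):
        # Return the most recent action, or None if nothing is left to undo.
        return self._actions.pop() if self._actions else None

    def __len__(self):
        return len(self._actions)
```

With a limit of 3, pushing five actions keeps only the last three and leaves two warnings in the log, which is the kind of trail the comment is asking for.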

  23. RG says:

    Thanks for the detailed walkthrough … I would love to see a similar explanation as to why text copied from Word 2004 and pasted into iChat comes out as a black graphic blob. It will paste fine into Mail, TextEdit, etc. (and then copy/paste from there into iChat), but it won’t go directly into iChat.

    I suspect it’s an iChat issue, since it pastes fine elsewhere, but I can’t find another Mac program that causes the same iChat behavior…

  24. Paul Berkowitz says:

    Great story, Rick.

    But really only 256 open files in OSX, compared to 8196 in OS 9?? I remember the joy that ensued when 384 increased to 8196. Is there something else in OS X that makes such a minute number of open files feasible? (Such as – do they get closed automatically or something like that?) I have never run into this limit in OS X in any app. Something just doesn’t sound right here, Rick. Do you, or anyone, have an idea where to look this up?

  25. Kiyooka says:

    Fascinating reading about how things work from the inside.

    I am looking for a contact in the MacBU Office group. As creator of arguably one of the most successful office add-ins of all time, and now on the mac, I’m interested in writing some add-ins for MacOffice. I sent some email to s sinofsky but he’s apparently PC only, as are all my other contacts.

    Do you guys have an add-in/office evangelist that you could point me towards? My contact info is on my blog site (above).

    -gen

  26. OS X file limit says:

    256 is only the baseline limit in OS X, per process.

    In OS 9, I believe the limit was global (but changeable by some utilities).

    The actual limit in OS X depends on the amount of RAM in the system (more RAM, more files allowed). Also, this limit can be changed by the application. I don’t know if there is a fixed upper limit.

    Photoshop and Illustrator hit the same limit when porting to OS X (due to an OS bug that left files open after certain API calls).

  27. SomeRandomGuy says:

    Larry, the 256 open file limit is a soft limit imposed on processes so that they can’t go around eating system resources. In a shell you can use ‘ulimit’ to increase it, or in an app you can use a BSD API to increase it if needed. There is a way to increase the limit globally for all processes, but I don’t want to post it since it is a bad idea to ever use it.

    The default open file limit for the system is 12288, but this can be increased using the sysctl command to increase kern.maxfiles if you really need to.
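
As a rough illustration of raising the per-process soft limit from inside an app: the comment does not name the BSD API, so the sketch below assumes it means the `getrlimit`/`setrlimit` pair, reached here through Python's standard `resource` module. The function name `raise_open_file_limit` and its `target` parameter are hypothetical.

```python
import resource


def raise_open_file_limit(target):
    """Try to raise this process's soft open-file limit to `target`.

    Returns the soft limit in effect afterwards. The soft limit is never
    lowered, and it is never pushed past the hard limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited or soft >= target:
        return soft  # already high enough; leave it alone
    wanted = target if hard == unlimited else min(target, hard)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    except (ValueError, OSError):
        # Some systems reject values above a kernel-wide ceiling even when
        # the hard limit is reported as unlimited; keep the old limit.
        pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Calling `raise_open_file_limit(1024)` early in startup asks for up to 1024 descriptors and reports what the process actually got. The change applies only to the calling process, which matches the per-process nature of the limit.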

  28. SomeRandomGuy says:

    I didn’t explicitly explain it, so I guess I should say that the 256 limit is per-process. So Word eating 240 files has no effect at all on Photoshop etc. until you get into the 12288 open files range at which point you probably have bigger problems.

  29. JD says:

    I’ve dealt with this problem at clients for a long time now and to have a different perspective is helpful. However, I’m still of the opinion that most people would be productive and happy with the equivalent of Word 5.1. There wasn’t a lot of extra stuff in the way and one could get a document completed quickly without the "wizards" and "assistants" popping up and needing to be killed one by one.

    Take the code from that, port it, sell it for $99; I’d suspect that lots of people would buy it because it would be small, fast, and easy to use.

    I’m of the opinion that smaller and more focussed is the way that software should be looking.

  30. Rick,

    Interesting post. But maybe I’m missing something – surely this would have been tracked down much earlier by some runtime error logging code, switched in and out with a flag? You should be able to flip a header flag, and get *every* error code returned by an OS call or internal function, logged assertion-style in a text file with the source file name, line number and error code. For projects the size of Word, I think this should be built-in from the bottom up. It also lets you send debug builds to your users and then ask them to send back the log – it’s been a lifesaver many times with my own programs.

    Gideon
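
The kind of switchable error logging described in this comment might be sketched as follows. This is only an illustration under stated assumptions: the flag `LOG_OS_ERRORS`, the wrapper `logged_call`, and the logger name are hypothetical, and Python's `OSError` stands in for an error code returned by an OS call.

```python
import errno
import inspect
import logging
import os

LOG_OS_ERRORS = True  # hypothetical switch; set to False to silence the log

log = logging.getLogger("oserrors")


def logged_call(func, *args, **kwargs):
    """Call func; if it fails with an OS error, log where and why, then re-raise."""
    try:
        return func(*args, **kwargs)
    except OSError as exc:
        if LOG_OS_ERRORS:
            caller = inspect.stack()[1]  # the frame that invoked logged_call
            log.error(
                "%s:%d: %s failed: errno=%s (%s)",
                os.path.basename(caller.filename),
                caller.lineno,
                getattr(func, "__name__", repr(func)),
                errno.errorcode.get(exc.errno, exc.errno),
                exc.strerror,
            )
        raise
```

Wrapping a call as `logged_call(os.open, path, os.O_RDONLY)` leaves its behavior unchanged, but a failure now records the calling file, the line number, and the symbolic error code before the exception propagates.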

  31. On my current Windows XP machine, the handle count for the system is 10509. This isn’t just file handles; NT doesn’t differentiate between file handles and other kinds of handles.

    There are 14 processes with more than 256 handles open, including IE (345) and perfmon (314).

    I’m surprised that a modern operating system like OSX has hard coded limits of any kind to be honest.

    But this is totally off-topic, and irrelevant to a rather remarkable piece of detection.

    Btw, for those like Gideon, John A, and Killiman: the fact that the Mac has such a restricted limit and Windows effectively doesn’t have a limit drastically reduces the ability to diagnose the problem. With NT processes routinely having hundreds of handles open, Word’s having one or two more simply falls out in the noise factor, while on OS X, those two handles could easily be the difference between a trivial resolution and one that requires much more work.

  32. Josh says:

    Yadda. Yadda. Yadda.

    I’ve got Word v.X and have installed updates 10.1.2, 10.1.4, and 10.1.5. And yet the bug still occurs… specifically when I have been working for hours and have saved the document many, many times. (In other words, only when I’ve worked really hard, and am really stressed and tired.)

    I suppose the most important question now is: is this fixed (finally) in Word 2004?

    Sheesh.

  33. Jason says:

    hi,

    i came across a software quote on a web site that says something like, vendors create buggy software to be able to sell it. as i remember this is from a famous author, but i lost the address. does anyone know about this quote and its author?

    thanks

  34. I stopped using Word within a year because of it crashing so often. That is what interested me in coding (so I could fix the errors) until I found that there was free software available that had over 90 percent of the functionality. And if there was an error, at least I didn’t have to pay hundreds of dollars to call Microsoft or for buying it in the first place. Then when OS X came out, I found that you could buy commercial software from small companies, whose software is much cheaper and their support is much better. A large company should be able to support their products much better and have the money to test it in more conditions for errors, yet they don’t. I have a strong suspicion that the bigger the company, the bigger the scam. It is a similar problem to how the US can possibly lose a war when they are at least 20 years ahead of the other most advanced Army in the world (of the technology that isn’t classified). You can have overwhelming firepower, but if you don’t have the intelligence to focus it at the right place and time, you can be defeated by a much smaller force. I would buy products from Microsoft if they would certify that they hadn’t coded anything on them. People aren’t switching to Linux because of the licenses; they are doing it because Microsoft products don’t work right. Microsoft’s poor quality makes the whole industry look bad and holds back innovation. Why buy new software or a new computer when it doesn’t provide any more real value and the support is still unaffordable?

  35. dave rogers says:

    Rick,

    Thanks for your weblog, and for what you do at BU.

    I’ve never been a big fan of MS (just ask Scoble), but you don’t deserve the negative comments being left in your weblog.

    The point about human beings not being fully debugged is amply demonstrated here.

    I hope you keep writing stories like this, and _some_ of the comments have been interesting and useful.

  36. Julie Krauss says:

    Rick,

    Amen! to all Dave Rogers’ comments.

    Those of us who have been professionals know how hard it can be to figure out the problem. There’s even a novel about this exact issue: "The Bug," by Ellen Ullman.

    Once again, Rick, thanks for the story. It’s fascinating.

  37. Michael says:

    A possible lead:

    For a while I was running into the symptoms described very frequently in Word v.X, especially while translating documents with complex formatting (tables, styles, graphics, etc.). I use a translation assistance tool written in VBA called Wordfast (see wordfast.net), which among other things is able to automatically set the language property of translated segments to the desired target language. (Of course, you still have to do the actual translating yourself.)

    While working on one job, I started running into the "open files" bug so frequently that I couldn’t work properly. I used a tool called Sloth (sorry, I don’t have a URL handy) that identifies the actual names and paths of open files for each active application. I didn’t take notes at the time, so I no longer have the actual filenames, but I found a very large number of temporary files that were apparently related to spelling and grammar checking in the target language. When I temporarily removed the proofing tools for that language from their standard location, I stopped encountering the "open files" bug every few minutes and was able to continue work. This was several months ago and I don’t remember exactly what I did, but I think I probably also turned off the Wordfast feature that sets the language property of target-language segments.

  38. Rick Schaut, of the Microsoft Mac BU has written an excellent article on how hard it can be to track down a bug. However, lest one thinks that only big applications like Word can be that hard to troubleshoot, let…

  39. Maireth says:

    http://word.mvps.org/FAQs/WordMac/DiskFullError.htm

    I found this helpful. It is a tweak that lets you clear out all the temporary files opened by Word so that you can continue working without Microsoft’s "handy" fix of telling you to close every 20 saves. Really, who counts?

    Rick, thanks for posting. It was fascinating to see this problem so fully dissected.

  40. Norman Diamond says:

    5/20/2004 10:28 AM Eric Hildum

    > Summary of problem: receive a document from
    > the US (made using US Word), cut and paste
    > into Japanese document (using Japanese
    > Word). Upon autosave or save, Japanese
    > document is destroyed in memory and on disk.

    Odds are that it didn’t matter if one of the source documents had been made using US Word. Microsoft occasionally tested US Word (including but not limited to the case described in this blog entry). Odds are that the bug is either wholly within Japanese Word, or within the combination of Japanese Word and Japanese Windows.

    > The delayed copying of the text would
    > explain the behavior perfectly. Apparently,
    > there was a bug in the text copy code
    > executed when documents were saved.

    The bug could be anything related to the way Word stores its documents, not necessarily related to copying and pasting. Though it is fortunate that the copy on disk wouldn’t get corrupted until you actually did a save.

    > By the way I did try to report the bug via
    > our $500,000+/year global support contract
    > with Microsoft, and was told directly by our
    > Microsoft support representative, and I
    > quote, "I wouldn’t know how to file a bug
    > report for that."

    Surely your support contact only understood the US. Microsoft’s idea of globalization is still pretty much US-centric. If you had a support contract with Microsoft Japan then you would be able to submit a report.

    But even if you could get the report submitted, odds are that it would still never get solved. There are quite a lot of things you can see in Japanese versions of Office, and even just Windows without Office, that make it pretty clear that no testing was ever done. Things happen in the Start menu on the first reboot after installation, that could not be missed by anyone except Microsoft. Well, one of the bugs introduced by Windows NT4 SP4 was half-fixed in SP5, but has never been fully fixed in Japanese versions of Windows. In the US it was fixed in Windows 2000 during beta testing, but you expect a fix in US Windows’ handling of Japanese to be copied back in to fix Japanese Windows’ handling of Japanese, ha, no way.

  41. Rick, thanks for the interesting article. So that’s what caused me and my company to lose so much work 🙁

    We ran into this on upgrading to Word 98. Called Microsoft support and got little relief. It would have been very helpful if they had said "Yes, it is a known bug – please do not adjust your set". But they didn’t, so I reinstalled the OS and Office, on all machines – and still couldn’t fix it.

    In fact, a "fix" came out, but we continued to have the problem on and off for a long time.

    We resorted to quitting Word every hour or so. Certainly, the problem could be attributed to "saving too often".

  42. Anonymous says:

    Ensight – Jeremy C. Wright » Anatomy of a Microsoft Bug

  43. Anonymous says:

    Joakim Andersson’s blog » A bug is not always that easy to fix

  44. Anatomy of a Software Bug…

  45. Me. What I’m Not Watching. What I’m Downloading. Cups. What I’m Listening To. What I’m Reading: "Ancient Light" (with SPOILERS and ENDING). Web.
    Contains great content EXCLUSIVE to loyal weekend readers!
    Plus a super-multipoll!
    Me Phew. Definitely need a three day weekend right now: feeling very abraded by the…

  46. Chris Online says:

    Anatomy of a Software Bug

  47. Anonymous says:

    diego sevilla’s weblog » The Anatomy of a Software Bug

  48. Anonymous says:

    Bug Links » Undo in Word 6

  49. Back when I started blogging, I had a go-around with Pierre Igot.  I’m not going to rehash it, but I…

  50. The post is a bit old, but I only read it today and it’s worth reading: Anatomy of a Software Bug

    It tells how a…

  51. Field of dreams… The software and IT industry is a field of dreams. More than ever all can come to the field to offer ideas and contribute to its evolution. One means of doing so is through portals. A …

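  52. I’m a software tester, but I suddenly realized that the "Test" category on my blog doesn’t actually have many posts, so I think I should talk more about software testing. Recently I’ve been reading the blogs of some other experts,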

  53. [Anatomy of a software bug] A Microsoft developer explains how a tricky bug in Word’s undo stack behavior was tracked down….
