In the product end game, every change carries significant risk


One of the things I mentioned in my talk the other week comparing school with Microsoft is that in school, as the deadline approaches, the work becomes increasingly frantic. On the other hand, in commercial software, as the deadline approaches, the rate of change slows down, because the risk of regression outweighs the benefit of the fix.

A colleague of mine offered up this example from Windows 3.1: To fix a bug in GDI, the developers made a very simple fix. It consisted of setting a global flag when a condition was detected, then checking the flag at another place in the code and executing a few lines of code if it was set. The change was just a handful of lines; it was very tightly scoped, and it did not affect the behavior of GDI if the flag was not set. They tested the code, it fixed the problem, and everything looked good. What could possibly go wrong?

A few days after the fix went in, the GDI team started seeing weird crashes that made no sense in code completely unrelated to the places where they made the change. What is going on?

After some investigation, they discovered a memory corruption bug. In 16-bit Windows, the local heap came directly after the global variables, and local heap memory was managed in the form of local handles. A common error when working with the local heap was using a local handle as a pointer rather than passing it to the LocalLock function to convert the handle to a pointer. The developers found a place where the code forgot to perform this conversion before using a local handle. (In Windows 3.1, most of GDI was written in assembly language, so you didn’t have a compiler to do type checking and complain that you’re using a handle as a pointer.) Using the handle as a pointer resulted in a global variable being corrupted.

Investigation of the code history revealed that this bug had existed in the code since the day it was first written. Why hadn’t anybody encountered this bug before?

The handle that was being used incorrectly was allocated at boot time, so its value was consistent from run to run. The corruption took the form of writing a zero into memory at the wrong location, and it so happened that the variable that was accidentally being set to zero was not used often, and at the time the corruption occurred, it happened to have the value zero already.

Adding a new global variable shifted the other global variables around in memory, and now the accidental write of zero hit an important variable whose value was usually not zero.

In the product end game, every change carries significant risk. It’s often a more prudent decision to live with the bug you understand than to fix it and risk exposing an even worse bug whose existence may not come to light until after you ship.

Comments (22)
  1. Tim Kay says:

    After starting my first job out of school, I put it this way: In school, there are deadlines that really can’t slip, but quality can slip like mad.  At work, there is a level of quality you have to have, and so it is the deadlines that can slip.

  2. J says:

    At my last job we carefully assessed the risk of all of our changes near the ship date, and we bordered on paranoia when it came to fixing anything other than a minor bug.  It worked out well for us because I don’t think we ever shipped with a major issue, and we’d fix all the little things in regular maintenance releases.

    The problem was a partner vendor who clearly didn’t have the same level of appreciation for reducing risk at the end of a product cycle.  No matter how often we asked that they deliver us the exact same software plus the single bug fix, we’d invariably find out that they had shipped something with extra bug fixes.

    And of course the only way to find that out is to find a new bug in a completely different area of the product and trace it back to the vendor’s new release.  Ugh.

  3. Teo says:

    I keep having a related argument about Visual Studio. You give us betas to find bugs; I compile our code base with them, find a major problem – like a C++ standard violation – report it, and then my bugs get closed with "thanks, but we have already locked down the code base and can’t fix it now." (Admittedly, the next VS usually fixes my bugs.) The VS team and I have been playing this game since the VS 2003 era :) But now I am growing older and tired of this game; I want the old-fashioned betas, which used to lead to reported bugs getting fixed.

  4. Brian Tkatch says:

    In school failure puts you back in the same class.

  5. Krunch says:

    "in school, as the deadline approaches, the work becomes increasingly frantic. On the other hand, in commercial software, as the deadline approaches, the rate of change slows down"

    I think you can see the same pattern in school and in commercial software if you compare the shipping deadline in school to the "planning" deadline in commercial software (or whatever you call the deadline after which Product Management says "we are not going to accept any more changes except for exceptions matching specific criteria").

  6. Anonymous says:

    "The developers found a place where the code forgot to perform this conversion before using a local handle."

    Pretty clever developers.  At least, I can see they’re good at re-assigning blame.  "We found this code, and it forgot to perform this conversion!  Silly sentient code, always forgetting to do things…"  Rather than say, "we forgot to perform this conversion."  :-)

  7. Aaron G says:

    Schroedinbugs are always the most fun.  There’s nothing quite like staring at code you know is wrong and trying to figure out how it ever worked in the first place.

  8. Jenk says:

    <i>"The developers found a place where the code forgot to perform this conversion before using a local handle."

    Pretty clever developers.  At least, I can see they’re good at re-assigning blame.  "We found this code, and it forgot to perform this conversion!  Silly sentient code, always forgetting to do things…"  Rather than say, "we forgot to perform this conversion."  :-) </i>

    Because of course the specific developers who wrote those original lines of code were still working on that specific section years later….  ;)

  9. BrotherLaz says:

    So Raymond, you say it would be better to ship with a hidden bug that would surface when this variable is used (i.e., a broken feature that crashes Windows) than to accidentally discover it?

    [The hidden bug was harmless for years, and it would have remained harmless for many more years if nobody had perturbed the system. When you’re trying to stabilize your product, you don’t go around perturbing it. Obviously you want to identify the defects in your system, but you don’t want to magnify them in the process. -Raymond]
  10. Pierre B. says:

    I once had a problem like that from code I was given to maintain. For a few years, until we finally found the source of the problem, after each new version we’d have to modify the code until the memory corruption vanished once again.

    (The original author was obviously a beginner in object-oriented design: he had designed a base class with a Do(void *) virtual function, and each subclass would cast the void * to something else. These classes were actually multi-threaded and would swap() pointers instead of copying memory for optimization. It was a money laundering scheme applied to memory management. You could not follow where the memory went.)

  11. Jiri Dvorak says:

    I got my lesson six years ago when I was working on a budget RTS game (it started as a turn-based game but that is a different story).

    A few hours before it was sent to the DVD replication plant we were still fixing things :-|. One thing I fixed was a visual glitch in the transition between the control panel (which contains commands and information for the currently selected unit) and the panel used to display the text of in-game conversations. From limited testing (the game was long, so it was not possible to play the entire game in the time we had left) everything appeared to be working correctly, so it was sent to replication.

    The problem was that the entire transition was triggered by two scripting commands. Those commands should have been specified in a specific order. The older version worked even if the order was incorrect. The new version did not. There was exactly one place in the entire game where the order was incorrect: near the end of one late level. Because of this, the control panel was never restored and the mission could not be completed. In the end we had to create a zero-day patch to fix this.

  12. Dean Harding says:

    BrotherLaz: The "bug" was essentially this:

    int x = 0;
    // stuff
    x = 0; // this is the bug
    // more stuff

    It just set something that was already 0, to 0. By changing the layout of global memory, the bug caused SOMETHING ELSE to get set to 0.

  13. Miral says:

    I had one of those recently.  I added some fields to a structure and suddenly some other code exploded.  Turned out one of the modules (the one responsible for initialising the structure) was compiled with a different byte packing than the rest of the code; when it initialised the structure it ended up writing a zero just past the "real" end of an array.  Previously, that didn’t cause any problems, because the next field in the structure was already zero at that point (and it wasn’t writing past the end of the structure).  But after the change, it was now toasting a vtbl pointer and causing a virtual function call to die with a null reference.

    It was quite a bit of fun to track that one down.  As Aaron said, sometimes you really have to wonder how stuff ever works at all… ;)

  14. peterchen says:

    @Teo: right to the mark. connect.microsoft.com is quite a discouraging experience.

  15. Andreas says:

    I’ve experienced this as well. I wrote a driver for the graphics hardware of our machine (instead of using memcpy() to blit) and weird things started to happen to the screen under certain circumstances. Of course everyone blamed my driver but I knew it wasn’t the source of the bug.

    Finally I discovered that the bug was a memory corruption bug due to incorrect clipping of screen elements before they were rendered. The bug had been there all along, but before my driver it had corrupted system memory instead of video memory.

  16. James says:

    Teo, one challenge is explaining to end users (you) why it’s better to ship than to delay shipping for the testing needed to fix the reported bugs, plus the bugs those bug fixes introduce.

  17. Hardware Junkie says:

    I had to tell an earnest, newly promoted, dev mgr who wanted to fix every bug "We can’t ship this until you stop changing it." He got it.

    OTOH, even at the mother ship, I’ve seen massive changes demanded late in the dev cycle just for the sake of doing something, in this case, about perf.

  18. yngwe says:

    Adding a global flag to change the behaviour of another function is bad. I’m glad it slapped you in the face, honestly.

  19. rev says:

    i had the same result when refactoring a header.  all i did was rearrange the variable declarations to logically group public/protected/private and all kinds of bad happened.  sure enough, the code was writing past the end of a buffer and in the past it was benign.

  20. Teo says:

    My experience developing Windows software is that Debug and Release x86 / x64 builds have sufficiently different memory layouts so these bugs happen very rarely in our code base. I cannot remember a major memory corruption happening in the last 5 years and certainly not in the final sprint before release.

    And Visual Studio and the C++ compiler are tools I use every day all day so I am very sensitive to their bugs.

  21. Josh says:

    It happened to me a couple years ago while working at Microsoft on another product.  We were getting shutdown crashes that appeared to be in some input related code, but only when compiled for release; debug code never hit it.  It turned out the problem was caused by a recent change to the input code (reverting the change fixed the problem), but the crash still appeared unrelated.

    After working with other people on my team to debug it for quite a while (I was relatively new to the debugger at the time), we discovered that the stack was getting corrupted, and the single value being written was a status code indicating that a timer was interrupted.  But if we tried to watch that memory address with an on-write breakpoint, it never triggered.  Searching further, I discovered a bit of code called in the same thread, but otherwise unrelated, which was passing a struct to a timer initialization function, but the struct was stack allocated. Problem was, the timer was kernel mode, and involved a status write back when it fired (even though we didn’t care about the value at that point).  Most of the struct was configuration data; after you sent it, you didn’t care about it anymore, but one single 32 bit integer was a write back.

    I eventually figured out that the input change had increased the depth of the stack at the time of shutdown so it overlapped the area where that struct had been allocated. The breakpoint never tripped because a user mode debugger wouldn’t see kernel activity (the timer writeback), and it didn’t hit when compiled for debug because the unoptimized code made the write back hit an innocuous (or mostly innocuous) address.

    Still the most annoying problem I’ve ever had to debug.

  22. ton says:

    @peterchen

    "@Teo: right to the mark. connect.microsoft.com is quite a discouraging experience."

    Peter, you’re on the money there. I signed up for the site trying to report a bug with the VB debugger crashing VS 2003 w/ SP1. (Why such an elite developer as myself would be maintaining a VB.NET project, don’t ask, but I’ll admit I like the case insensitivity. :-) ) After starting the debugger to go through a largish project (approx. > 1MB), the debugger often crashes on shutdown and takes VS with it. Probably some kind of segmentation fault caused the issue, I think. I tried to report this on Connect, but the search feature couldn’t find any obvious place to look up FAQs related to the problem or to report the issue. The entire site is a major usability nightmare that seems designed to discourage the user from reporting issues. An incredibly frustrating experience, to say the least.

Comments are closed.