The time one of my colleagues debugged a line-of-business application for a package delivery service


Back in the days of Windows 95 development, one of my colleagues debugged a line-of-business application for a major delivery service. This was a program that the company gave to its top-tier high-volume customers, so that they could place and track their orders directly. And by directly, I mean that the program dialed the modem (since that was how computers communicated with each other back then) to contact the delivery service’s mainframe (it was all mainframes back then) and upload the new orders and download the status of existing orders.¹

Version 1.0 of the application had a notorious bug: Ninety days after you installed the program, it stopped working. They forgot to remove the beta expiration code. I guess that’s why they have a version 1.01.

Anyway, the bug that my colleague investigated was that if you entered a particular type of order with a particular set of options in a particular way, then the application crashed your system. Setting up a copy of the application in order to replicate the problem was itself a bit of an ordeal, but that’s a whole different story.

Okay, the program is set up, and yup, it crashes exactly as described when run on Windows 95. Actually, it also crashes exactly as described when run on Windows 3.1. This is just plain an application bug.

Here’s why it crashed: After the program dials up the mainframe to submit the order, it tries to refresh the list of orders that have yet to be delivered. The code that does this assumes that the list of undelivered orders is the control with focus. But if you ask for labels to be printed, then the printing code changes focus in order to display the “Please place the label on the package exactly like this” dialog, and as a result, the refresh code can’t find the undelivered order list and crashes on a null pointer. (I’m totally making this up, by the way. The details of the scenario aren’t important to the story.)

Okay, well, that’s no big deal. A null pointer fault should just put up the Unrecoverable Application Error dialog box and close the program. Why does this particular null pointer fault crash the entire system?

The developers of the program saw that their refresh code sometimes crashed on a null pointer, and instead of fixing it by actually fixing the code so it could find the list of undelivered orders even if it didn’t have focus, or fixing it by adding a null pointer check, they fixed it by adding a null pointer exception handler. (I wish to commend myself for resisting the urge to put the word fixed in quotation marks in that last sentence.)
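For comparison, here is a minimal sketch of what the null pointer check option might have looked like. Since the focus scenario itself is invented, every name below (OrderList, findListWithFocus, refreshUndeliveredOrders) is hypothetical and exists only for illustration; none of it comes from the actual application.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the undelivered-orders list control. */
typedef struct OrderList {
    int refreshCount;   /* how many times the list has been refreshed */
} OrderList;

/* Hypothetical lookup that mimics the made-up bug: it only finds the
   list when the list has focus, and returns NULL otherwise. */
static OrderList *findListWithFocus(OrderList *list, int listHasFocus)
{
    return listHasFocus ? list : NULL;
}

/* Refresh with a null pointer check. Returns 1 if the list was
   refreshed, 0 if there was no list to refresh. */
static int refreshUndeliveredOrders(OrderList *list)
{
    if (list == NULL) {
        return 0;       /* nothing to refresh; do not dereference */
    }
    list->refreshCount++;
    return 1;
}

/* Usage: the refresh succeeds with focus and is skipped without it. */
static void demoRefresh(void)
{
    OrderList orders = { 0 };
    assert(refreshUndeliveredOrders(findListWithFocus(&orders, 1)) == 1);
    assert(refreshUndeliveredOrders(findListWithFocus(&orders, 0)) == 0);
    assert(orders.refreshCount == 1);
}
```

The check only hides the symptom, of course. The better fix would still be to locate the list without relying on focus at all.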

Now, 16-bit Windows didn’t have structured exception handling. The only type of exception handler was a global exception handler, and this wasn’t just global to the process. This was global to the entire system. Your exception handler was called for every exception everywhere. If you screwed it up, you screwed up the entire system. (I think you can see where this is going.)

The developers of the program converted their global exception handler to a local one by going to every function that had a “We seem to crash on a null pointer and I don’t know why” bug and making these changes:

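extern jmp_buf caught;        // shared with the global exception handler
extern BOOL trapExceptions;   // TRUE while a protected function is running

void scaryFunction(...)
{
 if (setjmp(caught)) return;  // the handler's longjmp lands here: give up
 trapExceptions = TRUE;       // tell the handler we want the exception
 ... body of function ...
 trapExceptions = FALSE;      // normal exit: stop trapping
}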

Their global exception handler checks the trapExceptions global variable, and if it is TRUE, they set it back to FALSE and do a longjmp which sends control back to the start of the function, which detects that something bad must have happened and just returns out of the function.

Yes, things are kind of messed up as a result of this. Yes, there is a memory leak. But at least their application didn’t crash.

On the other hand, if the global variable is FALSE, because their application crashed in some other function that didn’t have this special protection, or because some other totally unrelated application crashed, the global exception handler decided to exit the application by running around freeing all the DLLs and memory associated with their application.
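The two paths through the handler can be sketched in portable C. This is only a simulation under stated assumptions: the real handler was invoked by the system on a CPU fault, whereas here the hypothetical simulatedGlobalHandler is called directly, and the "free everything" path is reduced to setting a hypothetical appTornDown flag.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

static jmp_buf caught;          /* where the handler jumps back to */
static int trapExceptions = 0;  /* nonzero while a protected function runs */
static int appTornDown = 0;     /* hypothetical: set on the "free everything" path */
static int lastValueSeen = 0;   /* something for the function body to do */

/* Stand-in for the system-wide handler. In this sketch it is called
   directly rather than being dispatched by the operating system. */
static void simulatedGlobalHandler(void)
{
    if (trapExceptions) {
        trapExceptions = 0;
        longjmp(caught, 1);     /* resume at the setjmp in the protected function */
    }
    appTornDown = 1;            /* unprotected crash: tear the application down */
}

/* Returns 1 if the body completed, 0 if it bailed out via longjmp. */
static int scaryFunction(const int *p)
{
    if (setjmp(caught)) return 0;   /* second return: something went wrong */
    trapExceptions = 1;
    if (p == NULL) simulatedGlobalHandler();  /* simulate the null pointer fault */
    lastValueSeen = *p;             /* only reached when p is valid */
    trapExceptions = 0;
    return 1;
}

/* Usage: a protected fault is swallowed, an unprotected one is not. */
static void demoScaryFunction(void)
{
    int value = 42;
    assert(scaryFunction(&value) == 1);   /* normal path */
    assert(scaryFunction(NULL) == 0);     /* fault swallowed by longjmp */
    assert(!appTornDown);
    simulatedGlobalHandler();             /* fault outside any protected function */
    assert(appTornDown);
}
```

Note that anything the function body had allocated before the jump is simply abandoned, which is where the memory leak comes from.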

Okay, so far so good, for certain values of good.

These system-wide exception handlers had to be written in assembly code because they were dispatched with a very strange calling convention. But the developers of this application didn’t write their system-wide exception handler in assembly language. Their application was written in MFC, so they just went to Visual C++ (as it was then known), clicked through some “Add a Windows hook” wizard, and got some generic HOOKPROC. (I don’t know if Visual C++ actually had an “Add a Windows hook” wizard; they could just have copied the code from somewhere.) Never mind that these system-wide exception handlers are not HOOKPROCs, so the function has the wrong prototype. What’s more, the code they used marked the hook function as __loadds. This means that the function saves the previous value of the DS register on entry, then changes the register to point to the application’s data, and on exit, the function restores the previous value of DS.

Okay, now we’re about to enter the set piece at the end of the movie: Our hero’s fear of spiders, his girlfriend’s bad ankle from an old soccer injury, the executive toy on the villain’s desk, and all the other tiny little clues dropped in the previous ninety minutes come together to form an enormous chain reaction.

The application crashes on a null pointer. The system-wide custom exception handler is called. The crash is not one that is being protected by the global variable, so the custom exception handler frees the application from memory. The system-wide custom exception handler now returns, but wait, what is it returning to?

The crash was in the application, which means that the DS register it saved on entry to the custom exception handler points to the application’s data. The custom exception handler freed the application’s data and then returned, declaring the exception handled. As the function exited, it tried to restore the original DS register, but the CPU said, “Nice try, but that is not a valid value for the DS register (because you freed it).” The CPU reported this error by (dramatic pause) raising an exception.

That’s right, the system-wide custom exception handler crashed with an exception.

Okay, things start snowballing. This is the part of the movie where the director uses quick cuts between different locations, maybe with a little slow motion thrown in.

Since an exception was raised, the custom exception handler is called recursively. Each time through the recursion, the custom exception handler frees all the DLLs and memory associated with the application. But that’s okay, right? Because the second and subsequent times, the memory was already freed, so the attempts to free it again will just fail with an invalid parameter error.

But wait, their list of DLLs associated with the application included USER, GDI, and KERNEL. Now, Windows is perfectly capable of unloading dependent DLLs when you unload the main DLL, so when they unloaded their main program, the kernel already decremented the usage count on USER, GDI, and KERNEL automatically. But they apparently didn’t trust Windows to do this, because after all, it was Windows that was causing their application to crash, so they took it upon themselves to free those DLLs manually.

Therefore, each time through the recursion, the usage counts for USER, GDI, and KERNEL drop by one. Zoom in on the countdown clock on the ticking time bomb.

Beep beep beep beep beep. The reference count finally drops to zero. The window manager, the graphics subsystem, and the kernel itself have all been unloaded from memory. There’s nothing left to run the show!
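The chain reaction can be modeled with a small simulation. This is a sketch under simplifying assumptions, not a reconstruction of the real code: the usage counts are plain integers, the invalid DS restore is modeled as a direct recursive call, and all the names (usageCount, appDataValid, customExceptionHandler, crashTheApplication) are invented for illustration.

```c
#include <assert.h>

enum { USER, GDI, KERNEL, MODULE_COUNT };

static int usageCount[MODULE_COUNT];  /* simulated module usage counts */
static int appDataValid;              /* is the app's data segment still allocated? */
static int handlerCalls;              /* how many times the handler has run */

/* The system survives only while every core module is still loaded. */
static int systemAlive(void)
{
    for (int i = 0; i < MODULE_COUNT; i++) {
        if (usageCount[i] == 0) return 0;
    }
    return 1;
}

static void customExceptionHandler(void)
{
    handlerCalls++;

    /* "Free everything": the application's data, plus a manual
       release of USER, GDI, and KERNEL on every pass. */
    appDataValid = 0;
    for (int i = 0; i < MODULE_COUNT; i++) {
        if (usageCount[i] > 0) usageCount[i]--;
    }

    if (!systemAlive()) return;       /* nothing left to run the show */

    /* Simulated __loadds epilogue: restoring the saved DS fails because
       the segment was freed, which raises another exception. */
    if (!appDataValid) customExceptionHandler();
}

/* Runs the scenario and returns how many times the handler was entered. */
static int crashTheApplication(int initialUsage)
{
    assert(initialUsage >= 1);
    for (int i = 0; i < MODULE_COUNT; i++) usageCount[i] = initialUsage;
    appDataValid = 1;
    handlerCalls = 0;
    customExceptionHandler();         /* the original null pointer fault */
    return handlerCalls;
}
```

In this model, if each module starts with a usage count of 3, crashTheApplication(3) enters the handler three times before the counts hit zero. The countdown clock is exactly as long as the smallest usage count.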

Boom, bluescreen. Hot flaming death.

The punch line to all this is that whenever you call the company’s product support line and describe a problem you encountered, their response is always, “Yeah, we’re really sorry about that one.”

Bonus chatter: What is that whole different story mentioned near the top?

Well, when the delivery service sent the latest version of the software to the Windows 95 team, they also provided an account number to use. My colleague used that account number to try to reproduce the problem, and since the problem occurred only after the order was submitted, she would have to submit delivery requests, say for a letter to be picked up from 221B Baker Street and delivered to 62 West Wallaby Street, or maybe for a 100-pound package of radioactive material to be picked up from 1600 Pennsylvania Avenue and delivered to 10 Downing Street.

After about two weeks of this, my colleague got a phone call from Microsoft’s shipping department. “What the heck are you doing?”

It turns out that the account number my colleague was given was Microsoft’s own corporate account number. As in a real live account. She was inadvertently prank-calling the delivery company and sending actual trucks all over the country to pick up nonexistent letters and packages. Microsoft’s shipping department and people from the delivery service’s headquarters were frantic trying to trace where all the bogus orders were coming from.

¹ Mind you, today this sort of thing is the stuff that average Joe customers can do while still in their pajamas, but back in those days, it was a feature that only top-tier customers had access to, because, y’know, mainframe.

Comments (41)
  1. Lars says:

    These can't be ordinary CPU trap handlers then? Otherwise the second exception should cause a Double Fault trap…? which, presumably, Windows would not allow an application to change.

    The third invocation would then cause a triple fault, which ultimately would reboot the system.

    [Operating systems typically do very little in the CPU trap handler. They transfer the work somewhere else. After all, it's totally expected that an application's divide-by-zero handler can take a page fault. Besides, if the callback ran in the trap handler, you wouldn't be able to longjmp back into application code. -Raymond]
  2. Joshua says:

    That story was so worth it.

  3. VinDuv says:

    I’m guessing the whole flaming death crash only happens on 3.1 and not 95, right?

  4. alegr1 says:

    A little bit of extra knowledge in wrong hands is a dangerous thing.

  5. morlamweb says:

    I see that you also resisted the temptation to put the word "developers" in quotes throughout the article.  I get the impression from this story that the "developers" approached this problem with no more skill than a young child uses a set of "My First Toolbox" plastic tools.  One wonders why they didn't fix the code properly rather than going with a global exception handler.  It would have made for a much shorter movie, but I doubt that those "devs" were thinking of that at the time.

  6. Fleet Command says:

    The case is nice but the writing style leaves much to be desired; the article is unduly long, and the metaphors are just plain weird. I advise considering the use of sub-section headings instead of strange movie metaphors which do not conjure up an image. Also, the part about freeing the DS register definitely needs more explaining.

    Oh, and footnote #1 is apparently not connected to anything. I searched for "1" in both the page and its web source code.

  7. Henke37 says:

    This article is of dailyWTF quality.

  8. Jimmy Queue says:

    That there is some epic fail of epic proportions. Imagine trying to do that in today's day and age of bright software engineers. I'm fairly sure there are ritualistic punishments of various formats specifically for when you fail this hard!

  9. Adam V says:

    @Fleet Command – footnote #1 is at the end of the very first paragraph.

  10. Wanna Bee says:

    EPIC!  What a great story.  Thank you for sharing.

  11. Fleet Command says:

    @Adam V: Yes. Added after my comment.

    [I think you merely overlooked it, because I have not edited this post. -Raymond]
  12. jader3rd says:

    Awesome! "Man, how are we going to get out of this hole?" "Just keep digging!"

  13. Guest says:

    @Fleet Command

    It was there before your comment. I came here at exactly 7am and read it as soon as it came up; the footnote marker was there in the first paragraph when originally posted.

  14. Jim says:

    Many commenters missed the key word "Mainframe". You do not know how hard it is to set up a test case with one; even excellent developers had a lot of trouble dealing with it.

  15. GrumpyYoungMan says:

    Good grief.  And people have the temerity to say we have bad software practices?

  16. JM says:

    …I can't stop crying. With laughter, mind you, with laughter.

    The final bit just nails it.

  17. not important says:

    It's amazing how many things had to go wrong for the system to crash. Like Raymond said in another post – the system has lots of redundancy built-in; sub-systems cover for each other and make up for errors elsewhere. And Raymond – if there is a "Swordfish 2" sequel you deserve to write the action scenes.

  18. morlamweb says:

    @Jim: the problem in this case was not the mainframe, but rather the programming practices used by the people who programmed the Windows app.  As I read it, the root cause was a bug in their GUI: one component expected a specific control to have focus, while in one corner case, another control had focus, which led to an unhandled null pointer.  Tell me, what about that involves a mainframe?  It looks to me like a Windows GUI programming bug compounded by ill-considered attempts at fixing it.

    [The point is that they didn't have a test deployment because the company could not afford to have two mainframes (one for production and one for test). All testing had to be done against live data, not even a test account! (See Bonus Chatter.) The only way to test the code was to place a real live order and then presumably cancel it immediately and hope the accounting department doesn't get mad at you. -Raymond]
  19. dave says:

    >(since that was how computers communicated with each other back then)

    Come now; they had X.25 in Canada (over leased lines, or however you say that in Canadian) in the late 1970s.

    [A delivery service customer is unlikely to have X.25. They had a telephone line and a fax machine. -Raymond]
  20. alexcohn says:

    @FleetCommand: «¹» character cannot be found with `findstr 1>`

  21. Fleet Command says:

    [I think you merely overlooked it, because I have not edited this post. -Raymond]

    True. I might have. Please consider that sentence recanted.

  22. Jim says:

    @morlamweb, at that time a mainframe would cost a fortune, never mind two. Setting up mirrored operations for testing would have been impossible as well. I am not saying that was good practice, but in the business world we all have to cut corners.

  23. jas88 says:

    The root problem here is the circular dependency: Microsoft being both a client of and a supplier to the delivery company. (Mostly tongue in cheek, but of course if MS hadn't already been genuine customers, they couldn't have assigned the genuine corporate account as the 'test' one!)

    I recall a story (DailyWTF?) about a test address – an online bookseller had a special test address and book title. For test purposes, they'd order a specific children's book, for a particular address; the logic further down the pipeline knew to ignore orders for that particular combination, so it was safe. Up until the marketing department started doing data mining, and stumbled across this hidden gem of a book in their catalog, which clearly had a huge following in that region…

  24. alegr1 says:

    @Jimmy Queue:

    >Imagine trying to do that in todays day and age of bright software engineers,

    Just this morning, someone posted a question titled "Make sure an address is valid" on osronline.com NTDEV forum. It presents an abomination of even greater scale.

  25. Mark VY says:

    I disagree with Fleet Command.  The movie metaphors worked great for me, to the point that I feel like I just watched the best trailer ever!

  26. alegr1 says:

    The movie metaphors are Titanics of metaphors!

  27. Fleet Command says:

    @alegr1: Titanic has a villain with desk and slow-motion scenes? Nice thing I decided to watch Phantom's Menace instead.

  28. Antonio Rodríguez says:

    Yes, this is an *epic* WTF. Layer after layer of disastrous decisions piled up to create a monster of incredible dimensions.

    @VinDuv: I guess it also killed Windows 95. Not only does the article hint at that, but many of Windows 95's system components (in particular, most of User and GDI) were 16-bit. Remember that the Windows 95 architecture is based on Windows 3.x Enhanced Mode, adapted to run 32-bit processes inside the system virtual machine (the one used for 16-bit tasks). User and GDI were actually 16-bit DLLs called by thunking from 32-bit processes, so if you managed to unload them from a 16-bit task, you were dead.

  29. Neil says:

    16-bit Windows did simulate local exception handling, but it was hardwired only to work for certain system DLLs (it was probably implemented using the global exception handler looking to see which DLL the exception happened in, and if it was a recognised DLL, performing some local exception handling) so the bogus part of the code was failing to let Windows terminate the process for you.

  30. Jon says:

    I still blame them for incompetent mainframe operations. IBM mainframes have had virtualization since 1972! And partitioning, physical or logical, has been around since then. After all, are they developing software right on their live system?

  31. Mark says:

    @Jon: given everything else we've learned about them so far, would you be shocked to find out they are?  I wouldn't even be surprised.

    There's actually a lighter weight solution: have the concept of a test account.  Any orders placed by a test account do not get fulfilled.

  32. Paul Coddington says:

    "There's several reasons why your application crashed, but let's just stick to the technical ones for now…"

  33. Jeff says:

    I want to believe that this story has at least some basis in reality, but then I see parts like:

           * "I'm totally making this up, by the way. The details of the scenario aren't important to the story."

           * "I don't know if Visual C++ actually had an Add a Windows hook wizard; they could just have copied the code from somewhere."

    So at least some of it is admittedly false. But are those the only parts that were embellished, or that are outright nonsense?

    While some totally true stories are quite unbelievable and absurd, and there's a lot of insight that can be obtained from them, fictional stories littered with inaccuracies or just plain misinformation are more harmful than helpful.

    - Jeff

  34. alegr1 says:

    @Jeff:

    Reverse rule 34: If you totally make up an engineering WTF, there is somewhere a real guy who committed that IRL.

  35. Daniel says:

    In your blog and book I have so often read that Microsoft helped big companies with debugging or by adding compatibility fixes to Windows. I wonder how much such support cost and whether it is still offered, or was only offered during the Windows 95 era.

  36. Daniel says:

    And another question: do the vendors give you the source code of their software or do you need to debug with debug symbols only (or worse, only by reverse engineering)?

  37. yuhong2 says:

    @Daniel: AFAIK even MS's own Office products don't provide debug symbols to the public.

  38. Cheong says:

    The Explosion MV was one of my favorite a few years ago. Bonus point for the ironic action they do at the end. XD

  39. GWO says:

    To paraphrase JWZ: you have an application problem – you decide to handle it using some combination of setjmp() and longjmp().  You now have two (or possibly uncountably many) problems.  Even thinking about using them should count as a code smell.  Actually using them is almost always an error, and often should be considered a disciplinary offense.

    / Exception: geniuses

    // Non-exception: Me (and, in all likelihood, you)

    /// Also between creat() and longjmp(), what did Krnghn & Rtchy have against vowels?

    //// (I assume JMP was an instruction on whatever machine (PDP-11??) they wrote C on).

  40. KJK::Hyperion says:

    @GWO: IIRC, standard C function names are short because K&R used a linker that only considered the first 6 characters of symbol names for uniqueness

  41. Malcolm says:

    @jas88: Who's Got The Monkey?

    And it was, indeed, the Daily WTF :)

    thedailywtf.com/…/Ive-Got-The-Monkey-Now.aspx
