Larry and the "Ping of Death"


Also known as “Larry mounts a DDOS attack against every single machine running Windows NT”

Or: No stupid mistake goes unremembered.

 

I was recently in the office of a very senior person at Microsoft debugging a problem on his machine.  He introduced himself, and commented “We’ve never met, but I’ve heard of you.  Something about a ping of death?”

Oh. My. Word.  People still remember the “ping of death”?  Wow.  I thought I was long past the ping of death (after all, it’s been 15 years), but apparently not.  I’m not surprised when people who were involved in the PoD incident remember it (it was pretty spectacular), but to have a very senior person who wasn’t even working at the company at the time remember it is not a good thing :).

So, for the record, here’s the story of Larry and the Ping of Death.

First I need to describe my development environment at the time (actually, it’s pretty much the same as my dev environment today).  My primary development machine ran a version of NT, and on it I ran a kernel debugger connected to my test machine over a serial cable.  When my test machine crashed, I would use the kernel debugger on my dev machine to debug it.  There was nothing debugging my dev machine, because NT was pretty darned reliable at that point and I didn’t need a kernel debugger 99% of the time.  In addition, the corporate network wasn’t a switched network – as a result, each machine received datagram traffic from every other machine on the network.

 

Back in that day, I was working on the NT 3.1 browser (I’ve written about the browser here and here before).  As I was working on some diagnostic tools for the browser, I wrote a tool to manually generate some of the packets used by the browser service.

One day, as I was adding some functionality to the tool, my dev machine crashed, and my test machine locked up.

*CRUD*.  I can’t debug the problem to see what happened because I lost my kernel debugger.  Ok, I’ll reboot my machines, and hopefully whatever happened will hit again.

The failure didn’t hit, so I went back to working on the tool.

And once again, my machine crashed.

At this point, everyone in the offices around me started to get noisy – there was a great deal of cursing going on.  What I’d not realized was that every machine had crashed at the same time as my dev machine had crashed.  And I do mean EVERY machine.  Every single machine in the corporation running Windows NT had crashed.  Twice (after allowing just enough time between crashes to allow people to start getting back to work).

 

I quickly realized that my test application was the cause of the crash, and I isolated my machines from the network and started digging in.  I quickly root caused the problem – the broadcast that was sent by my test application was malformed and it exposed a bug in the bowser.sys driver.  When the bowser received this packet, it crashed.

I quickly fixed the problem on my machine and added the change to the checkin queue so that it would be in the next day’s build.
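
The post doesn't show the malformed packet or the faulty driver code, so the following is only a hypothetical sketch of the general class of fix: a receiver that checks every length field against the size of the datagram it actually received, and rejects the packet instead of trusting it. The three-byte header layout (an opcode and a name length) and the names `parse_announcement` and `MalformedDatagram` are invented for illustration; they are not the real browser protocol or the bowser.sys code.

```python
import struct


class MalformedDatagram(ValueError):
    """Raised when a received datagram fails validation."""


# Hypothetical wire format (an assumption, not the real browser protocol):
# 1-byte opcode, 2-byte little-endian name length, then the name bytes.
HEADER = struct.Struct("<BH")


def parse_announcement(datagram: bytes) -> tuple[int, str]:
    """Parse a datagram, validating lengths before touching the payload."""
    # Reject anything too short to even contain the fixed header.
    if len(datagram) < HEADER.size:
        raise MalformedDatagram("datagram shorter than header")

    opcode, name_len = HEADER.unpack_from(datagram, 0)

    # Never trust a length field supplied by the sender: check it
    # against the number of bytes that actually arrived.
    end = HEADER.size + name_len
    if end > len(datagram):
        raise MalformedDatagram("declared name length exceeds datagram size")

    try:
        name = datagram[HEADER.size:end].decode("ascii")
    except UnicodeDecodeError as exc:
        raise MalformedDatagram("name is not ASCII") from exc

    return opcode, name
```

A receiver written this way discards a bad broadcast with an error rather than reading past the end of its buffer, which in a kernel driver can be the difference between a dropped packet and a crashed machine.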

 

I then walked around the entire building and personally apologized to every single person on the NT team for causing them to lose hours of work.  And 15 years later, I’m still apologizing for that one moment of utter stupidity.

Comments (34)

  1. MSDN Archive says:

    Ah, but you *did* uncover the bug, and probably saved billions from losses due to maliciously malformed packets.

    Though it does bring up the idea of isolated networks for stuff like this.

  2. Anonymous says:

    > I quickly root caused the problem – the broadcast that was sent by my test application was malformed and it exposed a bug in the bowser.sys driver.  When the bowser received this packet, it crashed.

    Bowser.sys? There’s a whole *driver* dedicated to dogfooding?

  3. I thought I’d done the story of the name of the bowser before.  It’s because the driver is "such a dog" :).  My boss at the time had a colorful way with names.

  4. Anonymous says:

    Sounds like you’re being harsh on yourself. Can’t see anything you did as being stupid – it wasn’t your fault that bowser.sys was buggy and caused OS crashes. (unless you also wrote that).

    You sent out a malformed packet. Whoop-de-do. The network should be able to handle that.

    The only possible reason you might have to be hard on yourself is the "doing it again" thing. But 1) you didn’t cause people to lose much work there ‘cos they’d only just rebooted from last time, and 2) I don’t think spotting cause and effect from the first time around is something that would be expected. First time might be a coincidence. Simultaneous crashes on your machines due to an unrelated local other cause (power fluctuations in your office?).

    Nah, that’s not stupidity.

    Now, going round apologising and letting everyone think it was your fault – that might have been a little foolish 🙂

  5. Karellen: I wrote bowser.sys too.  

    Actually a single failure would have been excused.  Stuff does happen, and we all know that.

    The reason this became a legend was that I did it a second time.

    And that was inexcusable.

  6. Anonymous says:

    Doesn’t a story like this belong in Us Magazine though, in the "They’re Just Like Us" section?  I want to see a picture of Larry with a big caption saying, "THEY BRING DOWN ENTIRE CORPORATE NETWORKS!"

  7. Anonymous says:

    Technically, wouldn’t this be a plain old DOS attack rather than a DDOS attack?  From what you wrote, the PoD packets were from a single source (your machine) so they weren’t really "distributed".

  8. Chris: I was wondering if someone would think of that.  I figured it was "distributed" because one packet sent from my dev machine was distributed to several thousand other machines and crashed them all.

  9. Anonymous says:

    Well, I’m no expert but as I understand it, back in Ye Olden Days, the conventional way to carry out a denial of service attack was to subvert a powerful machine with a big internet pipe and use it to launch a flood of traffic at the target computer.  Two problems with this: first, as the computers people were trying to take down with DoS attacks got more powerful, eventually becoming services running on multiple computers, it got harder and harder to find a computer big enough to overwhelm them.  There isn’t a single computer in the world powerful enough to DoS Google, for instance.  Second, a single source attack is relatively easy to deal with.  While there are methods of disguising the origin of a DoS attack (forging information on the packets, for instance) it’s still possible to trace such a big flood of packets back to the origin.  That means most DoS attacks could be dealt with by either getting the owner to clean out the subverted system, or getting its ISP to filter the traffic or shut down their connection entirely.  

    These days, rather than using one big system, attackers subvert a lot of systems into a botnet (including desktop machines as well as big servers), often using viruses, worms, trojans, or other automated mechanisms, and use them to launch a coordinated DoS attack.  This sort of Distributed Denial of Service attack is a lot harder to stop.  Each machine is sending out less traffic, so they’re harder to trace back.  Even if you can, there are so many of them that tracking down each one and dealing with the owner or ISP is effectively impossible.  This makes DDoS attacks much harder to combat than old style single-machine DoS attacks.  It also scales to attack websites and services that have far too much hardware behind them to be brought down by a single machine trying to DoS them.  Now virtually all denial of service attacks are distributed.

  10. Anonymous says:

    I guess it would be a reverse DDoS attack, given that a normal DDoS is a bunch of machines bringing down one.

  11. Anonymous says:

    Not quite the same thing, but when I was testing Winsock, I used JamesG’s harness api tester, on what I mistakenly believed to be my office isolated network.  Hey, I was curious about how the competitor’s TCP/IP stacks would handle it.  

    Buildings 1-4 had problems keeping up with the "very large" broadcast packet.  I told my test manager and PM about it, and they both agreed that the incident should be forgotten asap and never brought up again.

    Shame on me, and I quickly removed all of my office test machines in the lab.

  12. Anonymous says:

    > The reason this became a legend was that I did it a second time.

    > And that was inexcusable.

    But that is excusable, and enormously important.  The first time you did it, you didn’t know.  The second time you did it, again you didn’t know at first, but when you knew about it, you released a fix.  Your fix eventually reached millions of customers, right?  The only surprising part of this is that Microsoft didn’t fire you for making a fix that eventually reached millions of customers.  Outside of Microsoft, you’d be a hero.

    Compare that to the Excel bug, where the typically Microsoftian decision was to not release a hotfix.  Someone must have got a big bonus for deciding not to release that hotfix.

    The way to get memories of that event to be forgotten would be to store them on hard drives partitioned by Windows.  That’ll get all those memories wiped out.  Still.  Thank you for bucking Microsoft’s system and getting your fix out the door.

  13. Norman: Huh?  The Excel guys issued a hotfix ASAP.  And this was way early in the development process (years before we shipped).

  14. Anonymous says:

    > The Excel guys issued a hotfix ASAP.

    Last I saw, Microsoft wasn’t distributing the hotfix but was considering including it in a service pack.

  15. Anonymous says:

    Sorry, I see it is published, just not automatically updated by automated tools.  Sorry.

    http://support.microsoft.com/default.aspx/kb/943075/

  16. Anonymous says:

    Larry, I have to be honest, I’m glad that Windows Vista shipped with WDS, it seems to be completely stable, quick, and the UI is asynchronous (even when enumerating old NT Browser systems).

    The instability and synchronous enumeration of the old browser list caused lots of application freezes on old versions of Windows (e.g. a Save File dialog in an MS Office application when the user wanted to store the file on a server). Some people blamed the network, others blamed their "slow" computer… 😉

  17. Mike Dimmick says:

    I recently saw an oddity on a colleague’s PC running Windows XP: network name lookup (i.e. Start > Run > \\servername) had completely stopped working.

    When we looked at netdiag /test:winsock /v, it showed that there were a HUGE number of registered NetBT bindings, over 200. This is because he uses the laptop for commissioning Windows Mobile 5.0 devices, i.e. installing software on them then shipping them to the customer. ActiveSync in WM 5.0 is implemented using RNDIS – the device emulates a USB-connected network adapter. Each different device has its own serial number, so USB sees it as a different device. Guess what happens after you’ve plugged 100 different devices into the computer? You have 100 network adapters, bound to both TCP and UDP. Windows doesn’t clean them up because they might eventually come back.

    The workaround was to set the DEVMGR_SHOW_NONPRESENT_DEVICES environment variable, launch Device Manager, select View/Show Hidden Devices and delete every one of the ‘Windows Mobile-based Device #nnn’ devices under Network Adapters. Having done this, file sharing suddenly started working again.

    I’d better do this soon on my PC, I’m up to Device #48. Anyone know of an automated way to delete these devices?

    (Sorry, Larry, I know it’s a bit tangential, is bowser involved in any way?)

  18. Mike: Not to my knowledge.  The browser is disabled by default on XP as far as I know.

  19. Anonymous says:

    Here is the story of my own DoS attack.

    We have a series of computers we use to do distributed resource builds.  These computers take the raw game files (textures, models, etc.) and process them, making them ready for the game.

    We had just made a series of improvements to the performance of the system and released the new software.  That night we got a "nice" email from IS saying they had shut down our build servers because they had taken down the phone system.

    What?

    It turns out that the programs we use to process the data contained diagnostic code that sent around 40-50 UDP broadcast packets every time the program started.

    Oh, did I mention that these build computers are all high speed multi-cpu, multi-core computers.

    Oh, did I mention that the programs to process the data only take a very short amount of time so they get run a LOT.

    Oh, did I mention that these high speed computers were all sitting in the server room on a 1GB network?

    🙂

    I did a lot of apologizing for taking down the company phone system.

  20. Anonymous says:

    I read bowser.sys and thought, "King Koopa has now invaded my OS kernel! All hope is lost!"

  21. Anonymous says:

    @Mike: One thing to try is to add a registry key to:

    HKLM\System\CurrentControlSet\Control\UsbFlags

    with Value name: IgnoreHWSerNumVVVVPPPP and Value DWORD:0x1

    Where VVVV = USB Vendor ID in Hex

    PPPP = USB Product ID in Hex

    This key prevents the USB layer from creating individual per-serial-number nodes under HKLM\System\CCS\Enum\USB. You will have to reboot after this change. Note that the Found New HW Wizard will no longer prompt you for the driver for each newly found device after this change.

    I’m not sure about the exact scenario that you’re describing, but if the mechanism relies on the USB serial number (as opposed to the MAC address in the USB network adapter) it might help. (Our HW has a USB serial number, and in production testing, the registry quickly fills up with the Enum\USB nodes for each device connected if you do not use this key…)

    Larry: Sorry for the totally-off-topic.

  22. Anonymous says:

    I hope that it’s just good natured ribbing. After all, most developers probably wouldn’t have known what was happening and would have just continued. Someone from IT would have had the unenviable task of tracking down the source of the disruption. Now -that- would have been embarrassing.

    The way brains focus in on the task at hand, it’s not surprising that you didn’t catch it the first time. You have to step out of the box you’re in and change your context.

    Again, as long as it’s good natured, it’s fine to keep bringing it up though. That’s what good friends are for. 😉

  23. nathan_works says:

    Matt, Glad I wasn’t the only one thinking Mario Bros.

  24. Anonymous says:

    OK, if we’re discussing our own DoS tales, I’ll tell mine.

    The first time I configured a corporate intranet, I made two DNS servers query each other first and then query the ISP.  So if one of them received a query from an ordinary client, then a chain reaction started with each server querying the other back and forth and both of them sending queries to the ISP until they finally got an answer back.  After a while things settled down.  When I figured out what was happening, first I fixed it, and then I asked the ISP if maybe the reason why things settled down might be that they blacklisted us.  They said no, they hadn’t observed any problem.  Whew.  Anyway it lasted less than an hour and I figured out a less recursive configuration.

  25. Anonymous says:

    Sounds like some kind of epic adventure inside Microsoft:

    Deep in the bowels of Microsoft is a lone programmer, sparring with a particularly merciless code fault. Long ago the daylight had forsaken him; the cold night was without stars and moon; he slowly began to sink into the dreary gloom of despair.  His mood worsened towards the brink of failure.

    As the night wore on, a minstrel came forward and proclaimed, "I will sing to you of Larry of the Third NT, and the Ping of Death."

    And when he heard that he laughed aloud for sheer delight, and he stood up and cried "O great glory and splendour! And all my wishes have come true!" and then he wept.

  26. Anonymous says:

    Heh, that sounds like something that happened to me back in high school. Only it might not have been an accident. It might have been a Perl script, running on a secret Linux server, iterating over the school’s IP range. It might have been pinging each address with a malformed packet and it may have bluescreened every Windows 9x computer in the school. However, that is just wild speculation on my part. Nothing that I know anything about.

  27. Tanveer Badar says:

    LOL!

  28. Anonymous says:

    I still remember testing an early beta of Operations Manager (which later became MOM and is now OpsMgr – but it was still Mission Critical Software’s at that time) that had a bug: instead of notifying the network administrator with a NET SEND, it would notify EVERY SINGLE USER in the domain. So when we tested it in the production environment, it flooded everybody in the company with alert popups…. OK, it did not actually crash anything, but still… the CEO of the company I was working at did not quite like that either…

  29. Anonymous says:

    heh! i brought down my corporate network one day, crashing every Win 3.1 machine… probably 30 or so people.

    We had BNC cabling (was that 10-baseT? I forget) configured as a ring with every machine on it… I was playing with a screwdriver in my machine (putting an 8 port serial card in) and accidentally shorted the network… Immediate swearing including the ferociously bad tempered and intimidating CEO (at the time, I was 21) who stormed out of his office swearing "Who the @#$% did that! What the $^%^ caused that".

    He then saw me with screwdriver in hand… "Was that you? Do you know how much %^&ing work I’ve lost?" Fortunately another guy I worked with, who i hadn’t liked very much until that point said, "Nope, wasn’t him, must’ve been the Novell server crashing. It does that sometimes."

    Ah, fond memories!

  30. Anonymous says:

    Here’s a great anecdote " Larry and the Ping of Death " from Larry Osterman, if you’re not subscribed

  31. jackbond says:

    How long afterwards did it take for MSFT IT to call Cisco for some switches? Seems like it should have been the IT department apologizing.
