Also known as “Larry mounts a DDOS attack against every single machine running Windows NT”
Or: No stupid mistake goes unremembered.
I was recently in the office of a very senior person at Microsoft debugging a problem on his machine. He introduced himself, and commented “We’ve never met, but I’ve heard of you. Something about a ping of death?”
Oh. My. Word. People still remember the “ping of death”? Wow. I thought I was long past the ping of death (after all, it’s been 15 years), but apparently not. I’m not surprised when people who were involved in the PoD incident remember it (it was pretty spectacular), but to have a very senior person who wasn’t even working at the company at the time remember it is not a good thing :).
So, for the record, here’s the story of Larry and the Ping of Death.
First I need to describe my development environment at the time (actually, it’s pretty much the same as my dev environment today). My primary development machine ran a version of NT, along with a kernel debugger connected to my test machine over a serial cable. When my test machine crashed, I would use the kernel debugger on my dev machine to debug it. There was nothing debugging my dev machine, because NT was pretty darned reliable at that point and I didn’t need a kernel debugger 99% of the time. In addition, the corporate network wasn’t a switched network, so each machine received datagram traffic from every other machine on the network.
Back in those days, I was working on the NT 3.1 browser (I’ve written about the browser before). As I was working on some diagnostic tools for the browser, I wrote a tool to manually generate some of the packets used by the browser service.
One day, as I was adding some functionality to the tool, my dev machine crashed, and my test machine locked up.
*CRUD*. I can’t debug the problem to see what happened, because the kernel debugger went down with my dev machine. OK, I’ll reboot my machines, and hopefully whatever happened will hit again.
The failure didn’t hit, so I went back to working on the tool.
And once again, my machine crashed.
At this point, everyone in the offices around me started to get noisy, and there was a great deal of cursing going on. What I’d not realized was that every machine had crashed at the same time as my dev machine. And I do mean EVERY machine. Every single machine in the corporation running Windows NT had crashed. Twice (with just enough time between crashes for people to start getting back to work).
I quickly realized that my test application was the cause of the crash, so I isolated my machines from the network and started digging in. The root cause turned up fast: the broadcast sent by my test application was malformed, and it exposed a bug in the bowser.sys driver. When the bowser received this packet, it crashed, and because the packet was a broadcast, every NT machine on the network received it.
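For illustration only: the details of the actual bowser.sys bug aren’t described here, so what follows is a generic sketch of the class of problem, not a reconstruction of it. It shows a receiver that validates a length field from the wire before using it, and drops anything that doesn’t add up. The packet layout (one opcode byte, one name-length byte, then the name) and the function names are hypothetical, and have nothing to do with the real browser protocol.

```python
import struct


class MalformedPacket(ValueError):
    """Raised when a datagram fails validation."""


def parse_announcement(data: bytes) -> tuple:
    # Hypothetical layout: opcode (1 byte), name length (1 byte), name bytes.
    if len(data) < 2:
        raise MalformedPacket("header truncated")
    opcode, name_len = struct.unpack_from("BB", data, 0)
    # Never trust a length field from the wire: check it against
    # the number of bytes that actually arrived.
    if name_len > len(data) - 2:
        raise MalformedPacket("name length exceeds payload")
    name = data[2:2 + name_len].decode("ascii", errors="replace")
    return opcode, name


def handle_datagram(data: bytes) -> str:
    # A bad packet is dropped instead of taking the receiver down.
    try:
        opcode, name = parse_announcement(data)
    except MalformedPacket:
        return "dropped"
    return f"opcode={opcode} name={name}"
```

A well-formed packet such as `bytes([1, 4]) + b"TEST"` is parsed normally, while one that claims a 200-byte name but carries only four bytes is dropped. In a kernel driver the equivalent of the missing check is a read past the end of a buffer, which is a crash rather than an exception.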
I fixed the problem on my machine and added the change to the checkin queue so that it would be in the next day’s build.
I then walked around the entire building and personally apologized to every single person on the NT team for causing them to lose hours of work. And 15 years later, I’m still apologizing for that one moment of utter stupidity.