The dreaded "beeping death"

Anyone who's been at Microsoft for long enough (long enough to use DOS on a day-to-day basis) remembers the deadly "beeping death".

The "beeping death" was an artifact of the MS-NET product that we deployed for networking here at Microsoft, and I was the developer responsible for the "beeping death".

What was the beeping death?  Well, it occurred because of a confluence of a whole lot of different parts of the networking system.

First, you need to understand how connection-oriented network protocols work (this is a VERY rough description).  When the client sends a request to the server, the server acknowledges receipt of the packet with a packet called an "ACK".  In addition, on a connection-oriented protocol, the connection itself is kept alive by the client periodically sending a message (called a "keep-alive") to the server - that way, even if there's no network traffic, the client will eventually discover if the server has gone away.
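To make the keep-alive idea concrete, here is a minimal sketch in Python (obviously not what MS-NET was written in).  The function name, the list of ACK results, and the "three misses in a row" threshold are all assumptions made up for illustration, not details of any real transport.

```python
def detect_dead_server(ack_results, max_missed=3):
    """Return the index of the keep-alive at which the client gives up, or None.

    ack_results: one boolean per keep-alive sent. True means the server
    ACKed it, False means no ACK arrived in time.
    max_missed: hypothetical threshold of consecutive misses.
    """
    missed = 0
    for i, acked in enumerate(ack_results):
        if acked:
            missed = 0          # any ACK proves the peer is still there
        else:
            missed += 1
            if missed >= max_missed:
                return i        # connection declared dead here
    return None                 # never saw enough consecutive misses


# Server answers twice, then goes away.
print(detect_dead_server([True, True, False, False, False]))  # 4
# Server keeps answering, so the client never notices a problem.
print(detect_dead_server([True] * 10))  # None
```

The second call is the important one for this story: as long as something keeps ACKing, the client has no reason to think anything is wrong.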

Secondly, the MS-NET product (and the DOS Lan Manager product after it) used the NetBIOS API layer to talk to the network adapter.  But in reality, it didn't.  The MS-NET product instead talked to an abstract networking API layer called the "session" layer, which was a part of the MS-NET product.  From an API standpoint, it was extremely similar to the NetBIOS API layer, but it wasn't quite the same.

Third, there were two different implementations of the session layer.  One (session.exe) was a sample version that was shipped with the OEM kit for MS-NET.  The other was called minses.exe.  Minses.exe provided a minimal session layer that was intended to interface with NetBIOS.  So it functioned as a mapping layer between the MS-NET components and the actual networking stack.

Now one of the cool features of minses was that on synchronous networking calls, minses would beep the PC speaker while the call was outstanding.  That would let the user know that the system was still thinking about their request, and it hadn't forgotten them.
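Roughly, the behavior looks like the loop below.  This is a sketch in Python, not the actual minses code: `wait_with_beep`, `is_done`, and `beep` are hypothetical names, and `max_polls` exists only so the example terminates.

```python
def wait_with_beep(is_done, beep, max_polls):
    """Poll is_done() up to max_polls times, calling beep() after each
    poll that finds the call still outstanding. Returns True if it finished."""
    for _ in range(max_polls):
        if is_done():
            return True
        beep()                  # still waiting, so let the user know
    return False


# The response shows up on the third poll.
beeps = []
replies = iter([False, False, True])
ok = wait_with_beep(lambda: next(replies), lambda: beeps.append("beep"), max_polls=10)
print(ok, len(beeps))  # True 2
```

Take away the cap on the number of polls and give it a call that never completes, and you have a machine that beeps until somebody reboots it.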

Fourth, the Microsoft corporate network at the time was (and still is, to my knowledge) the largest, most complicated corporate network on the planet.  We have branch offices in hundreds of countries, and there are hundreds of thousands of computers on the network (Raymond tells this story about the network back in the 1990s).  It's a REALLY big, really complicated network.  That means that there are a bazillion failure points on the network, which means that connectivity often went down.

And finally, the networking solution we used back then was based on Ungermann-Bass smart network cards.  These cards were pretty cool actually - when you started the system, the OS downloaded the entire network stack onto the card, which meant that system memory didn't get consumed by the networking stack.  With this fifth piece, the networking guys reading this should start saying "Uh oh"...

 

Now that I've set the stage for the confluence of features, let's see what happens when this system gets deployed in real life.

In the normal case, everything works fine - you never ever hear the beep, because responses come back before the beep comes out.   But that's the least interesting case (for networking environments, the normal case is usually profoundly uninteresting - it's when things start failing that things get exciting)...

And the "beeping death" scenario was no different - it gets interesting when you start looking at the ways that things can fail.

Let's consider some of the failure modes:

    1) Connectivity fails on an intervening network node between the client and the server.

In that case, the client hangs waiting on the network to time out.  This could take several seconds, sometimes even as much as a minute.  Bad, but not the end of the world, because the timeouts within the transport detect the connectivity problem and fail the request.

    2) The client crashes (this IS MS-DOS we're talking about).

In this case, the connection is held alive (remember - the actual network transport is running on the UB card, not taking up system memory), and that ties up some resources on the server, but it's still not the end of the world (from a client's perspective).

    3) The server computer is really busy.

In this case, the client waits until the server comes back to it.  That may take time and can be really annoying.

    4) The server crashes, or otherwise freezes (breaks into the kernel debugger, etc).

In this case, the server disappears.  If the timing of the request was right, the server's crash tears down the connections and the clients get networking failures.  If, on the other hand, you're unlucky, the network card might have received the client's request and handed it to the server, but the server hadn't responded to the client.  In that case, there were no outstanding network requests for the client.  Because the transport is running entirely on the network adapter, it has no way of knowing that the host operating system is dead.  The transport just sits there, quietly acknowledging the keep-alives.  It sits on the network card until the operating system is rebooted.  It can get even more heinous when the server process crashes and is restarted - in that case, the card would sometimes "forget" existing connections until the operating system was rebooted (or power was recycled on the server).

If you happened to be one of the poor clients stuck in this state, you sat there blocked on a synchronous network receive, waiting for the frozen server to respond to your request.  Since the server process was gone (or the machine was in the debugger, or...), the client never had an opportunity to detect the failure.  And, since DOS was a single-threaded operating system, and the networking requests were executed in the kernel, the user had no choice but to reboot their client.
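Here's a toy model of why the client could never tell.  It's a Python sketch with made-up names (`SmartCardServer`, `blocked_client`, `ticks`), and it assumes only what's described above: keep-alives are answered by the card, while real replies need the host.

```python
class SmartCardServer:
    """Hypothetical model: the transport lives on the card, the file
    server lives on the host."""

    def __init__(self):
        self.host_alive = True

    def keep_alive(self):
        # Answered by the card itself, so it doesn't depend on the host.
        return True

    def request(self):
        # Needs the host operating system to produce a reply.
        return "reply" if self.host_alive else None


def blocked_client(server, ticks):
    """Send one request, then wait up to `ticks` keep-alive intervals.
    Returns "reply", "connection lost", or "still waiting"."""
    reply = server.request()
    for _ in range(ticks):
        if reply is not None:
            return "reply"
        if not server.keep_alive():
            return "connection lost"   # never reached: the card always ACKs
    return "reply" if reply is not None else "still waiting"


healthy = SmartCardServer()
print(blocked_client(healthy, ticks=5))   # reply

frozen = SmartCardServer()
frozen.host_alive = False                 # host crashed, card still powered
print(blocked_client(frozen, ticks=1000)) # still waiting
```

The `ticks` limit is there only so the sketch stops; the real client had no such limit, which is where the beeping came in.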

 

I got indescribable amounts of flak for the beeping death, because it seemed that every time any server crashed anywhere at Microsoft, some set of clients would start beeping forever...  Fortunately, many of the changes I made for DOS Lan Manager 2.0 removed the beeping death (it allowed the client to detect a hung server and tear down the connection to the server even if the underlying network claimed that things were just fine).
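I won't reproduce the Lan Manager 2.0 changes here.  But one common way to get that kind of behavior is for the client to put its own deadline on each request instead of trusting the transport.  The Python sketch below shows that general idea only; `wait_for_reply`, `poll_reply`, `transport_ok`, and `deadline_ticks` are hypothetical names, and this is not a description of the actual fix.

```python
def wait_for_reply(poll_reply, transport_ok, deadline_ticks):
    """Wait for a reply, but give up after deadline_ticks polls even if
    the transport still claims the connection is healthy."""
    for _ in range(deadline_ticks):
        reply = poll_reply()
        if reply is not None:
            return ("reply", reply)
        if not transport_ok():
            return ("error", "connection lost")
    # The transport never complained, but the server never answered either.
    # The caller treats this as a hung server and tears the connection down.
    return ("error", "server not responding")


# Healthy server: the reply shows up on the third poll.
replies = iter([None, None, "data"])
print(wait_for_reply(lambda: next(replies), lambda: True, 10))
# ('reply', 'data')

# Hung server behind a card that keeps ACKing keep-alives.
print(wait_for_reply(lambda: None, lambda: True, 10))
# ('error', 'server not responding')
```

The point is that the decision to give up lives in the client, so it no longer matters what the network card on the other end is claiming.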

 

Networking can be fun :)