Anatomy of a software bug, part 2 - the NT browser

Yesterday, I talked about the design of the NT browser service.

Today, I want to talk about a really subtle bug we ended up finding in the service (fixed long before we shipped NT 3.1).

As a brief refresher from yesterdays post, the NT browser was effectively a distributed single-master database system which was designed to run completely without administration.  All the machines that participated in the browsing architecture were elected to their position, the user wasn't involved in that process.

The WfW browser used NetBIOS names to determine which machines had what role in the workgroup.  In general, the names followed a well established pattern of naming that was used for all the MS-NET products (since MS-NET 1.0 was introduced in 1983).  NetBIOS names are 16 byte flat names, in the MS-NET naming scheme, the last byte of the name was used for a signature, the first <n> bytes of the name were used for the computer name and the bytes between <n> and 15 were filled with 0x20 (space).  For example, the MS-NET server used <name>0x20 for the computer name.  MS-NET workstations used <name>0x00.

NetBIOS names come in two flavors: Unique and Group.  Unique names are guaranteed to be associated with a single computer on the network.  Group names are shared between multiple machines on the network.  Unique names receive unicasts (directed traffic), Group names receive multicasts (broadcasts).

For the browser, the master browser was identified because it had registered a NetBIOS name of <workgroup>0x1d.  The backup browsers and potential browsers all register the group name of <workgroup>0x1e.  When servers announce themselves, they send datagrams to <workgroup>0x1d.  There were other names used, and other functionality, but...

Ok, that's enough background to describe the bug.

As I mentioned yesterday, we cooked the browser election algorithm to ensure that an NT machine would always win the browser election.  Unfortunately, when we started wide deployment of NT machines on the corporate campus, this wasn't always the case.  We had tools that monitored the state of browsing in the most common domains on the network, and about once or twice a day, browsing would simply stop working on one or more of the domains.

The maddening thing was that this behavior was totally unreproducable - all we knew is that there was a WfW machine that had held onto the master browser name, and this WfW machine was preventing the NT machine from becoming the new master browser.  The NT machine was trying, but the WfW machine kept holding onto the name.  The really annoying thing was that the WfW machine had apparently forgotten that it was a master browser (even though it was holding onto the master browser name).

We gathered sniffs, we looked at code, we were clueless.

Eventually, after talking to the WfW team, we discovered the WfW bug that was causing it to forget that it had had the master browser name - essentially there was a code path that would cause it to think it had won the election, and it started to become the master browser.  If, during the process of registering the NetBIOS name for the master browser, it received an election packet that would cause it to lose the election, it stopped functioning as the master browser, but it forgot to relinquish the NetBIOS name.  So the browser application on WfW didn't think that it owned the NetBIOS name, but the network transport on the WfW machine thought it owned the name.

Ok, we'd found the bug, and it was the WfW team's bug.  Unfortunately, by this time, they'd already shipped, so they couldn't fix their code (and it wouldn't matter because there was a significant deployed base of WfW machines).  The thing is that they'd done a LOT of testing, and they'd never seen this problem.  So why was the NT browser exposing this?

Well, we went back to the drawing boards.  We looked over the NT browser election logic.  And we looked at it again.

And again.

We stared at the code and we just didn't see it.

And then one day, I printed out the code and looked at it one final time.

And I saw what we'd missed in all those code reviews before.

You see, there was one other aspect of the election process that I didn't mention before.  As a mechanism to ensure that elections damped down quickly (there's a phenomenon called "livelock" that occurs with distributed election systems that prevents elections from finishing), there were several timers associated with the election process.  Once an election request was received, the master browser would delay for 200ms before responding, backup browsers would delay for 400ms and potential browsers would delay for 800ms.  This ordering ensured that master browsers would send their request first, thus ensuring that the election would finish quickly (because if there was a current master browser, it ought to continue to be the master browser).

Well, the code in question looked something like this (we all used text editors at the time, there weren't any GUI editors available):

When we did our code reviews, this is what we all saw (of course this isn't the real code, I just mocked it up for this article).

If I'd looked at the entire line, I'd have seen this:

Pseudo-browser source code showing incorrect timers for master browser elections

Note that the master browser case is using the backup browser timer, not the master browser timer.  It turns out that this was the ENTIRE root cause of the bug - because the master browser was delaying its election response for too long, the WfW machines thought they had won the election.  And they started to become masters, and during that process, they received the election packet from the master browser.  Which quite neatly exposed the bug in their code.  Even without the WfW bug, this bug would have been disastrous for the browsing system, because it would potentially cause the very livelock scenario the election algorithm was designed to remove.

Needless to say, we quickly fixed this bug, and deployed it in the next NT build, and the problem was solved.

So what are the lessons learned here?  Clearly the first is that code reviews have to be complete - if text is wrapping off the screen, it's not guaranteed to be correct.  Also, distributed systems misbehave in really subtle ways - a simple bug in timing of a single packet can cause catastrophic behaviors.