Anatomy of a software bug, part 2 – the NT browser

Yesterday, I talked about the design of the NT browser service.

Today, I want to talk about a really subtle bug we ended up finding in the service (fixed long before we shipped NT 3.1).

As a brief refresher from yesterdays post, the NT browser was effectively a distributed single-master database system which was designed to run completely without administration.  All the machines that participated in the browsing architecture were elected to their position, the user wasn’t involved in that process.

The WfW browser used NetBIOS names to determine which machines had what role in the workgroup.  In general, the names followed a well established pattern of naming that was used for all the MS-NET products (since MS-NET 1.0 was introduced in 1983).  NetBIOS names are 16 byte flat names, in the MS-NET naming scheme, the last byte of the name was used for a signature, the first <n> bytes of the name were used for the computer name and the bytes between <n> and 15 were filled with 0x20 (space).  For example, the MS-NET server used <name>0x20 for the computer name.  MS-NET workstations used <name>0x00.

NetBIOS names come in two flavors: Unique and Group.  Unique names are guaranteed to be associated with a single computer on the network.  Group names are shared between multiple machines on the network.  Unique names receive unicasts (directed traffic), Group names receive multicasts (broadcasts).

For the browser, the master browser was identified because it had registered a NetBIOS name of <workgroup>0x1d.  The backup browsers and potential browsers all register the group name of <workgroup>0x1e.  When servers announce themselves, they send datagrams to <workgroup>0x1d.  There were other names used, and other functionality, but…

Ok, that’s enough background to describe the bug.

As I mentioned yesterday, we cooked the browser election algorithm to ensure that an NT machine would always win the browser election.  Unfortunately, when we started wide deployment of NT machines on the corporate campus, this wasn’t always the case.  We had tools that monitored the state of browsing in the most common domains on the network, and about once or twice a day, browsing would simply stop working on one or more of the domains.

The maddening thing was that this behavior was totally unreproducable – all we knew is that there was a WfW machine that had held onto the master browser name, and this WfW machine was preventing the NT machine from becoming the new master browser.  The NT machine was trying, but the WfW machine kept holding onto the name.  The really annoying thing was that the WfW machine had apparently forgotten that it was a master browser (even though it was holding onto the master browser name).

We gathered sniffs, we looked at code, we were clueless.

Eventually, after talking to the WfW team, we discovered the WfW bug that was causing it to forget that it had had the master browser name – essentially there was a code path that would cause it to think it had won the election, and it started to become the master browser.  If, during the process of registering the NetBIOS name for the master browser, it received an election packet that would cause it to lose the election, it stopped functioning as the master browser, but it forgot to relinquish the NetBIOS name.  So the browser application on WfW didn’t think that it owned the NetBIOS name, but the network transport on the WfW machine thought it owned the name.

Ok, we’d found the bug, and it was the WfW team’s bug.  Unfortunately, by this time, they’d already shipped, so they couldn’t fix their code (and it wouldn’t matter because there was a significant deployed base of WfW machines).  The thing is that they’d done a LOT of testing, and they’d never seen this problem.  So why was the NT browser exposing this?

Well, we went back to the drawing boards.  We looked over the NT browser election logic.  And we looked at it again.

And again.

We stared at the code and we just didn’t see it.

And then one day, I printed out the code and looked at it one final time.

And I saw what we’d missed in all those code reviews before.

You see, there was one other aspect of the election process that I didn’t mention before.  As a mechanism to ensure that elections damped down quickly (there’s a phenomenon called “livelock” that occurs with distributed election systems that prevents elections from finishing), there were several timers associated with the election process.  Once an election request was received, the master browser would delay for 200ms before responding, backup browsers would delay for 400ms and potential browsers would delay for 800ms.  This ordering ensured that master browsers would send their request first, thus ensuring that the election would finish quickly (because if there was a current master browser, it ought to continue to be the master browser).

Well, the code in question looked something like this (we all used text editors at the time, there weren’t any GUI editors available):

When we did our code reviews, this is what we all saw (of course this isn’t the real code, I just mocked it up for this article).

If I’d looked at the entire line, I’d have seen this:

Pseudo-browser source code showing incorrect timers for master browser elections

Note that the master browser case is using the backup browser timer, not the master browser timer.  It turns out that this was the ENTIRE root cause of the bug – because the master browser was delaying its election response for too long, the WfW machines thought they had won the election.  And they started to become masters, and during that process, they received the election packet from the master browser.  Which quite neatly exposed the bug in their code.  Even without the WfW bug, this bug would have been disastrous for the browsing system, because it would potentially cause the very livelock scenario the election algorithm was designed to remove.

Needless to say, we quickly fixed this bug, and deployed it in the next NT build, and the problem was solved.

So what are the lessons learned here?  Clearly the first is that code reviews have to be complete – if text is wrapping off the screen, it’s not guaranteed to be correct.  Also, distributed systems misbehave in really subtle ways – a simple bug in timing of a single packet can cause catastrophic behaviors.

Comments (17)

  1. G. Man says:

    Interesting, when I first saw that screenshot I thought you were going to say that the problem was in the chained if statements…

    if (IAmMasterBrowser())

    else if (IAmBackupBrowser())

    What if my status changed between evaluation of the "if" statements? You would end up not scheduling any election response.

    I think I have been trained by too many "whats wrong with this code" posts :)

  2. G Man: See the comment below the first screen shot: This is just a stupid little mockup – heck, the filename’s from the last "what’s wrong with this code" post :) It’s totally bogus.

    But good eyes, nontheless.

  3. Jeff Parker says:

    Thanks Larry, the first one left me going hmm where is the bug, this is more eveolution of code. But now I see the bug, Kind of a two fold bug, a bug in the code and a bug in the code review process itself not seeing every line completely, one of those gotchas. However I bet it never ever happened to you again.

    Did you ever run across this same scenario again later where people did the same thing in thier code reviews.

  4. Yup, I’ve never done that one again. That’s why I love this particular story, because there was both a really subtle defect (actually, two subtle defects) AND a breakdown in the process of detecting the defect.

  5. petal says:

    You wouldn’t happen to have a similar story about what was done to NetBT.SYS in NT4 SP6 that slowed down network connections for certain types of file operations? It was undone in SP6a but caused me some real headaches.

  6. Nope, sorry :( I didn’t work on netbt (and I wasn’t on NT in NT4, I left just about when NT 3.5 shipped)

  7. Richard says:

    So True Larry, I had an interesting bug which was causing a timing/delay issue similar to what you what you experienced. All I can say is that these bugs are tough to find – not very nice indeed.

  8. Steve says:

    I’d worry about a DoS problem here – all I have to do is hold on to a packet inside a switch for a measly 200ms and I can break the network?

    Not such a big deal since WfW is gone, and I assume the code path has been long since replaced, but the coding error in NT should only have caused a little bit of network inefficiency. Security is subtle sometimes.

  9. Steve, that’s possible, I don’t know – I do know that just like the rest of the system, it’s been thoroughly reviewed and re-reviewed. Also, the design HAS changed in the 11+ years since I worked on that code (thus you should make NO assumptions about the current design from this post).

  10. Interesting – I’ve found printing out sections of code to be a very useful debugging aid, but usually because I’ve found myself reading what I think the code *should* be, as opposed to what’s on screen.

    (Then again, if I’m not reading the screen properly, it’s probably the last few hours before a Uni assessment handin, and I’m not at my best anyway. Trying to debug hastily-written assembler on too much coke and no sleep in recent memory isn’t something I’d recommend to anyone…)

  11. We need more tabs in more editors. Like Visual Basic. In tab based coding changing the tab width allows lines to be seen. Changing it back reveals the structure again. Spaces are silly.

  12. Network Boy says:

    I would have thought that the incorrect delay would have been evident in the network sniffs as well as the code itself.

    Good to know that sniffs were taken though, as too often (IMHO) programmers of networked apps don’t use this valuable source of info about what’s _really_ happening!

  13. Also known as "Larry mounts a DDOS attack against every single machine running Windows NT" Or: No stupid