I recently completed a year at Microsoft. I thought I’d take the chance
to write about some of the fun anecdotes from my one year here. Blood,
sweat, screams tears and joy – I’ve lived through them all in my short
Nothing in Microsoft is feared as much as a P0 bug.
P0, which stands for Priority 0, is the highest importance a bug can be
given. You’ll usually see a P0 bug if there is a build break i.e. for
some reason, the build didn’t complete or the smoke tests on the new
build didn’t pass.
For a Microsoft developer, life stops if you
get a P0 bug assigned to you. To quote a colleague, “a P0 means that
you drop whatever you’re doing, forget about the outside world, and
stop breathing even. The only thing that matters in your life now is
that P0 bug”.
As a Program Manager, dealing with coding bugs is
usually not a part of my day-to-day work– therefore I’ve never had to
deal with P0 bugs for quite a while. Until one late night in August.
2005 was a stressful period. We were in the final stages of shipping
Visual Studio 2005. This meant that we spent all our time fixing late
bugs and cleaning up the product to get it out the door for it’s late
At such a late point in the product’s cycle,
there was always the fear that a showstopper bug might come at you from
nowhere. Of course, there was no logical reasoning for this fear – the
number of bugs being found was dropping drastically. But still, there
was Murphy’s Law and the gnawing fear that something might dramatically
fail at the last moment.
This particular night, I was up late at
office looking at some bug. Finally done with that bug, I packed up my
stuff and prepared to head home when a developer on my team asked me if
I was free. He had a problem – he was investigating a couple of P0
issues and all of a sudden, a third P0 issue had been assigned to him.
the bug to myself, I unpacked my bags again and pulled up to my
computer. I wasn’t worried – a large number of P0 issues are false
alarms and are caused due to some weirdness in the machine running the
test or in the automated test code. I’ve rarely seen P0 bugs due to
actual issues in the code.
This bug sounded relatively benign.
The toolbox was failing to load its controls when you opened a device
project. My guess was that it was due to an unclean machine where
someone hadn’t uninstalled the previous raw build cleanly.
me explain some terminology here. When we did Whidbey, we had two
builds of Visual Studio. One was the layout build which we (and you)
use with the nice graphical installer and everything. The other build
was the ‘raw’ build which just registered the binaries, some debug
tools and registry keys. With raw builds, you could build with debug
information and with optimizations turned off and debug your
way through easily.
In this particular case, our internal
testing tools had just installed the raw build on this machine and were
running through their usually battery of tests. Everything was
hunky-dory until they tried to create a device project, at which point
everything came down in flames since there was nothing in the toolbox.
poked around in the debugger in a bit. After spending 30 minutes or so
in the debugger, I knew why the toolbox wasn’t getting populated. For
the simple reason that .Net Compact Framework wasn’t installed on the
This was weird – NetCF gets installed along with the rest of Visual Studio and there was no reason for it not to be installed.
broke into a cold sweat. Not finding NetCF was a pretty serious problem
– it would render useless most of the device stuff. Had somebody
checked in code which broke the setup? This was a ship stopper bug if I
had ever seen one.
I poked through the logs of the raw build’s
setup (essentially spew from a huge batch file) and the plot thickened.
The log file indicated that the NetCF setup had been run just like the
setup for every other part of Visual Studio. But for some weird reason,
the files weren’t there. It was as if the setup had never happened. I
looked through the event logs and they were a mystery – they told me
that NetCF’s installation was kicked off. But for some reason, they
never recorded the successful completion or any error.
rerunning setup and voila! – NetCF installed perfectly and all tests
passed successfully. I was confused – what had happened that first
time? I tried rerunning setup a few times but couldn’t spot the
problem. Beaten, I told the tester that I couldn’t ‘repro’ the bug and
went home – but not before leaving a note in the bug asking people to
assign any similar bugs to me. If this happened again, I wanted to know
I didn’t have to wait long. 24 hours later, there was
a mail in my inbox from another tester. A new build had exhibited the
same issue. The toolbox wasn’t loading controls.
myself down and logged in remotely into the test machine. The scene of
the crime was identical to the last time. No NetCF. Logs indicating the
successful installation of NetCF.
I started poking around in the
event logs of the machine. What was happening in the machine around the
time of the incident? Was there a hardware failure? Was there a network
The event logs didn’t reveal much. However, one entry
caught my eye – just before the entry documenting the NetCF setup,
there was an entry about a Systems Management Server operation on the
machine. SMS is how we deploy updates to Windows and other internal
tools inside Microsoft and they’re very routine around here. However,
the reason this caught my eye was due to the fact that I had seen a
similar entry in the previous machine.
On both machines, SMS was doing something a few minutes before NetCF’s setup was kicked off.
was the clincher – I now had a good hunch as to what was happening. I
checked in a piece of code to add logging of NetCF’s setup (basically
the verbose logging option of msiexec). My code also checked now to see
whether the setup failed and if so, it bailed out.
Hardly a day
later was my guess vindicated. NetCF installation failed on another
machine – but this time, my logs had documented the sequence of events.
those of you who know the innards of msiexec, you would know that only
one setup can happen at any point of time. Msiexec.exe tries to grab a
global mutex – if another instance is already holding the mutex, it
fails. The reasoning is simple – you don’t want two setup programs to
be making changes to the system at the same time as they could trample
all over each other.
Visual Studio 2005 setup is made up a few
monolithic MSIs and smaller MSIs such as NetCF which are installed as
part of the installation process. There is no one huge Visual Studio
2005 .msi file.
Some readers might have spotted the ending now.
What was happening was a freak occurrence. In the infinitesimally small
time gap between the end of the previous msi’s setup and the beginning
of NetCF’s setup, the machine was unlucky enough to encounter a setup
request from Systems Management Server. This was causing the NetCF
installation to fail and things went rapidly downhill from thereon. The
difference this time was that my log files had faithfully recorded all
I shot off a mail off to the folks running
the build machines to make sure the SMS updates happened before and we
thankfully never had to deal with this again. We never did find out why
this bug had never occurred earlier with other parts of setup. My guess
is that something in SMS was changed
Moral of the story – I
really don’t have a moral actually. Except maybe, that shipping a
product is an incredibly fun and stressful experience all at once.
[Please post comments at http://www.sriramkrishnan.com/blog/2006/07/p0.html ]
I recently completed a year at Microsoft. I thought I’d take the chance