This has been a busy couple of days for a few of us on the team as we had a report of a bug in Windows 7. The specifics of the issue are probably not as important as a discussion over how we will manage these types of situations down the road and so it seems like a good time to provide some context and illustrate our process, using this recent example.
This week a report on a blog described a crashing issue in Windows 7. The steps to reproduce the crash were pretty easy (1) run chkdsk /r on a non-system drive then crash after consuming system memory. Because it was easy to “reproduce”, the reports of this issue spread quickly. Subsequent posts and the comments across the posts indicated that the issue seemed to have been reproduced by others—that is the two characteristics of the report were seen (a) consumption of lots of memory and (b) crashing.
Pretty quickly, I started getting a lot of mail personally on the report. Like many of you, the first thing I did was try it out. And as you might imagine I did not reproduce both issues, though I did see the memory usage. I tried it on another machine and saw the same behavior. In both cases the machine functioned normally during and after the chkdsk. As I frequently do, I answered most of the mail I receive and started asking people for steps to reproduce the crash and to share system dump files. The memory usage did not worry me quite as much as the crash. I began having a number of interesting mail threads, but we didn’t have any leads on a repro case nor did we have a crash dump to work with.
Of course I was not the first Microsoft person to see this. The file system team immediately began to look into the issue. They too were unable to reproduce the crash and from their perspective the memory usage was by design and was a specific Windows 7 change for this scenario (the /r flag grabs an exclusive lock and repairs a disk and so our assumption is you’d really like the disk to be fixed before you do more stuff on the machine, an assumption validated by several subsequent third party blog posts on this topic). We cast the net further and continued looking for crash dumps and reports. As described below we have quite a few tools at our disposal.
While we continued to investigate, the mail I was getting was escalating in tone and more importantly one of the people I responded to mentioned our email exchange in a blog post. So in my effort to have a normal email dialog I ended up in the thick of the discussion. As I have done quite routinely during the development of Windows 7, I added a comment on the original blog (and the blog where this particular email friend was commenting) outlining the steps we are taking and the information we knew to date. Interestingly (though not unfortunately) just posting the comment drew even more attention to the issue being raised. I personally love being a member of the broader community and enjoy being a casual contributor even when it seems to cause a bit of a stir.
It is worth just describing the internal process that goes on when we receive a report of a crashing issue. Years ago we had one of two reactions. Either we would just throw up our arms and surrender as we had no hope of finding the bug, or we would drop everything and start putting people on airplanes with terminal debuggers in the hopes of finding a reproducible case. Neither of these is particularly effective and the latter, while very heroic sounding, does not yield results commensurate with effort. Most importantly while there might be a crash, we had no idea if that was the only instance or if lots more people were seeing or would see the crash. We were working without any data to inform our decisions.
With the internet and telemetry built into our products (not just Windows 7) we now have a much clearer view of the overall health of the software. So when we first hear a report of a crash we check to see if we’re seeing the crash happen on the millions of machines that are out there. This helps us in aggregate, but of course does not help us if a crash is one specific configuration. However, a crash that is one specific configuration will still show up if there is any statistically relevant sampling of machines and given the size of the user base this is almost certain to be the case. We’re able to, for example, query the call stacks of all crashes reported to see if a particular program is on the stack.
We have a number of tools at our disposal if we are seeing a crash in telemetry. You might have even seen these at work if you crash. We can increase (with consent) the amount of data asked for. We can put up a knowledge base article as a response to a crash (and you are notified in the Windows 7 Action Center). We can even say “hey call us”. As crazy as that one might sound, sometimes that is what can help. If a crashing issue in an already shipping product suddenly appears then something changed—a new hardware device, new device driver, or other software likely caused the crash to appear far more frequently. Often a simple confirmation of what changed helps us to diagnose the issue. I remember one of the first times we saw this was when one day unexpectedly Word started crashing for people. We hadn’t changed anything. It turned out a new version of a popular add-in released and the crash was happening in the add-in, but of course end-users only saw Word crashing. We quickly put up instructions to remove the add-in while in parallel working with the ISV to push out a fix. This ability to see the changing landscape, diagnose, and respond to a problem has radically changed how we think of issues in the product.
We are constantly investigating both new and frequently occurring issues (including crashes, hangs, device not found, setup failures, potential security issues, and so on). In fact we probably look into on the order of hundreds of issues in any given month as we work with our enterprise and OEM customers (and therefore hardware partners, ISVs, etc.). Often we find that issues are resolved by code changes outside core Windows code (such as with drivers, firmware, or ISV code). This isn’t about dodging responsibility but helping to fix things at the root cause. And we also make many code changes in Windows, which are seen as monthly updates, hotfixes, and then service pack rollups. The vast majority of things we fix are not applicable broadly and hence not released with immediate urgency—if something is ever broadly applicable we will make the call to release it broadly. It is very important for everyone to understand how seriously we take the responsibility of making sure there are no critical issues impacting a broad set of customers, while also balancing the volume of changes we push out broadly.
To be specific about the investigation around the chkdsk utility, let’s look at how we dove into this over the past couple of days. We first looked through our crash telemetry (both at the user level and “blue screen” level) and found no reported crashes of chkdsk. We of course look through our existing reports of issues that came up during the development of Windows 7, but we didn’t see anything at all there. We queried the call stacks of existing reported crashes (of all kinds, since this was reported) and we did not find any crashes with chkdsk.exe running while crashing. We then began automated test runs on a broad set of machines—these ran overnight and continued for 2 days. We also saw reports related to a specific hardware configuration, so we set up over 40 machines based on variants of that chipset, driver, and firmware and ran those tests. We were not hitting any crashes (as mentioned, the memory usage was already understood). Because some were saying the machines were non-responsive we also looked for that in manual tests and didn’t see anything. We also broadened this to request globally to Microsoft folks to try things out (we have quite a few unique configs when you think of all of our offices around the world) and so we had several hundred more test runs going. We also had reports of the crash happening when running without a pagefile—that could be the case, but that would not be an issue with this utility as any program that requests more memory than physically available would cause things to tip over and this configuration is not recommended for general purpose use (and this appears to be the common thread on the small number of non-reproducible crashes). Folks interested might read Mark’s blog on the topic of pagefiles in general. While we did not identify anything of note, that does not rule out the possibility of a problem but at this point the chances of any broad issue are extremely small.
In the meantime, we continue to look through external blogs, forums and other reports of crashes to see if we can identify any reproducible cases of this. While we don’t contact everyone, we do contact people if the forum and report indicate this has a good chance of yield. In all fairness, it probably doesn’t help us when there’s a lot of “smoke” while we’re trying to find the fire. We had a lot of “showstopper” comments piling on but not a lot of additional data including a lack of a reproducible case or a crash dump.
This type of work will continue until we have satisfied ourselves that we have systematically ruled out a crash or defined the circumstances where a crash can happen. Because this is a hardware/software related issue we will also invite input from various IHVs on the topic. In this case, because it is disk related we can’t rule out the possibility that in fact the disk was either failing or about to fail and the excessive use of the disk during a /r repair would in fact generate a failure. And while the code is designed to handle these failures (as you can imagine) there is the possibility that the specific failure is itself not handled well. In fact, in our lab (running tests continuously for a few days) we had one failure in this regard and the crash was in the firmware of the controller for the disk. Obviously we’ll continue to investigate this particular issue.
I did want folks to know just how seriously we take these issues. Sometimes blogs and comments get very excited. When I see something like “showstopper” it gets my attention, but it also doesn’t help us to have a constructive and rational investigation. Large software projects are by nature extremely complex. They often have issues that are dependent on the environment and configuration. And as we know, often as deterministic as software is supposed to be sometimes issues don’t reproduce. We have a pretty clear process on how we investigate reports and we focus on making sure Windows remains healthy even in the face of a changing landscape. With this post, I wanted to offer a view into some specifics but also into the general issue of sounding alarms.
It is always cool to find a bug in software. Whether it is an ATM, movie ticket machine, or Windows we all feel a certain sense of pride in identifying something that doesn’t work like we think it should. Windows is a product of a lot of people, not just those of us at Microsoft. When something isn’t as it should be we work with a broad set of partners to make sure we can effectively work through the issue. I hope folks recognize how serious we take this responsibility as we all know we’re going to keep looking at issues and we will have issues in the future that will require us to change the code to maintain the level of quality we know everyone expects of Windows.