Catching up - SBS Beta and Watson

Some people have been asking where the hell I've been. Well, life got a little busy. The short short version: I went home to the east coast, visited family, then drove cross-country with my girlfriend on a week-long trip. Came back, had family visit, dealt with a lot of stuff, got a dog, blah blah blah. The start of volleyball season didn't help either.

So, things have finally settled down a bit. On the SBS front, we are still working hard to provide SBS SP1 to our customers. Our plan is to release our beta around the same time as Windows Server SP1 RC. You should see details about how to request to be added to our Beta program in a few weeks.

On a separate front - as part of our SP1 goals we're looking at fixing crashes that occur on SBS. We look at 3 different types of data:

  • SBS binary user mode crashes
  • SBS platform user mode crashes
  • OS crashes on SBS (kernel mode)

Where do we get all this? You know that little dialog box that comes up when an app crashes or you kill a process, asking you to send info to Microsoft? We actually collect all that and examine causes and trend lines. The data is fascinating in its consistency: a small number of bugs always accounts for the vast majority of your crashes.

For example, we've had several thousand reported crashes of the binaries that the SBS team provides since SBS 2003 shipped. However, 80% of those are caused by 30 individual problems/bugs. So fix those 30, and you've solved a lot of pain.
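To make the 80%-from-30-bugs idea concrete, here is a rough sketch of how you could find the smallest set of top bugs that covers a given share of crashes. This is not our actual tooling: the function name `buckets_covering` and the crash counts are made up purely for illustration.

```python
from collections import Counter


def buckets_covering(counts, target_pct=80):
    """Return the top buckets needed to reach target_pct percent of all crashes.

    counts maps a bucket name to its number of reported crashes.
    Integer arithmetic is used for the threshold to avoid float rounding.
    """
    total = sum(counts.values())
    covered = 0
    chosen = []
    # Walk the buckets from most to least frequent.
    for bucket, n in Counter(counts).most_common():
        if covered * 100 >= target_pct * total:
            break
        chosen.append(bucket)
        covered += n
    return chosen, covered / total


# Hypothetical counts, NOT real SBS data.
sample = {
    "bug_a": 500, "bug_b": 200, "bug_c": 100, "bug_d": 50, "bug_e": 50,
    "bug_f": 40, "bug_g": 30, "bug_h": 20, "bug_i": 10,
}

top, share = buckets_covering(sample)
print(f"{len(top)} of {len(sample)} buckets cover {share:.0%} of crashes")
```

With the made-up numbers above, three of the nine buckets already cover 80% of the crashes, which is the same shape of curve we see in the real data.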

On the platform side, we look at all applications that have crashed while running on an SBS server. This gets a little harder to figure out, as you get a bunch of processes and you have to figure out who owns them. We're still sifting through it all, but we do see a few third-party utilities that consistently account for the most crashes. (The most common, accounting for 1/3 by itself, was a Windows security fix.)

Finally, you have kernel crashes (aka BSODs). These are awful because they bring down the whole machine, and we always want to try and fix them. Here again you get a distinct curve, but it's not as pronounced. To get people's thoughts on these, here are the top 15 BSODs on SBS (note that the comments are not mine, I'm not an expert on kernel debugging :-):

1) 0x7a_c000009a – this crash happens when the kernel gets STATUS_INSUFFICIENT_RESOURCES when attempting to read from the pagefile. This happens when the request failed because a filesystem failed to make forward progress.

2) IP_MISALIGNED – this crash happens when the CPU’s instruction pointer fails to point to the beginning of a valid instruction. Instead, it points into the middle of an instruction. This is believed to be a hardware problem, possibly caused by over-clocking, over-heating, or a power supply problem.

3) 0xD1_storport!RaUnitFlushQueueSrb+45 – this is a storport.sys bug that is fixed in Server SP1. A QFE is available. See https://oca.microsoft.com/en/Response.asp?SID=732

4) 0x7f_8 – this crash happens when the kernel stack is exhausted. The usual cause is too many filter drivers (anti-virus, quota, cd burning) installed on the system.

5) 0xCB_netbios!NbDeviceControl+133 – this is a netbios.sys bug that is fixed in Server SP1.

6) 0xB8_BUGCHECKING_DRIVER_AACMgt – This is a bug in Adaptec’s aacmgt.sys driver. Adaptec has a fix and OCA points customers to it.

7) 0x7E_TmXPFlt+b8db – This is a problem with Trend Micro’s common firewall module tm_cfw.sys. I am not aware of a solution from Trend.

8) 0x77_c000000e – this crash is caused by disk hardware errors that prevent reading from the pagefile.

9) 0xA_W_nt!MiRemovePageByColor+af – These are thought to be corruptions caused by hardware problems, but we do not fully understand these.

10) 0xA_nt!MiRemovePageByColor+68 – ditto #9.

11) 0x9C_IA32_GenuineIntel – This is a CPU machine check, i.e. the Intel CPU reports that it has detected some inconsistency and must shut down. These are hardware problems: over-heating, over-clocking, etc.

12) 0x77_c0000185 – this crash happens when the kernel gets STATUS_IO_DEVICE_ERROR when attempting to read from the pagefile. This is a disk hardware problem.

13) 0xC0000218_nt!CmpLoadHiveThread+16b – this is a crash during boot when the registry hive is corrupted. Some of these are caused by disk hardware problems. Others may have software causes.

14) OLD_IMAGE_IPVNMon.sys – this is a driver distributed by Internet Service Providers to monitor broadband line utilization. The driver had a serious bug that caused crashes on any multi-processor (or Hyperthreaded) system. A fixed driver is available from Visual Networks (and probably also from some ISPs). See https://oca.microsoft.com/en/Response.asp?SID=896.

15) 0xB8_BUGCHECKING_DRIVER_afamgt – this is a bug in the DELL perc2 driver (originally written by Adaptec). DELL has a fix available for distribution, and OCA points customers to it.
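If you want to decode the IDs in the list above yourself, here is a small sketch. It assumes (my assumption, not a documented format) that an ID starting with 0x is a bug check code followed by an underscore and some detail such as a status code or a module+offset. The names in the table are the standard Windows bug check names for the codes that appear above; `parse_bucket` and `describe` are hypothetical helpers, not part of any Microsoft tool.

```python
import re

# Standard Windows bug check names for the codes seen in the list above.
BUGCHECK_NAMES = {
    0x0A: "IRQL_NOT_LESS_OR_EQUAL",
    0x77: "KERNEL_STACK_INPAGE_ERROR",
    0x7A: "KERNEL_DATA_INPAGE_ERROR",
    0x7E: "SYSTEM_THREAD_EXCEPTION_NOT_HANDLED",
    0x7F: "UNEXPECTED_KERNEL_MODE_TRAP",
    0x9C: "MACHINE_CHECK_EXCEPTION",
    0xB8: "ATTEMPTED_SWITCH_FROM_DPC",
    0xCB: "DRIVER_LEFT_LOCKED_PAGES_IN_PROCESS",
    0xD1: "DRIVER_IRQL_NOT_LESS_OR_EQUAL",
}

# Assumed layout: "0x<hex code>" optionally followed by "_<detail>".
BUCKET_RE = re.compile(r"0x([0-9A-Fa-f]+)(?:_(.*))?$")


def parse_bucket(bucket):
    """Split an ID into (bug check code, detail); code is None if there is no 0x prefix."""
    m = BUCKET_RE.match(bucket)
    if m is None:
        return None, bucket
    return int(m.group(1), 16), m.group(2) or ""


def describe(bucket):
    """Return a readable one-line description of an ID."""
    code, detail = parse_bucket(bucket)
    if code is None:
        return f"{bucket} (no bug check code in the ID)"
    name = BUGCHECK_NAMES.get(code, "not in this table")
    return f"0x{code:X} {name} [{detail}]"


for b in ["0x7a_c000009a", "0xD1_storport!RaUnitFlushQueueSrb+45", "IP_MISALIGNED"]:
    print(describe(b))
```

IDs such as IP_MISALIGNED or OLD_IMAGE_IPVNMon.sys carry no bug check code, so the sketch just passes them through unchanged.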

Note that at least 5 have root hardware causes. Several are third-party drivers, and a couple are things that are slated to be fixed in Windows Server SP1. I'd be interested in thoughts people have on hardware issues: have people seen these problems? What types of hardware tend to cause this? Are there better recommendations we can make for hardware purchasing for SBS?

Please help us out by letting us know your thoughts/feedback. And be sure to always send that crash data to Microsoft so we can see which issues you're seeing and work on getting them fixed.

Thanks all

--charlie