Debugging the mystery of the crashing desktop

My home desktop, running Vista, would crash with a blue screen every time I logged in at the main console after having logged in remotely via Terminal Services. This happened every time I connected remotely to the box, and a couple of weeks earlier I had found another way of crashing it: I could consistently crash the system by logging in as Guest and then logging in with my main account. Until today I never bothered to check why it crashed. I attributed the crash to some faulty driver that had not been updated for Vista, and I had a simple workaround of restarting the computer before disconnecting from my remote session (I know it's not ideal, but it worked for me). My desktop is a Gateway T3306, a "cheap" desktop that I purchased around Thanksgiving 2005. It was bare bones, and I beefed it up by adding extra memory and disk.

Last week I attended David Solomon's 5-day course on "Windows OS Internals", part of which covered debugging a system crash dump. After the exercise I decided to come home and check the dumps to see which driver was actually causing the crash.

I started up WinDbg and loaded the latest dump from the \Windows directory. The automatic analysis gave me the following output.

PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced. This cannot be protected by try-except,
it must be protected by a Probe. Typically the address is just plain bad or it
is pointing at freed memory.
Arguments:
Arg1: dd05221e, memory referenced.
Arg2: 00000000, value 0 = read operation, 1 = write operation.
Arg3: 90ea9301, If non-zero, the instruction address which referenced the bad memory
address.
Arg4: 00000002, (reserved)

FOLLOWUP_NAME: MachineOwner

SYMBOL_NAME: win32k!SearchIconCache+20

FAILURE_BUCKET_ID: 0x50_win32k!SearchIconCache+20

BUCKET_ID: 0x50_win32k!SearchIconCache+20

Followup: MachineOwner
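
As a rough illustration of how to read the four arguments in the output above, here is a minimal Python sketch that decodes them according to the descriptions the debugger printed. The function name and dictionary keys are my own and purely illustrative; they are not part of WinDbg.

```python
# Sketch: decode the arguments of bug check 0x50 using the meanings printed
# by the debugger above. Names here are illustrative, not a WinDbg API.

def decode_bugcheck_50(arg1: int, arg2: int, arg3: int, arg4: int) -> dict:
    operations = {0: "read", 1: "write"}  # Arg2: 0 = read, 1 = write
    return {
        "referenced_address": f"{arg1:08x}",            # Arg1: memory referenced
        "operation": operations.get(arg2, "unknown"),
        # Arg3: address of the instruction that touched the bad memory, if non-zero
        "faulting_instruction": f"{arg3:08x}" if arg3 else None,
        "reserved": arg4,                                # Arg4: reserved
    }

# Values copied from the analysis output above.
info = decode_bugcheck_50(0xDD05221E, 0x00000000, 0x90EA9301, 0x00000002)
print(info)
```

Read this way, the crash was a read of address dd05221e performed by the instruction at 90ea9301.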

The followup usually points to a driver if the debugger finds one in the call stack of the thread that hit the KiTrap0E error (the crash). MachineOwner usually means that it didn't find a driver in the stack. Walking the stack using the k command produced the following output.

kd> k
ChildEBP RetAddr
a0c9fc30 81c8fa74 nt!MmAccessFault+0x106
a0c9fc30 90ea9301 nt!KiTrap0E+0xdc
a0c9fcc4 90ea93bc win32k!SearchIconCache+0x20
a0c9fce4 90ea94ab win32k!_FindExistingCursorIcon+0x4a
a0c9fd50 81c8c96a win32k!NtUserFindExistingCursorIcon+0xe5
a0c9fd50 77ce0f34 nt!KiFastCallEntry+0x12a
WARNING: Frame IP not in any known module. Following frames may be wrong.
001be900 00000000 0x77ce0f34
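
To make the reading of this stack explicit, here is a small Python sketch that parses the k output and picks out the frame listed right after nt!KiTrap0E, which is the code that was interrupted by the page fault. This is only an illustration of how I read the stack by hand; it is not how the debugger's own analysis works, and the regular expression and function names are assumptions of mine.

```python
import re

# Stack text copied from the k output above.
K_OUTPUT = """\
ChildEBP RetAddr
a0c9fc30 81c8fa74 nt!MmAccessFault+0x106
a0c9fc30 90ea9301 nt!KiTrap0E+0xdc
a0c9fcc4 90ea93bc win32k!SearchIconCache+0x20
a0c9fce4 90ea94ab win32k!_FindExistingCursorIcon+0x4a
a0c9fd50 81c8c96a win32k!NtUserFindExistingCursorIcon+0xe5
a0c9fd50 77ce0f34 nt!KiFastCallEntry+0x12a
WARNING: Frame IP not in any known module. Following frames may be wrong.
001be900 00000000 0x77ce0f34
"""

# A frame line: ChildEBP, RetAddr, then module!function+offset or a bare address.
FRAME_RE = re.compile(r"^([0-9a-f]{8}) ([0-9a-f]{8}) (\S+)$")

def parse_frames(k_output: str) -> list:
    """Return the frame lines as dicts, skipping the header and warning lines."""
    frames = []
    for line in k_output.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            child_ebp, ret_addr, symbol = match.groups()
            frames.append({"child_ebp": child_ebp, "ret_addr": ret_addr, "symbol": symbol})
    return frames

def frame_after_trap(frames: list, trap: str = "nt!KiTrap0E"):
    """Return the frame listed immediately after the trap handler, if any."""
    for index, frame in enumerate(frames[:-1]):
        if frame["symbol"].startswith(trap):
            return frames[index + 1]
    return None

frames = parse_frames(K_OUTPUT)
faulting = frame_after_trap(frames)
print(faulting["symbol"])
```

Note that the return address recorded for the KiTrap0E frame, 90ea9301, is the same value as Arg3 of the bugcheck.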

So the stack points to a thread in win32k calling a function named SearchIconCache. I tried searching on the function name but didn't find anything. So I tried to see what it was doing with memory location dd05221e, mentioned in the first argument to the bugcheck. I unassembled at the return address of the KiTrap0E frame and got this.

kd> u 90ea9301
win32k!SearchIconCache+0x20:
90ea9301 663b461c cmp ax,word ptr [esi+1Ch]
90ea9305 754e jne win32k!SearchIconCache+0x74 (90ea9355)
90ea9307 f6462004 test byte ptr [esi+20h],4
90ea930b 7448 je win32k!SearchIconCache+0x74 (90ea9355)
90ea930d 668b461e mov ax,word ptr [esi+1Eh]
90ea9311 663b4704 cmp ax,word ptr [edi+4]
90ea9315 753e jne win32k!SearchIconCache+0x74 (90ea9355)
90ea9317 8d4614 lea eax,[esi+14h]

The first instruction compares ax with the 16-bit value at address esi + 1Ch. The ESI register held the value dd052202, and adding 1Ch gives dd05221e, which is the same address as the one in the bugcheck argument. The memory at dd05221e was not readable; the debugger showed only question marks for it.

kd> dd dd05221e
dd05221e ???????? ???????? ???????? ????????
dd05222e ???????? ???????? ???????? ????????
dd05223e ???????? ???????? ???????? ????????
dd05224e ???????? ???????? ???????? ????????
dd05225e ???????? ???????? ???????? ????????
dd05226e ???????? ???????? ???????? ????????
dd05227e ???????? ???????? ???????? ????????
dd05228e ???????? ???????? ???????? ????????
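
The address arithmetic is easy to double-check. The sketch below recomputes esi + 1Ch and tests whether the result falls in kernel address space. The 0x80000000 boundary is an assumption on my part: it is the default user/kernel split on 32-bit Windows and would be different on a system booted with an enlarged user address space.

```python
# Register value and displacement taken from the disassembly and text above.
esi = 0xDD052202
displacement = 0x1C

effective_address = esi + displacement
print(f"{effective_address:08x}")  # prints dd05221e

# Assumption: 32-bit Windows with the default 2 GB user / 2 GB kernel split.
KERNEL_SPACE_START = 0x80000000
is_kernel_address = effective_address >= KERNEL_SPACE_START
print(is_kernel_address)  # prints True
```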

The address dd05221e is in kernel-mode address space, which meant that some kernel-mode code was using a bad pointer, or that its data had been overwritten by a rogue driver; the memory could have been corrupted long before poor win32k tried to do what it was doing. The process running this thread at the time of the crash was mobsync.exe, and from running procmon on my live system I found that it was the Microsoft Mobile Sync service running within the "Plug n Play" svchost service. Apparently it kicks in when I open the "Sync Center" or plug in a Windows Mobile device, which I had done occasionally on this PC.

To confirm whether this process was the culprit each time my system crashed, I opened all the minidumps. By default Vista stores a minidump of every crash in the \Windows\Minidump folder but keeps overwriting the full kernel dump at \Windows\Memory.dmp. I was surprised to find that every minidump showed the same mobsync process as the culprit, with a call stack identical to the one in the kernel dump.

I was almost convinced to file a bug with the sync team and attach the kernel dump for reference, but I decided to check one more detail first. I read up on bug check code 0x50, and here is what the documentation had to say about it.

  • Cause

  • Bug check 0x50 usually occurs after the installation of faulty hardware or in the event of failure of installed hardware (usually related to defective RAM, be it main memory, L2 RAM cache, or video RAM).

  • Another common cause is the installation of a faulty system service.

  • Antivirus software can also trigger this error, as can a corrupted NTFS volume.

Great, the cause could be:

  • Faulty RAM (which I ruled out, as I had just had a faulty-RAM crash at work and the blue screen for that was completely different; I also confirmed this by looking at the BIOS log, which usually mentions which DIMM faulted)
  • Video RAM
  • A system service (which mobsync.exe is)
  • Antivirus (I had to rule out the service and video RAM before trying this out)

My video card is an integrated S3/Via UniChrome Pro from S3Graphics.com, and searching their site for updated Vista drivers yielded no results. That leaves me with two options: spend $$ on a video card and hope the problem goes away, or try to confirm the issue with the mobsync team before spending the money. I think I am going to opt for the second option. I plan on turning on Driver Verifier for the video driver and then trying to reproduce the crash to see if Driver Verifier catches it; if that doesn't help, I will file the bug and wait. I will post an update later on how my Driver Verifier experiment went.

Bottom line: I got more out of the "Windows OS Internals" class than I had initially hoped, and I would strongly recommend it, especially for people working on drivers. Mark Russinovich made a guest appearance at the class, talking about UAC, and impressed everyone with his brilliance.

Maheshwar Jayaraman