Help! My Server is Shutting Down for No Apparent Reason

Hello - Rob here with the GES team, with a nugget to pass on to you. I recently worked an issue where a Windows server rebooted intermittently for no apparent reason. The Windows System Event log yielded no clues other than this Event ID 6008:

 

Log Name: System.evt

Source: EventLog

Date: 25-8-2008 19:06:58

Event ID: 6008

Task Category: None

Level: Error

Keywords: Classic

User: N/A

Computer: A2A000001

Description: The previous system shutdown at 6:54:04 PM on 8/25/2008 was unexpected.

 

There were no other symptoms or patterns that the unexpected shutdowns could be related to. A shutdown could occur at any time of day. Eventually we attached a debugger to see if we could catch anything, but this wasn’t successful. Next we looked at the manufacturer’s error-logging mechanism and found this piece of information:

 

An Unrecoverable System Error has occurred (Error code 0x0000002D, 0x00000000)

 

Note - each vendor has their own way to handle error codes. We noticed a one-to-one relationship between the vendor error above and the Event ID 6008 messages in the Windows System Event log. So we engaged the hardware vendor, who determined that this error code indicated an error on the PCI bus. They also informed us that this kind of error asserts an NMI (non-maskable interrupt) on the bus.
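When checking this kind of one-to-one correspondence across many events, it can help to pull the shutdown time out of each Event ID 6008 description so it can be compared with the vendor log's timestamps. The following is a minimal sketch, assuming the description uses the US-style time and date format shown in the event above. The function names and the five-minute tolerance are hypothetical choices for illustration, not part of any Windows or vendor tool.

```python
import re
from datetime import datetime, timedelta

# Matches the time and date in text such as:
# "The previous system shutdown at 6:54:04 PM on 8/25/2008 was unexpected."
_SHUTDOWN_RE = re.compile(
    r"shutdown at (\d{1,2}):(\d{2}):(\d{2}) (AM|PM) on (\d{1,2})/(\d{1,2})/(\d{4})"
)


def parse_unexpected_shutdown(description):
    """Return the shutdown time from a 6008 description, or None if absent."""
    m = _SHUTDOWN_RE.search(description)
    if m is None:
        return None
    hour, minute, second = int(m[1]), int(m[2]), int(m[3])
    month, day, year = int(m[5]), int(m[6]), int(m[7])
    # Convert the 12-hour clock to 24-hour time.
    hour = hour % 12 + (12 if m[4] == "PM" else 0)
    return datetime(year, month, day, hour, minute, second)


def close_in_time(a, b, tolerance=timedelta(minutes=5)):
    """True if two timestamps fall within the (assumed) tolerance."""
    return abs(a - b) <= tolerance


description = (
    "The previous system shutdown at 6:54:04 PM on 8/25/2008 was unexpected."
)
shutdown_time = parse_unexpected_shutdown(description)
print(shutdown_time)
```

Each parsed time could then be passed to close_in_time together with a timestamp read from the vendor's log, whatever format that log happens to use.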

 

To narrow down which component was causing the error, we set the NMICrashDump DWORD value under the following key in the registry:

 

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl

 

This is described in detail in the article, “927069 How to generate a complete crash dump file or a kernel crash dump file by using an NMI on a Windows-based system”

https://support.microsoft.com/default.aspx?scid=kb;EN-US;927069

 

This registry value causes the machine to bugcheck with a STOP 0x80 (NMI_HARDWARE_FAILURE) when Windows detects an NMI. That produces a dump file or, if a debugger is attached, breaks into the debugger.
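For anyone who wants to script the change, here is a minimal sketch that generates the text of a .reg file for this value. It assumes the value data of 1 given in KB 927069 (the data is not stated in this post), and the function name is hypothetical. Treat the KB article as the authority for the exact steps.

```python
# Key path as given in this post; value name and data (1) per KB 927069.
KEY_PATH = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl"
VALUE_NAME = "NMICrashDump"
VALUE_DATA = 1  # assumption taken from the KB article, not from this post


def build_reg_file(key_path, value_name, dword):
    """Return .reg file text that sets one DWORD value under key_path."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key_path}]",
        # DWORD data is written as eight hexadecimal digits.
        f'"{value_name}"=dword:{dword:08x}',
        "",
    ]
    return "\r\n".join(lines)


reg_text = build_reg_file(KEY_PATH, VALUE_NAME, VALUE_DATA)
print(reg_text)
```

The printed text can be saved with a .reg extension and reviewed before it is imported on the affected server.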

 

After setting this registry value we hooked up the debugger again and waited... after a while we got lucky, because the debugger intercepted a STOP 0x80!

 

At that time, I ran “!pci 0x102 ff” to get an overview of the various PCI devices and their respective states. The !pci command showed the following output (VendorID and DeviceID have been removed):

 

PCI Configuration Space (Segment:0000 Bus:00 Device:1e Function:00)

Common Header:

    00: VendorID <vendor>

    02: DeviceID <device>

    04: Command 0147 IOSpaceEn MemSpaceEn BusInitiate PERREn SERREn

    06: Status 4010 CapList SERR

    08: RevisionID d9

    09: ProgIF 01 Subtractive

    0a: SubClass 04 PCI-PCI Bridge

    0b: BaseClass 06 Bridge Device

    0c: CacheLineSize 0000

    0d: LatencyTimer 00

    0e: HeaderType 01

    0f: BIST 00

    10: BAR0 00000000

    14: BAR1 00000000

    18: PriBusNum 00

    19: SecBusNum 01

    1a: SubBusNum 01

    1b: SecLatencyTmr 20

    1c: IOBase 20

    1d: IOLimit 30

    1e: SecStatus 6280 FB2BCapable InitiatorAbort SERR DEVSELTiming:1

    20: MemBase f7e0

    22: MemLimit f7f0

    24: PrefMemBase d801

    26: PrefMemLimit dff1

    28: PrefBaseHi 00000000

    2c: PrefLimitHi 00000000

    30: IOBaseHi 0000

    32: IOLimitHi 0000

    34: CapPtr 50

    38: ROMBAR 00000000

    3c: IntLine ff

    3d: IntPin 00

    3e: BridgeCtrl 000b PERRREnable SERREnable VGAEnable

 

We couldn't have gone much further without the vendor's assistance. They informed us that the SERR flag in the Status field indicates that a PCI System Error has occurred in this PCI-PCI Bridge. At this point I had enough conclusive data to pass my findings to the hardware vendor for full collaboration on the problem. They continued investigating the issue.
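The flag names that !pci prints next to Status and SecStatus are decoded from individual bits of those 16-bit registers. As a rough illustration, the sketch below decodes the two values from the output above. The bit positions and long names follow the standard PCI Status register layout as I understand it from the PCI specifications; they are not taken from this post or from the vendor, and the function and table names are hypothetical.

```python
# Bit positions in the 16-bit PCI Status register (assumed from the PCI spec).
STATUS_FLAGS = {
    4: "Capabilities List",
    5: "66 MHz Capable",
    7: "Fast Back-to-Back Capable",
    8: "Master Data Parity Error",
    11: "Signaled Target Abort",
    12: "Received Target Abort",
    13: "Received Master Abort",
    14: "Signaled System Error",
    15: "Detected Parity Error",
}

# A bridge's Secondary Status uses the same layout, except that bit 14
# reports a system error received on the secondary side.
SECONDARY_STATUS_FLAGS = dict(STATUS_FLAGS)
SECONDARY_STATUS_FLAGS[14] = "Received System Error"

# Bits 9-10 form a two-bit DEVSEL timing field rather than a flag.
DEVSEL_TIMING = {0: "fast", 1: "medium", 2: "slow"}


def decode_status(value, flags=STATUS_FLAGS):
    """Return (list of set flag names, DEVSEL timing) for a status value."""
    names = [name for bit, name in sorted(flags.items()) if value & (1 << bit)]
    devsel = DEVSEL_TIMING.get((value >> 9) & 0b11, "reserved")
    return names, devsel


# Values copied from the !pci output above.
print(decode_status(0x4010))                          # Status
print(decode_status(0x6280, SECONDARY_STATUS_FLAGS))  # SecStatus
```

With these tables, bit 14 is set in both registers, which lines up with the SERR flag the debugger printed for Status and SecStatus.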

 

It should be noted that a hardware problem is not the only possible reason for an Event ID 6008. A quick search of the Microsoft Knowledge Base turns up other causes that can make this Event ID appear in the Windows System Event log.
