Understanding Storage Timeouts and Event 129 Errors

Greetings fellow debuggers, today I will be blogging about Event ID 129 messages.  These warning events are logged to the system event log with the storage adapter (HBA) driver’s name as the source.  Windows’ STORPORT.SYS driver logs this message when it detects that a request has timed out, the HBA driver’s name is used in the error because it is the miniport associated with storport.

 

Below is an example 129 event:

 

Event Type:       Warning

Event Source:     <HBA_Name>

Event Category:   None

Event ID:         129

Date:             4/9/2009

Time:             1:15:45 AM

User:             N/A

Computer:         <Computer_Name>

Description:

Reset to device, \Device\RaidPort1, was issued.

 

So what does this mean?  Let’s discuss the Windows I/O stack architecture to answer this.

 

Windows I/O uses a layered architecture where device drivers are on a “device stack.”  In a basic model, the top of the stack is the file system.  Next is the volume manager, followed by the disk driver.  At the bottom of the device stack are the port and miniport drivers.  When an I/O request reaches the file system, it takes the block number of the file and translates it to an offset in the volume. The volume manager then translates the volume offset to a block number on the disk and passes the request to the disk driver.  When the request reaches the disk driver it will create a Command Descriptor Block (CDB) that will be sent to the SCSI device.  The disk driver imbeds the CDB into a structure called the SCSI_REQUEST_BLOCK (SRB).  This SRB is sent to the port driver as part of the I/O request packet (IRP).

 

The port driver does most of the request processing.  There are different port drivers depending on the architecture.  For example, ATAPORT.SYS is the port driver for ATA devices, and STORPORT.SYS is the port driver for SCSI devices.  Some of the responsibilities for a port driver are:

  • Providing timing services for requests.
  • Enforcing queue depth (making sure that a device does not get more requests that it can handle).
  • Building scatter gather arrays for data buffers.

The port driver interfaces with a driver called the “miniport”.  The miniport driver is designed by the hardware manufacturer to work with a specific adapter and is responsible for taking requests from the port driver and sending them to the target LUN.  The port driver calls the HwStorStartIo() function to send requests to the miniport, and the miniport will send the requests to the HBA so they can be sent over the physical medium (fibre, ethernet, etc) to the LUN.  When the request is complete, the miniport will call StorPortNotification() with the NotificationType parameter value set to RequestComplete, along with a pointer to the completed SRB.

 

When a request is sent to the miniport, STORPORT will put the request in a pending queue.  When the request is completed, it is removed from this queue.  While requests are in the pending queue they are timed. 

 

The timing mechanism is simple.  There is one timer per logical unit and it is initialized to -1.  When the first request is sent to the miniport the timer is set to the timeout value in the SRB.  The disk timeout value is a tunable parameter in the registry at: HKLM\System\CurrentControlSet\Services\Disk\TimeOutValue.  Some vendors will tune this value to best match their hardware, we do not advise changing this value without guidance from your storage vendor.

 

The timer is decremented once per second.  When a request completes, the timer is refreshed with the timeout value of the head request in the pending queue.  So, as long as requests complete the timer will never go to zero.  If the timer does go to zero, it means the device has stopped responding.  That is when the STORPORT driver logs the Event ID 129 error.  STORPORT then has to take corrective action by trying to reset the unit.  When the unit is reset, all outstanding requests are completed with an error and they are retried.  When the pending queue empties, the timer is set to -1 which is its initial value.

image002

Each SRB has a timer value set.  As requests are completed the queue timer is refreshed with the timeout value of the SRB at the head of the list.

 

The most common causes of the Event ID 129 errors are unresponsive LUNs or a dropped request.  Dropped requests can be caused by faulty routers or other hardware problems on the SAN.

 

I have never seen software cause an Event ID 129 error.  If you are seeing Event ID 129 errors in your event logs, then you should start investigating the storage and fibre network.