As a result of some internal discussions, I decided that a blog post on this topic was long overdue. The ramifications of setting the timeout value for Disk.sys are not widely understood, and a poorly chosen value can lead to long recovery times in enterprise environments.
The Windows Disk timeout is stored in the following registry key:
For example, most array vendors will set this value to (at least) 60 seconds. This means, for all intents and purposes, that Windows will not report any problems with timeouts until at least 60 seconds have passed.
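To see what a given machine is currently using, a minimal sketch along these lines could read the value. It assumes the commonly documented location, a REG_DWORD named `TimeOutValue` under `HKLM\SYSTEM\CurrentControlSet\Services\Disk`, expressed in seconds. The function name `read_disk_timeout` is hypothetical, and the script simply returns `None` when it is not run on Windows or the value is not present.

```python
import sys

# Assumed registry location of the disk timeout (seconds, REG_DWORD).
DISK_KEY = r"SYSTEM\CurrentControlSet\Services\Disk"
VALUE_NAME = "TimeOutValue"


def read_disk_timeout():
    """Return the configured disk timeout in seconds, or None if unavailable."""
    if sys.platform != "win32":
        return None  # the registry only exists on Windows

    import winreg  # Windows-only standard library module

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, DISK_KEY) as key:
            value, value_type = winreg.QueryValueEx(key, VALUE_NAME)
    except OSError:
        return None  # key or value is missing, or access was denied

    if value_type != winreg.REG_DWORD:
        return None  # unexpected type, do not guess at its meaning
    return int(value)


if __name__ == "__main__":
    timeout = read_disk_timeout()
    if timeout is None:
        print("Disk timeout not set or not readable on this system")
    else:
        print(f"Disk timeout is {timeout} seconds")
```

Reading the value is harmless. Changing it is a different matter and should go through the same testing you would apply to any other storage change.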
If, for some reason, the Fibre Channel frame in question was dropped, the delay is compounded: after this timeout is hit, the Disk / Class layer will retry the I/O operation 8 times at the interval of the disk.sys timeout. During this time, since Windows is waiting for the SAN to return data, the operating system may appear to be frozen, depending on what the data was.
This means that with the value of 60 seconds, the effective impact to the user is the following:
- I/O timeouts will not be exposed in the Windows Event log until after the 60-second timeout AND the retries have expired, so effectively this translates to a delay of about 8 minutes (8 retries × 60 seconds).
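The arithmetic behind that figure can be sketched in a few lines. This is a simplified model that follows the description above (8 retries, each waiting one full timeout interval). The function name is hypothetical, and real recovery times will also depend on the HBA, the multipath stack, and the array.

```python
def worst_case_stall_seconds(timeout_s, retries=8):
    """Upper bound on the stall: one full timeout interval per retry."""
    if timeout_s <= 0 or retries <= 0:
        raise ValueError("timeout_s and retries must be positive")
    return timeout_s * retries


if __name__ == "__main__":
    # Compare the common vendor value with the lower values discussed here.
    for timeout in (60, 30, 20):
        stall = worst_case_stall_seconds(timeout)
        print(f"timeout={timeout}s -> up to {stall}s ({stall / 60:.1f} min)")
```

With the 60-second value this works out to 480 seconds, which is the 8 minutes mentioned above. At 20 seconds the same model gives 160 seconds.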
While there is not any single rule which will apply equally to everyone in this case, I think most people would agree that waiting 8 minutes is far too long.
Going forward, Microsoft is recommending that the Disk timeout value be set as low as possible, and no greater than 20 to 30 seconds. In this way, you can decrease the time to recovery on dropped I/O packets.
Why not just set it to 1 second? I don’t want to wait that long:
Keep in mind that the disk.sys timeout value is a global setting. If you were to set it to 1 second, a drive which is asleep would also have only 1 second to spin up before a timeout is reported on the device.
It’s also important that this value be set high enough on systems which do not use SAN storage exclusively. For example, if you were to set the timeout value to 5 seconds on Windows Client operating systems where a SAN is not connected, you would likely see timeout errors which were not actually a problem, such as when a DVD or local disk is spinning up after being asleep. A good starting point is a value between 10 and 30 seconds.
Clearly, a reasonable value is key; however, we would strongly recommend that this be no more than 30 seconds going forward. As with any other change, this should be evaluated for impact in a test environment before it is implemented in a production environment.
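The guidance above can be turned into a quick sanity check for a candidate value. This is only a sketch: the 10 and 30 second bounds come from the text, while the function name and the wording of the notes are my own illustration, not an official tool.

```python
from typing import List

RECOMMENDED_MIN_S = 10  # lower end of the suggested starting range
RECOMMENDED_MAX_S = 30  # upper bound recommended going forward


def review_timeout(timeout_s: int) -> List[str]:
    """Return notes about a candidate disk timeout; an empty list means no concerns."""
    if timeout_s < 1:
        raise ValueError("timeout must be at least 1 second")

    notes = []
    if timeout_s > RECOMMENDED_MAX_S:
        notes.append(
            f"{timeout_s}s is above {RECOMMENDED_MAX_S}s: timeout errors "
            "will surface late and recovery will be slow"
        )
    elif timeout_s < RECOMMENDED_MIN_S:
        notes.append(
            f"{timeout_s}s is below {RECOMMENDED_MIN_S}s: expect spurious "
            "timeouts while local disks or optical drives spin up"
        )
    return notes


if __name__ == "__main__":
    for candidate in (1, 5, 20, 60):
        print(candidate, review_timeout(candidate) or "within the suggested range")
```

A check like this does not replace testing under a real workload. It only flags values that fall outside the range discussed here.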
What are the key things to consider when setting the disk timeout value?
The greater this value, the longer it will be before any timeout errors are surfaced by Windows. We would in general recommend setting this value low enough to meet any required SLAs on the storage, rather than setting it too high.
To drive this point a little further, let’s review a potential scenario:
Let’s say that I have a SQL Server deployment, and my primary concern is fast I/O to the SQL database. SQL in general requires very short delays on I/O responses, and delays can translate to a delay in client applications built on top of SQL.
In fact, SQL Server will generate an Application event any time an I/O takes longer than 15 seconds.
The problem is that the default value of 60 (or more) seconds used by most array vendors would prevent you from seeing any System events related to I/O timeouts unless an I/O is delayed for longer than 60 seconds.
This can make troubleshooting extremely difficult: on one hand you would have a SQL client application acting slow and SQL reporting slow I/O, while on the other hand there are no events from Windows.
To ensure that the system events required for troubleshooting are not lost, you would likely want to find a timeout of at most roughly 15 seconds that still works well under production workloads without generating excessive event log noise.
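To put a number on that blind spot, here is a rough sketch. It assumes a simplified model in which SQL Server flags an I/O once it exceeds 15 seconds and Windows logs nothing until the disk timeout has elapsed. The function name is hypothetical.

```python
SQL_SLOW_IO_THRESHOLD_S = 15  # SQL Server reports I/O slower than this


def silent_window_seconds(disk_timeout_s, sql_threshold_s=SQL_SLOW_IO_THRESHOLD_S):
    """Seconds during which SQL reports slow I/O but Windows has logged nothing."""
    if disk_timeout_s <= 0 or sql_threshold_s <= 0:
        raise ValueError("thresholds must be positive")
    # No gap exists when the disk timeout fires at or before the SQL threshold.
    return max(0, disk_timeout_s - sql_threshold_s)


if __name__ == "__main__":
    for timeout in (60, 30, 15):
        gap = silent_window_seconds(timeout)
        print(f"disk timeout {timeout}s -> {gap}s with SQL warnings but no System events")
```

Under this model a 60-second timeout leaves a 45-second window in which only SQL is complaining, and a 15-second timeout closes it.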
If you are using Fibre Channel-based storage, any occurrence of a dropped FC frame will cause all I/O to halt until recovery occurs, or until Storport has exceeded its 8 retry attempts at intervals of the Disk timeout.
So, as I mentioned previously, 60 seconds will likely be far too long in this case, as it would translate to an 8-minute outage on a dropped frame.
Even if you are not using Fibre Channel storage, in most cases, a disk timeout value of 60 seconds would be too long, as it would mask timeout errors far beyond the point where a user could perceive slow access to data.
When troubleshooting issues with storage performance, it may be beneficial to set this slightly lower than normal to increase the chances of catching timeout errors.
I have received some questions on what timeout values exist for both iSCSI and MPIO. These are documented in their respective user guides. I’m including links to these documents below: