Most of us have seen these errors in the SQL Server error log pointing to a IO subsystem problem:
2015/06/15 12:43:01 spid8s SQL Server has encountered 1016 occurrence(s) of I/O requests taking longer than 15 seconds
to complete on file [N:\ABCDATA8\ [ABCDATA8.ndf] in database [ABC] (5). The OS file handle
is 0x000000000000172C. The offset of the latest long I/O is: 0x0000073234
For a customer case this message has to be analyzed in more detail, because the storage group at the customer insists to get more information as they don’t see these high numbers on the storage system at that time. A deeper investigation reveals these findings:
- the counter is collected and reported for each file separately
- the message is printed by the checkpoint, the background process that kicks in every 5 seconds
- it is printed in a minimum interval of 300 seconds (5 minutes) per file, means only one message per 5 minutes per file
- it checks the amount of ticks (milliseconds) IO requests were active and if this exceeds 15000 (15 seconds) this IO request is counted
- it is doing this for all parallel and asynchronously IO requests (per file)
The problem is that these are not the amount of IO requests that are counted, but the amount of times IO request were seen during the reporting time. Imagine you have a very long IO request going on (only one). The reporting kicks in after 5 seconds (by the checkpoint) and after 3 times (15 seconds) the IO request is the first time counted, but no message is printed yet as the five-minute minimum per file is not reached yet. The very same IO request is counted at each reporting cycle (every 5 seconds) until the 300 seconds threshold is reached and the message is printed (and the counter is reset). Until then this one IO request is counted 57 times (300 seconds / 5 seconds reporting interval = 60 occurrences – 3 time free (first 15 seconds)).
So if you have 1000 occurrences means that in this 60 reporting cycles of the 5 minutes reporting interval 1000 IO requests were seen that needed more than 15 seconds. That might were only a handful very long running IO requests (e.g. minimum is (1000 / 57) = 17.5 requests) or many shorter request (e.g. 1000 requests of 15 – 19 seconds).
The message is misleading as it is talking of occurrences and not of IO requests, but it still points to a storage problem. Note well, however, when you see even one such message it means you had at least one event where an I/O took more than 15 SECONDs to complete. This is VERY SLOW. This indicates a problematic I/O event. But there are not so many IO requests hanging as stated in the message, but at least there are some.
I hope this sheds some light on it.