Why are those IOs taking so long?

Ever see something like this in your SQL Errorlog?

SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [T:MSSQLDATAtempdb.mdf] in database [tempdb] (2). The OS file handle is 0x00000838. The offset of the latest long I/O is: 0x000000ebdc0000

You may hear this referred to as a stalled IO. I see questions on this warning message quite a bit, mostly about what’s wrong with SQL Server and why is SQL taking so long to read or write to the disk. Let me explain why this isn’t a SQL Server problem.

 

SQL Server does data file reads and writes almost exclusively as asynchronous IO, using the win32 APIs ReadFile, WriteFile, ReadFileScatter and WriteFileGather. Each of these APIs behave in a fairly similar manner – the caller sends in a handle to the file, some memory location to read or write, the size of the block and a structure that tells the kernel how to handle the IO. In SQL Server’s case the how to handle the IO is Asynchronously, please. The call returns immediately so that the thread issuing the IO can get out of the way and make life happy for some other user who’d also like their query to return.

 

The catch here is that ordinarily the time between the Asynchronous call to read or write and the completion of the read or write should be on the order of 10ms. Of course the longer it takes for an IO to return the more noticeable a performance impact there is to end users.

 

Prior to SQL Server 2000 SP4 the only way you would be able to tell if your IOs were taking longer than expected would be to use System Monitor and watch the PhysicalDiskAvg Disk sec./read, write and transfer counter. This is a relatively acceptable method when the cause of your IO bottleneck is the latency of the physical disk, you might be surprised to find out that’s not the only thing that might slow down an IO!

 

Once a user mode application issues an IO request it’s the equivalent of putting a package in the mail, there’s nothing to do but wait. While you may not have anything to do once you’ve dropped the package the parcel carriers job has just begun, what with all the processing and labeling and transit – you get the idea. Likewise for the kernel. Once an IO transitions over to kernel mode it’s transformed into something called and IRP (interrupt request packet) and begins a trek down levels of filter drivers, virus scanners and device drivers before it finally makes its way home to a physical device.

 

Windows exposes methods for device and software manufactures to be notified of, and participate in, IOs. Filter drivers are one method of doing this, and it allows for great functionality of verifying you aren’t writing a file with a virus pattern, or backing up a file as it’s being written. The problem here is that the filter driver can hold up an IO for an extended period of time, and this time isn’t reflected in the Sysmon counter.

 

So what are you to do? You’ve put your specially crafted package in the mail, but the intended recipient still hasn’t got it.

 

The approach taken in SQL Server is to make note of the time the IO started, the offset of the IO within the file then check back a short time later to make sure all IOs have completed. If an IO is still outstanding and 15 seconds have elapsed then the warning is printed to the errorlog to alert the system administrator that something’s amiss.

 

When you see this message the first action should still be to have a look at the physical disk counters in sysmon to ensure that the disks are servicing IOs in a reasonable period of time. If those appear to fine then start looking at what filter drivers might be installed on your system, and if there are any known issues with them, or disable them if you don’t need them.