"There's nothing wrong with the SAN!"

Author: David Williams (Microsoft)
Contributor & Technical Reviewer: Matthew Robertshaw (Microsoft)

How many times have you heard the above from the Storage team, having looked at SQL's wait types and seen PAGEIOLATCH_*?

You know something's gone wrong with your disks, because SQL's telling you so. But the guy/gal in the disk team isn't having any of it.

What next? How do you tell who's right? This is an amazingly common scenario - I've seen it a hundred times (and been embroiled in most of them).

This blog post will show you how to go about figuring out, from a SQL Server perspective, whether the disks really are at fault or if there's something important you've missed. This is going to be a pretty long haul but it's worth it, so get a cup of tea/coffee/local equivalent and get comfy.

First off we need to make some clarifications on terminology. This may seem academic to start with, but it's really important to get specific with your definitions. You'll see why shortly.

SQL's telling you PAGEIOLATCH_* is the major wait type. What does this really mean? It means a query tried to access a page that had physical IO outstanding against it (the page was being read from or written to disk), and the task had to wait on the page's latch until that IO completed.
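If you want to confirm that PAGEIOLATCH_* really is dominating, the wait stats DMV is the quickest place to look. Here's a minimal sketch (production scripts usually filter out a much longer list of benign wait types); the numbers are cumulative since the instance last started or the stats were last cleared:

-- Top waits by total wait time. PAGEIOLATCH_* near the top suggests
-- tasks are queueing on pages with physical IO outstanding.
SELECT TOP (10)
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms,
       wait_time_ms - signal_wait_time_ms AS resource_wait_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT LIKE N'SLEEP%'   -- drop the obviously benign sleeps
ORDER BY wait_time_ms DESC;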

Note at this point: you're waiting for physical IO to happen. The IO stack in SQL is pretty shallow because we farm out IO requests to Windows, which in turn farms them out to the kernel, which pushes them through a driver, through the HBA (or equivalent), through the fibre, onto a SAN (probably), then through the various levels in the SAN, until they hit one or more spindles.

Hopefully at this point you're beginning to think: "Wow, that's a lot of things that can go wrong!". Indeed it is - each one of these layers can experience issues or contention. So rather than thinking "It's the disks!", we need to start thinking "It's the physical IO stack!".

Ok, so, we've got "physical IO contention". How does this redefinition help?

We can now approach troubleshooting the stack (rather than the disks) to see where the contention/problem lies. First off we need to figure out whether we're driving the stack too hard. The back end has a limit on how much data it can shift, and on the latency at which it can shift it, so let's make sure we're not pushing it beyond that.

How hard is "too hard"? The answer may not be as simple as it first appears.

When one configures a SAN for SQL it's a good idea to run SQLIO to gauge what the physical IO stack is capable of doing in that configuration, to determine if that suits the database activity that will be occurring. If we're running a Data Warehouse we'll want to do lots of parallel scan operations on the data files (probably reads) and a fair amount of sequential writes on the logs. If we're running OLTP then we're likely doing lots of short, serial, random IO on the datafiles (reads and writes), and lots of short sequential writes to the logfiles. Same goes for TempDB.
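SQL keeps its own cumulative accounting of file IO, which gives a rough sanity check on which of those patterns you're actually seeing. The sketch below uses average IO size as the tell: big average reads (hundreds of KB) point at scan-heavy warehouse activity, while small ones (8-64KB) are more typical of OLTP-style random access. The figures are cumulative since the instance last started, so treat them as a long-run profile rather than a point-in-time view:

-- Average IO size per database file since the instance started.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.type_desc,                 -- ROWS (data) or LOG
       mf.physical_name,
       CASE WHEN vfs.num_of_reads = 0 THEN 0
            ELSE vfs.num_of_bytes_read / vfs.num_of_reads / 1024
       END AS avg_read_kb,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE vfs.num_of_bytes_written / vfs.num_of_writes / 1024
       END AS avg_write_kb
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
ORDER BY database_name, mf.type_desc;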

So once the back end is correctly configured for the activity type and SQLIO has been run to check performance, we know the throughput and latency at which the configuration tops out.

If we ran SQLIO on the new system and kept the output files and PerfMon captures handy, we can refer back to this data to find out the configuration's maximum capabilities. We can then compare it to the current response patterns to see if the disks are being driven close to their known maximum. If we didn't run SQLIO, then we just need to look at how the system is responding now and judge whether it's overstressed.
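The same DMV we used for IO sizes also records latency as SQL sees it, which makes a handy comparison point against the SQLIO baseline (or, failing that, against the rule-of-thumb latency targets further down). Bear in mind these are averages over the whole uptime, so they smooth away spikes - PerfMon remains the tool for catching peaks:

-- Average IO latency per database file, as observed by SQL Server.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.num_of_writes,
       CASE WHEN vfs.num_of_reads = 0 THEN 0
            ELSE vfs.io_stall_read_ms / vfs.num_of_reads
       END AS avg_read_ms,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE vfs.io_stall_write_ms / vfs.num_of_writes
       END AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
ORDER BY avg_read_ms DESC;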

PerfMon shows how the physical IO stack is responding. But which counters are best? For a start, NEVER use the Physical Disk counters unless Windows really is talking directly to physical disks. If a single drive in Windows maps to multiple disks (as it will on most SANs), use Logical Disk instead. The reason is that many SANs hide the true number of physical disks from you, which makes conclusions drawn at the "physical" level confusing or downright incorrect.

These counters are your new Best Friends:

Logical Disk :: % Idle Time

Logical Disk :: Avg. Disk sec/Read

Logical Disk :: Avg. Disk sec/Write

Logical Disk :: Disk Read Bytes/sec

Logical Disk :: Disk Write Bytes/sec

SQLServer:Buffer Manager :: Page Life Expectancy

These counters should be captured over at least 20 minutes, at a 1-second interval, for each disk, while the disks are under their peak load. I would normally expect to see a separate drive letter/mount point for System Databases, tempdb, Data files, Log files, and Backups. Each of these may need a different configuration to handle the differing file activity types (there's a blog post coming up on this soon!).

"% Idle Time" tells us how hard the drive is working. If this counter is reading under 80% average or spiking under 40% (min) then the underlying disks are working too hard.

"Avg. Disk sec/Read" tells us the read latency for the drive. For the data file drive this should average under 17ms with no spikes over 25ms, for the TempDB drive it should average under 10ms with no spikes over 10ms.

"Avg. Disk sec/Write" tells us the write latency for the drive. For the datafile drive this should average under 17ms with no spikes over 25ms, for the TempDB and logfile drives it should average under 10ms with no spikes over 10ms.

"Disk Read Bytes/sec" gives us context for the above counters - it tells us the volume of reads happening on the drive. There is no "correct" value for this, but note that drives tuned for OLTP may not be able to provide high volume throughput.

"Disk Write Bytes/sec" gives us context for the above counters - it tells us the volume of writes happening on the drive. There is no "correct" value for this, but note that drives tuned for OLTP may not be able to provide high volume throughput.

"Page Life Expectancy" tells us how volatile the Buffer Pool is. If this is below threshold then we're probably doing too much reading and stressing the physical IO stack. See my previous blog on how to calculate PLE (https://blogs.msdn.com/b/mcsukbi/archive/2013/04/12/sql-server-page-life-expectancy.aspx).

Armed with the above counters we can tell if the issue that is happening can be dealt with by the SQL team, or by someone else.

If we have low idle time, high latency, and low throughput, then the back end may be saturated with other people's activity, or incorrectly configured for the load type we're placing on it. We're not asking much of our drive, yet it's busy and not responding as we'd expect. Check Page Life Expectancy (PLE) to see if the system needs tuning.

If we have high idle time, low latency, and high read/write throughput, we're doing a lot of work but the stack is coping. The PLE will likely be low, but we need to find out what's running to cause such a large amount of IO activity - the sketch below is one way to catch it in the act.
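A quick first pass is to look at what's executing right now, heaviest IO first (requests without a SQL handle, such as some system tasks, drop out of the CROSS APPLY):

-- Currently executing requests ordered by IO issued so far.
SELECT r.session_id,
       r.status,
       r.command,
       r.wait_type,
       r.reads,          -- reads issued by this request so far
       r.writes,         -- writes issued by this request so far
       r.logical_reads,
       t.[text] AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID   -- ignore this query itself
ORDER BY r.writes DESC, r.reads DESC;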

If we have high idle time, high latency, and low throughput, we're bottlenecking outside the disks. We need to check the driver levels and the HBA configuration. If these are good, check SAN fabric/node saturation levels.

There are too many specific combinations to cover here, but by now you should be getting the picture.

So in summary – if we suspect that the physical IO stack isn’t responding as we’d expect, make sure that the issue doesn’t lie a little closer to home.