How To: Troubleshooting SQL Server I/O bottlenecks

One of the most common reason for server performance issues with respect to SQL Server is the presence of an I/O bottleneck on the system. When I say I/O bottleneck, it can mean issues like slow disks, other processes hogging I/O, out-dated drivers, etc. In this blog, I will seek to outline the approach for identifying and troubleshooting I/O bottlenecks on SQL Server.

 

The Symptoms

 

The following are the most common symptoms of an I/O bottleneck on the SQL Server machine:

  • You see a lot of threads waiting on one or more of the following waits:
    • PAGEIOLATCH_*
    • WRITELOG
    • TRACEWRITE
    • SQLTRACE_FILE_WRITE_IO_COMPLETION
    • ASYNC_IO_COMPLETION
    • IO_COMPLETION
    • LOGBUFFER
  • You see the famous "I/O taking longer than 15 seconds" messages in the SQL Server errorlogs: 
    2012-11-11 00:21:25.26 spid1 SQL Server has encountered 192 occurrence(s) of IO requests taking longer than 15 seconds to complete on file [E:SEDATAstressdb5.ndf] in database [stressdb] (7). The OS file handle is 0x00000000000074D4. The offset of the latest long IO is:0x00000000022000”.

 

Troubleshooting

 

Data Collection:

 

If you see the symptoms outlined above quite frequently on your SQL Server installation, then it will be safe to draw the conclusion that your instance is suffering from a disk subsystem or I/O bottleneck. Let's look at the data collection and troubleshooting approach pertaining to the same:

  1. Enable a custom Performance Monitor collector to capture all disk related counters. Just go to start->run, type perfmon, and hit ok. Next, go to Data Collector sets->User Defined, right click on User Defined, and click New-> Data Collector set.
    Note: The best thing about perfmon(apart from the fact that it is built into windows) is that it's a very lightweight diagnostic, and has negligible performance overhead/impact.

  2. Give the data collector set a name, and select Create manually. Under type of data, select the "Create data logs" option, and check the Performance Counter checkbox under it.

  3. Next, click on add performance counters, and select the "LogicalDisk", "Process" and "PhysicalDisk" groups, and select "All instances" for both before adding them.

  4. After you have added the counters, you can also modify the sample interval. You might want to do this if you see spikes lasting less than 15 seconds, which is the default sample interval. I sometimes use an interval of 5 seconds when I want to closely monitor an environment .

  5. Click on Finish and you will now see the new Data Collector set created under User Defined.

  6. Next, right click on the Data Collector set you just created, and click start.

 

I normally recommend my clients to run the perfmon collector set for at least one business day, so that it has captured data for the load exerted by at least one standard business cycle.

 

 

Analysis:

 

Now that we have the data, we can start the analysis. After stopping the collector set, you can open the blg file generated (the path is displayed under the output column, on the right hand side in perfmon) using perfmon (a simple double click works, as the file type is associated with perfmon by default). Once open, is should have automatically loaded all the counters. Analysing with all the counters can be a bit cumbersome, so I would suggest that you first delete all the counters and then add specific counters one by one.

I will list out the important counters here, along with their expected values: 

  1. Process->IO Data Bytes/sec: This counter represents the average amount of IO Data bytes/sec spawned by each process. In conjunction with IO Other Bytes/sec, this counter can be used to determine the average IO per second as well as the total amount of IO spawned by each process during the capture. Check for the largest I/O consumers, and see if SQL is being starved of I/O due to some other process spawning a large amount of I/O on the system.
     

  2. Process-> IO Other Bytes/sec: This counter represents the non-data IO spawned by each process during the capture. Usually, the amount of non-data IO is very low as compared to data IO. Use the total of both IO Data Bytes and IO other bytes to determine the total amount of IO spawned by each process during the capture. Check for the largest I/O consumers, and see if SQL is being starved of I/O due to some other process spawning a large amount of I/O on the system.
     

  3. Physical Disk/Logical Disk->Avg. Disk Sec/Read: This counter signifies the average amount of time, in ms, that it takes for a read I/O request to be serviced for each physical/logical disk. An average of less than 10 ms (0.010) is good, and between 10-15 ms (0.010-0.015) is acceptable, but anything beyond 15 ms (0.015) is a cause for concern.
     

  4. Physical Disk/Logical Disk->Avg. Disk Sec/Write: This counter signifies the average amount of time, in ms, that it takes for a write I/O request to be serviced for each physical/logical disk. An average of less than 10 ms (0.010) is good, and between 10-15 ms (0.010-0.015) is acceptable, but anything beyond 15 ms (0.015) is a cause for concern.

  5. Physical Disk/Logical Disk->Disk Bytes/Sec: This counter represents, in bytes, the throughput of your I/O subsystem for each physical/logical disk. Look for the max value for each disk, and divide it by 1024 twice to get the max throughput in MB for the disk. SAN's generally start from 200-250 MB/s these days. If you see that the throughput is lower than the specifications for the disk, it's not necessarily a cause for concern. Check this counter in conjunction with the Avg Disk Sec/Read or Avg Disk Sec/Write counters (depending on the wait/symptom you see in SQL), and see the latency at the time of the maximum throughput. If the latency is green, then it just means that SQL spawned I/O that was less the disk throughput capacity, and was easily handled by the disk.

  6. Physical Disk/Logical Disk->Avg. Disk Queue Length: This counter represents the average number of I/O's pending in the I/O queue for each physical/logical disk. Generally, if the average is greater than 2 per spindle, it's a cause for concern. Please note that I mentioned the acceptable threshold as 2 per spindle. Most SAN’s these days have multiple spindles. So, for example, if your SAN has 4 spindles, the acceptable threshold for Avg Disk Queue Length would be 8.
    Check the other counters to confirm.

  7. Physical Disk/Logical Disk->Split IO/Sec: This counter indicates the I/O's for which the Operating System had to make more than one command call, grouped by physical/logical disk. This happens if the IO request touches data on non-contiguous file segments. It's a good indicator of file/volume fragmentation.

  8. Physical Disk/Logical Disk->%Disk Time: This counter is a general mark of how busy the physical/logical disk is. Actually, it is nothing more than the “Avg. Disk Queue Length” counter multiplied by 100. It is the same value displayed in a different scale. This is the reason you can see the %Disk Time going greater than 100, as explained in the KB https://support.microsoft.com/kb/310067. It basically means that the Avg. Disk Queue Length was greater than 1 during that time. If you've captured the perfmon for a long period (a few hours or a complete business day), and you see the %Disk Time to be greater than 80%, it's generally indicative of a disk bottleneck, and you should take a closer look at the other counters to arrive at a logical conclusion.
     

It's important to keep 2 things in mind. One, make sure your data capture is not skewed or biased in any way (for example, do not run a capture at the time of a monthly data load or something). Second, make sure you correlate the numbers reflected across the various counters to arrive at the overall picture of how your disks are doing.

 

Most of the time, I see that people are surprised when they are told that there are I/O issues on the system. Their typical response is "But, it's been working just fine for x years, how can it create a bottleneck now?". The answer lies within the question itself. When the server was initially configured, the disk resources were sufficient for the load on the server. However, with time, it's inevitable that the business grows as a whole, and so do the number of transactions, as well as the overall load. As a result, there comes a day when the load breaches that threshold, and the disk resources on the server are no longer sufficient to handle it. If you come to office one fine day, see high latency on the disks during normal working hours, and are sure that

  • No special/additional workloads are running on SQL
  • No other process on the server is spawning excessive I/O,
  • Nothing changed on the server in the past 24 hours (like a software installation, patching, reboot, etc.)
  • All the BIOS and disk drivers on the server are up to date,

Then it's highly likely that the load on your server has breached this threshold, and you should think about asking your disk vendor(s) for a disk upgrade (after having them check the existing system once for latency and throughput, of course). Another potential root cause that can cause high latency is that your disk drivers and/or BIOS are out of date. I would strongly recommend checking periodically for updates to all the drivers on the machine, as well as the BIOS.

 

Hope this helps. As always, comments, feedbacks and suggestions are welcome.