How To: Troubleshooting SQL Server I/O bottlenecks


One of the most common reasons for server performance issues with SQL Server is the presence of an I/O bottleneck on the system. By I/O bottleneck, I mean issues like slow disks, other processes hogging I/O, outdated drivers, etc. In this blog, I will outline an approach for identifying and troubleshooting I/O bottlenecks on SQL Server.

 

The Symptoms

 

The following are the most common symptoms of an I/O bottleneck on the SQL Server machine:

  • You see a lot of threads waiting on one or more of the following wait types (a quick query to check these waits appears after this list):
    • PAGEIOLATCH_*
    • WRITELOG
    • TRACEWRITE
    • SQLTRACE_FILE_WRITE_IO_COMPLETION
    • ASYNC_IO_COMPLETION
    • IO_COMPLETION
    • LOGBUFFER
       
  • You see the famous “I/O taking longer than 15 seconds” messages in the SQL Server errorlogs: 
    2012-11-11 00:21:25.26 spid1 SQL Server has encountered 192 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\SEDATA\stressdb5.ndf] in database [stressdb] (7). The OS file handle is 0x00000000000074D4. The offset of the latest long I/O is: 0x00000000022000.
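
If you want to quickly check whether these waits are piling up on your instance, a query along the following lines against sys.dm_os_wait_stats can help. This is a minimal sketch; the numbers are cumulative since the last SQL Server restart (or since the wait statistics were last cleared), so interpret them relative to uptime.

    -- Minimal sketch: cumulative I/O-related waits since the last restart
    -- (or since the wait statistics were last cleared).
    SELECT wait_type,
           waiting_tasks_count,
           wait_time_ms,
           signal_wait_time_ms,
           wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type LIKE 'PAGEIOLATCH%'
       OR wait_type IN ('WRITELOG', 'LOGBUFFER', 'ASYNC_IO_COMPLETION',
                        'IO_COMPLETION', 'TRACEWRITE',
                        'SQLTRACE_FILE_WRITE_IO_COMPLETION')
    ORDER BY wait_time_ms DESC;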

 

Troubleshooting

 

Data Collection:

 

If you see the symptoms outlined above quite frequently on your SQL Server installation, then it is reasonably safe to conclude that your instance is suffering from a disk subsystem or I/O bottleneck. Let’s look at the data collection and troubleshooting approach:

  1. Create a custom Performance Monitor data collector set to capture all disk-related counters. Go to Start -> Run, type perfmon, and hit OK. Next, expand Data Collector Sets -> User Defined, right-click on User Defined, and click New -> Data Collector Set.
    Note: The best thing about perfmon (apart from the fact that it is built into Windows) is that it’s a very lightweight diagnostic, with negligible performance overhead/impact.

  2. Give the data collector set a name, and select Create manually. Under type of data, select the “Create data logs” option, and check the Performance Counter checkbox under it.

  3. Next, click on add performance counters, select the “LogicalDisk”, “Process” and “PhysicalDisk” groups, and select “All instances” for each group before adding them.

  4. After you have added the counters, you can also modify the sample interval. You might want to do this if you see spikes lasting less than 15 seconds, which is the default sample interval. I sometimes use an interval of 5 seconds when I want to monitor an environment closely.

  5. Click on Finish and you will now see the new Data Collector set created under User Defined.

  6. Next, right click on the Data Collector set you just created, and click start.

 

I normally recommend that my clients run the perfmon collector set for at least one business day, so that it captures the load exerted by at least one standard business cycle.
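
While perfmon gives you the OS-level view of disk performance, you can also get a quick per-database-file view of read and write latency from inside SQL Server. The following is a minimal sketch using the sys.dm_io_virtual_file_stats DMV; the numbers are cumulative since the last SQL Server restart, so they reflect long-term averages rather than point-in-time spikes.

    -- Minimal sketch: average I/O latency per database file, cumulative
    -- since the last SQL Server restart.
    SELECT DB_NAME(vfs.database_id)  AS database_name,
           mf.physical_name,
           vfs.num_of_reads,
           vfs.num_of_writes,
           CASE WHEN vfs.num_of_reads  = 0 THEN 0
                ELSE vfs.io_stall_read_ms  / vfs.num_of_reads  END AS avg_read_latency_ms,
           CASE WHEN vfs.num_of_writes = 0 THEN 0
                ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_latency_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
      ON mf.database_id = vfs.database_id
     AND mf.file_id     = vfs.file_id
    ORDER BY avg_read_latency_ms DESC;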

 

 

Analysis:

 

Now that we have the data, we can start the analysis. After stopping the collector set, you can open the .blg file generated (the path is displayed under the Output column, on the right hand side in perfmon) using perfmon (a simple double click works, as the file type is associated with perfmon by default). Once open, it should have automatically loaded all the counters. Analysing with all the counters can be a bit cumbersome, so I would suggest that you first delete all the counters and then add specific counters one by one.

I will list out the important counters here, along with their expected values: 

  1. Process->IO Data Bytes/sec: This counter represents the rate, in bytes per second, of data I/O spawned by each process. In conjunction with IO Other Bytes/sec, this counter can be used to determine the average I/O per second as well as the total amount of I/O spawned by each process during the capture. Check for the largest I/O consumers, and see if SQL is being starved of I/O due to some other process spawning a large amount of I/O on the system.
     
  2. Process-> IO Other Bytes/sec: This counter represents the non-data IO spawned by each process during the capture. Usually, the amount of non-data IO is very low as compared to data IO. Use the total of both IO Data Bytes and IO other bytes to determine the total amount of IO spawned by each process during the capture. Check for the largest I/O consumers, and see if SQL is being starved of I/O due to some other process spawning a large amount of I/O on the system.
     
  3. Physical Disk/Logical Disk->Avg. Disk Sec/Read: This counter signifies the average time, in seconds, that it takes for a read I/O request to be serviced for each physical/logical disk. An average of less than 10 ms (0.010) is good, and between 10-15 ms (0.010-0.015) is acceptable, but anything beyond 15 ms (0.015) is a cause for concern.
     
  4. Physical Disk/Logical Disk->Avg. Disk Sec/Write: This counter signifies the average time, in seconds, that it takes for a write I/O request to be serviced for each physical/logical disk. An average of less than 10 ms (0.010) is good, and between 10-15 ms (0.010-0.015) is acceptable, but anything beyond 15 ms (0.015) is a cause for concern.

  5. Physical Disk/Logical Disk->Disk Bytes/Sec: This counter represents, in bytes per second, the throughput of your I/O subsystem for each physical/logical disk. Look for the max value for each disk, and divide it by 1024 twice to get the max throughput in MB/s for the disk (a worked example of this arithmetic follows the counter list). SANs generally start from 200-250 MB/s these days. If you see that the throughput is lower than the specifications for the disk, it’s not necessarily a cause for concern. Check this counter in conjunction with the Avg. Disk Sec/Read or Avg. Disk Sec/Write counters (depending on the wait/symptom you see in SQL), and look at the latency at the time of the maximum throughput. If the latency is within acceptable limits, it just means that SQL spawned I/O at a rate below the disk throughput capacity, and it was easily handled by the disk.

  6. Physical Disk/Logical Disk->Avg. Disk Queue Length: This counter represents the average number of I/Os pending in the I/O queue for each physical/logical disk. Generally, if the average is greater than 2 per spindle, it’s a cause for concern. Please note that I mentioned the acceptable threshold as 2 per spindle. Most SANs these days have multiple spindles. So, for example, if your SAN LUN has 4 spindles, the acceptable threshold for Avg. Disk Queue Length would be 8.
    Check the other counters to confirm.

  7. Physical Disk/Logical Disk->Split IO/Sec: This counter indicates the rate of I/Os for which the Operating System had to make more than one command call, grouped by physical/logical disk. This happens if the I/O request touches data on non-contiguous file segments. It’s a good indicator of file/volume fragmentation.

  8. Physical Disk/Logical Disk->%Disk Time: This counter is a general mark of how busy the physical/logical disk is. In fact, it is nothing more than the “Avg. Disk Queue Length” counter multiplied by 100; it is the same value displayed on a different scale. This is the reason you can see %Disk Time exceed 100, as explained in the KB article http://support.microsoft.com/kb/310067. It basically means that the Avg. Disk Queue Length was greater than 1 during that time. If you’ve captured the perfmon data for a long period (a few hours or a complete business day) and you see %Disk Time above 80%, it’s generally indicative of a disk bottleneck, and you should take a closer look at the other counters to arrive at a logical conclusion.
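
To make the arithmetic behind the Disk Bytes/Sec and Avg. Disk Queue Length thresholds concrete, here is a small illustrative calculation. The sample values are made up for the example; plug in the peak value you see in your own capture and the spindle count your storage admin gives you.

    -- Illustrative arithmetic only; the sample values are made up.
    DECLARE @max_disk_bytes_per_sec bigint = 262144000;  -- peak Disk Bytes/Sec observed in perfmon
    DECLARE @spindles int = 4;                            -- spindles behind the LUN, per your SAN admin

    SELECT @max_disk_bytes_per_sec / 1024.0 / 1024.0 AS peak_throughput_mb_per_sec,      -- 250 MB/s
           @spindles * 2                             AS acceptable_avg_disk_queue_length; -- 8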
     

It’s important to keep two things in mind. First, make sure your data capture is not skewed or biased in any way (for example, do not run the capture at the time of a monthly data load). Second, make sure you correlate the numbers reflected across the various counters to arrive at the overall picture of how your disks are doing.

 

Most of the time, I see that people are surprised when they are told that there are I/O issues on the system. Their typical response is “But it’s been working just fine for x years, how can it create a bottleneck now?”. The answer lies within the question itself. When the server was initially configured, the disk resources were sufficient for the load on the server. However, with time, it’s inevitable that the business grows as a whole, and so do the number of transactions and the overall load. As a result, there comes a day when the load breaches that threshold, and the disk resources on the server are no longer sufficient to handle it. If you come into the office one fine day, see high latency on the disks during normal working hours, and are sure that:

  • No special/additional workloads are running on SQL,
  • No other process on the server is spawning excessive I/O,
  • Nothing changed on the server in the past 24 hours (like a software installation, patching, reboot, etc.), and
  • All the BIOS and disk drivers on the server are up to date,


then it’s highly likely that the load on your server has breached this threshold, and you should think about asking your disk vendor(s) for a disk upgrade (after having them check the existing system once for latency and throughput, of course). Another potential root cause of high latency is out-of-date disk drivers and/or BIOS. I would strongly recommend checking periodically for updates to all the drivers on the machine, as well as the BIOS.

 

Hope this helps. As always, comments, feedback and suggestions are welcome.

 

 

Comments (27)

  1. Harsh Chawla says:

    Good going man!! nice post!!

  2. samson says:

    nice post . Very informative

  3. Thanks for appreciating Samson…!!!

  4. karan Gulati says:

    Good one….

  5. Swaroop says:

    Very Nice Post..

  6. Thanks for appreciating Swaroop…!!!!

  7. Ishikha says:

    Good One. Easy to understand

  8. Thanks a lot Ishikha for your kind words…!!!

  9. Sylvia Moestl Vasilik says:

    Beautiful – very useful and easy to understand!

  10. Thanks a lot Sylvia…!!!!

  11. Maggie says:

    I am trying to determine if a physical SQL cluster is necessary or whether a pair of VM SQL Servers with mirroring will be sufficient for a new deployment. After some calculation, the expected load will be about 300 KB/second on average. In peak times it may double or triple. Any suggestion is appreciated.

  12. OracleMan Consulting says:

    outstanding! i am learning SQL Server and this really helps

  13. Balwinder Singh says:

    Very good article.

    I/O bottlenecks may reduce after updating statistics and rebuilding indexes.

    How do we update statistics and rebuild indexes for a database of size 2TB, and in what order should we do the activity?

  14. Ansar says:

    Great article harshdeep, thank you.

    Can we say that if disk I/O has a bottleneck on one physical disk, it will affect total server I/O performance?

    For example, SQL is hosted on SAN disks and C: is on a local disk.

    If the SQL disks have high I/O and high disk time, will that affect response time for C:?

    Thank you

  15. Hi Maggie…unfortunately, the answer is it depends. It depends on the hardware capacity, the I/O patterns and trends, and on the application code. I would strongly recommend that you test whatever solution you use before deploying it in production.

  16. Thanks Oracleman…happy to know you found the blog helpful.

  17. Hi Balwinder… rebuilding indexes helps with I/O bottlenecks because the I/O pattern is more contiguous, as opposed to the disk head having to move around due to fragmentation on the index before the rebuild. Updating statistics simply helps the optimizer make a better decision with regards to the execution plan, and has no impact on the I/O bottleneck.

    To answer your question, you should first make sure that the hardware is capable of supporting the I/O load (take the I/O patterns and I/O rate into account), within the thresholds that you define. If you are using SQL 2014, you can partition the table and the index, and rebuild the index for each partition. This will allow you to stagger the rebuilds, and only 1 partition will be impacted at any point in time. Hope this helps. Thanks.

  18. Hi Ansar…thanks for appreciating.

    A disk bottleneck on any disk on the system can impact the overall system performance due to a cascading effect, but it should not cause the response time/disk latency for the other disk(s) to go up. Hope this helps.

  19. Andre says:

    Hi HarshDeep_Singh

    I currently have a problem of application hangs. My application won't display its UI when I click the menu; it just displays a white screen (like an endless loop). This problem is usually solved after restarting the SQL service, but it happens again the next day.

    I've checked all the deadlocked processes and they're fine. I wonder if this is caused by a bottleneck effect.

  20. Hi Andre…thanks for reaching out. It could be due to a bottleneck, but could also be due to other factors such as contention for resources at the OS level, leaked (orphaned) transactions holding locks, etc. It would require detailed troubleshooting from both the app side and the SQL side to nail down the exact cause.

  21. Brijesh says:

    Hi Harshdeep,

    With the help of your post I tried troubleshooting the "SQL Server has encountered 2 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file" issue, and got Avg. Disk Sec/Write values of .080 and more at random times.

    Then I checked for space issues on the drive where database logging is enabled; it was 20% free.

    CPU and memory are also good, but I am still getting the same errors of I/O taking longer than 15 seconds.

    After researching more on this, I got to know that:

    On computers with multiple CPUs, the CPUs are designed to “sleep” during periods of low workload. When CPUs sleep, SQL may not accurately determine the CPUs' overall workload and may incorrectly report this as the I/O warning shown above. I need to confirm from you: if my server power option is set to "Turn Off Display after 10 Minutes", can this lead to these I/O issues?

    Thanks

    Brijesh

  22. Hi Brijesh,

    Thanks for expressing interest in the blog. I believe you need to run perfmon for a bit longer, and check the Avg. Disk Sec/Read and Avg. Disk Sec/Transfer counters as well. Writes are generally faster than reads on most SANs, though the exact details may vary. You should also check Disk Bytes/Sec to see the amount of throughput you're getting.

    I believe you are facing an I/O bottleneck, and disk free space and power options should not have anything to do with it.

    Hope this helps.

  23. Sudipto says:

    Hi HarshDeep

    Very good explanation and approach. Thanks for this concise blog.

  24. hi says:

    May I know what scale I should use for Avg. Disk Sec/Write and Avg. Disk Queue Length?

    Also, one more question: how do I know how many spindles my server/VM has? Thanks in advance.

  25. Bharat says:

    We are observing Avg. Disk Sec/Read (random spikes > 200 ms), Avg. Disk Sec/Write (random spikes > 500 ms) and Disk Transfers/sec (random spikes > 200) during ETL bulk-migration jobs.

    Please tell me if disk I/O is the performance bottleneck in this case?