In production servers we wanted to know why SQL server is doing more IO on disks, Disks can be overwhelmed if you are requesting more than what they can provide . I recently worked on issue where customer wanted to know why SQL server is doing more IO and how to reduce it.
We have collected PSSDIAG, which is ideal in this situation because it has combination of tools in it like perfmon, profiler trace, DMV's output.
first let me give you some information about what could be the capacity limits of normal single disk and SAN
Single disk :- Avg disk bytes/sec:- 50 to 70 Mb/sec (approximate values)
SAN := Avg disk bytes/sec 150-300 Mb/sec (approximate values) . Any thing more than above limits disk are expected to show some latency and are considered overwhelmed.
In our case we had normal single disk, but SQL server was doing more than 100 mb/sec (avg disk bytes/sec) including reads and writes which is high. so We wanted to know how to reduce this.
and drive which we wanted to know is W:\
If you see above screen shot, LogicalDisk(W:)\Disk Bytes/sec is fluctuating for every min or so. since this is almost happening for every one min first thing which came into my mind is CHECKPOINTs, So added CHECKPOINT counter to performance monitor.
and graph looks like below.
From the graph we can clearly see that as in checkpoint pages/sec increases Disk bytes/sec is also increasing , So from the graph IO is directly proportional to Checkpoint pages/sec.
Checkpoint pages/sec means:- Number of pages flushed by checkpoint or other operations that require all dirty pages to be flushed.
So Checkpoint duty is to flush dirty pages from buffer pool to disk and pages in the buffer pool will become dirty as in they are modified , So more writes more dirty pages , more dirty pages more IO on disk.
This is all falling back to number of modification you are doing on databases So we need to focus on reducing the modification on the databases so that dirty pages count will reduce and IO on drive will reduce
So we started looking at top write intensive queries and found a purge process that is being executed for every minute from application , and that explains why you are seeing spikes in disk bytes/sec for every min. We reduced the amount of records that purge job deletes every min and that reduced the half of the IO request on Drive W:\
So ideal solution in this case going to be
Reduce the number of modification happening on the server (remove unnecessary writes)
Use the Disk which can provide more throughput ( like SAN drives)
Spilt the number of files on the drive to multiple drives so that load on that drive will reduce. ( we had 24 data files on this drive)
- Check the fragmentation on database and see you can defrag the databases , this might give some relief
- apply any missing indexes and see if that help.
high disk utilization can be because of multiple reason and above is one of such, hope this helps someone.