In my experience in the field, I’ve been asked a lot about disk performance for SQL Server in the context of Dynamics ERPs (AX 2009, AX 2012, or D365 FO LBD). A question I often get is: how many IOPS do you recommend?
Let’s step back for a minute here. An ERP solution is a complex product that can include a high number of modules, which may or may not be used by a given customer, and the way those modules are used can differ greatly from one customer to the next. To put things into perspective, imagine you’re on a trip, maybe on a train or a plane. You strike up a conversation with the person next to you and at some point they say: “Look, I’m the CEO of company XYZ (which you know nothing about); how many employees do you think I should have to run my business?” Without any other information, how could you possibly answer that question? It’s much the same for disk requirements when someone asks you “how many IOPS do we need?” and you know nothing of how they are actually using the product.
One answer works for both questions, and it’s about as much as you can say at this point:
You need as much as is necessary to run your daily processes in a timely manner.
Helpful, right? But until you can dig more into the details, that’s all you can reasonably say. Also, SPOILER ALERT! IOPS is not the only thing that matters!
In this blog series, I will discuss what to measure and how to measure it in order to assess the health of your disks, and how to approach that touchy question: is this enough, or do I need more? In this first part of the series, I will discuss how and what to collect. In the second part, I will dive into how to analyze the collected data. Finally, in the third part, I will discuss how to choose the right kind of disks based on this analysis for Azure IaaS-based deployments.
Collecting relevant data to analyze disk performance and needs
While you could provide a ton of information, such as how many users are using the system, how many purchases are created in the system every day, and so on, the reality is that the application is so complex, and there are so many ways to use it, that you would still not be able to say precisely what capacity is needed; there are just too many variables. Sure, you can try to translate the answers to all those questions into how many transactions (selects, inserts, updates, deletes) are happening on the database to get closer to an estimate, but the IO requirements may still vary depending on factors such as: how much memory does the server have? How much of the data is being reused (i.e., how much of the data in the database is “sleeping” data vs. hot data)? That changes how much data needs to be read from the disks, which ultimately affects your disk needs.
A more pragmatic approach is to assess your running system to determine what your current needs are. If you don’t have a running system, you will want to run performance tests that simulate “real life” usage patterns and assess those.
There are two things that we will take into consideration here.
- How am I currently using the disks?
- What is the max performance I can expect of my disks?
Assessing disk usage
1. Collecting your disk usage
Assessing disk usage is something that can be performed very easily, without having to install any additional tools.
All you need to do is collect some perfmon (Performance Monitor) counters related to disks.
Perfmon is a built-in tool available on any version of Windows be it server or client versions.
Simply go to perfmon > Data Collector Sets > User Defined and create a new Data collector set.
Give it a name and choose to create it manually. I will not be using a template here as this will be a very simple collector set, but you could opt to create a template and reuse it on different machines.
I’m only interested in the Performance counter data type here:
In the next window, let’s click Add to choose our counters:
There are two options here: you can go for the PhysicalDisk counters or the LogicalDisk counters. Both offer the same set of counters. If you have one disk servicing one drive (i.e., one disk for C:, one disk for D:), it doesn’t matter which you choose, as they will be equivalent. If you have multiple disks servicing one drive, or one disk servicing multiple drives, you’ll want both, as you’ll get different information out of each. Logical tends to be the more important, as that is what SQL Server will “see”; however, when investigating, you might want to understand what is happening at the per-disk level, in which case you’ll need the physical counters.
As there are a lot of different counters available, and I want to keep this as simple as possible since we will be collecting at a high frequency, I will add only the counters I want for this analysis. I will also collect these counters for every instance, meaning every drive, for the purpose of the example; but if you have a drive you don’t really care about (e.g., a drive that is only used to store resources such as installation sources), feel free to exclude it. You will need, at a minimum: the C: drive, any drive that holds system files (if you moved the pagefile to another drive, for example), the Data, Log, and system database drives, and the Backup drive.
Here is the list:
Avg. Disk Bytes/Read -> Average block size of read operations
Avg. Disk Bytes/Transfer -> Average block size of read/write operations combined
Avg. Disk Bytes/Write -> Average block size of write operations
Avg. Disk sec/Read -> Read latency
Avg. Disk sec/Transfer -> Combined read/write latency
Avg. Disk sec/Write -> Write latency
Current Disk Queue Length -> Number of IO requests outstanding at the moment of collection
Disk Bytes/sec -> Total throughput
Disk Read Bytes/sec -> Read throughput
Disk Reads/sec -> Read operations/sec (the read portion of IOPS)
Disk Transfers/sec -> IOPS!
Disk Writes/sec -> Write operations/sec (the write portion of IOPS)
Repeat for Physical disks and we are set.
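These counters are arithmetically related, which is handy both for sanity-checking a capture and for seeing why IOPS alone is not the full story. A small sketch with made-up sample values (not from any real capture):

```python
# Sanity-check the arithmetic relationships between the perfmon disk counters.
# All numbers below are invented for illustration; substitute your own samples.

reads_per_sec = 800.0          # Disk Reads/sec
writes_per_sec = 200.0         # Disk Writes/sec
avg_bytes_per_read = 65536.0   # Avg. Disk Bytes/Read (64 KB blocks)
avg_bytes_per_write = 8192.0   # Avg. Disk Bytes/Write (8 KB blocks)

# Disk Transfers/sec is simply reads + writes: the total IOPS.
transfers_per_sec = reads_per_sec + writes_per_sec

# Disk Bytes/sec (total throughput) is the sum of read and write throughput.
read_bytes_per_sec = reads_per_sec * avg_bytes_per_read
write_bytes_per_sec = writes_per_sec * avg_bytes_per_write
disk_bytes_per_sec = read_bytes_per_sec + write_bytes_per_sec

print(f"IOPS: {transfers_per_sec:.0f}")                       # IOPS: 1000
print(f"Throughput: {disk_bytes_per_sec / 1024**2:.1f} MB/s") # Throughput: 51.6 MB/s
```

Notice how the same 1,000 IOPS could mean a few MB/s or tens of MB/s depending on block size; this is why the Avg. Disk Bytes counters matter just as much as Disk Transfers/sec.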
Now, very important: I will change the collection frequency to 1 second!
Be aware that collecting this frequently will make the file where the collected data is stored grow quite a lot, and that while the impact of this collection is overall pretty low, it’s not zero and it is more impactful than collecting, say, every 30 seconds.
However, for our purposes, a 30-second collection interval would be way too imprecise to really assess and understand what is going on. Given how fast things happen, a collection every 1 second is kind of already stretching it.
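To get a feel for how fast the log grows at a 1-second interval, here is a rough back-of-the-envelope estimate. The bytes-per-sample figure is purely an assumption for illustration (actual log file formats store samples differently), so run a short collection of your own to calibrate:

```python
# Rough estimate of perfmon log growth at a 1-second sample interval.
# bytes_per_counter_sample is an ASSUMED average including overhead; real
# log files vary, so treat the result as an order-of-magnitude figure only.

counters_per_instance = 12       # the 12 counters listed above
instances = 6                    # e.g. C:, D:, E:, F:, G: plus _Total
bytes_per_counter_sample = 100   # assumption for illustration
interval_seconds = 1

samples_per_hour = 3600 // interval_seconds
bytes_per_hour = (samples_per_hour * counters_per_instance
                  * instances * bytes_per_counter_sample)

print(f"~{bytes_per_hour / 1024**2:.0f} MB per hour of collection")
```

Even under these modest assumptions, a full day of collection across both logical and physical instances adds up quickly, which is why we will only run it during targeted windows.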
Next choose where you want the collection file to be stored:
Make sure this drive has multiple GB of space available, and let’s hit Finish.
We now have a data collector set that can easily be started or stopped using the start/stop buttons:
Now, as I mentioned, this collects heavily (every second), so you will want to run it only during very targeted periods of the day.
E.g., you have peak usage between 11 AM and noon? Run it for that one hour only.
You have a critical batch running at night starting at 1 AM that must finish before 4 AM? Run it for that period only.
You can easily automate the start/stop of this collector by right-clicking on the “Disk Performance” set above > Properties and playing with the schedule (when it should start) and the stop condition (when it should end).
For more information on setting up perfmon counters, you can check out the following blog:
2. Collecting your disk max capacity
Another thing you will want to know is what capacity your disks can sustain. This is pretty important because you’ll want to understand whether the data you collected previously is representative of what you really need, or whether you simply hit the max the disks can sustain but actually need more performance.
A useful tool in the past was SQLIO. That tool has been retired and replaced by a more generic and flexible tool called DiskSpd. It’s a free tool that you can download from https://aka.ms/DiskSpd
You can also find valuable information on how to use this tool here:
DiskSpd is a command-line tool that can be run to stress your disks. It operates by creating a file on the partition you want to assess and throwing usage patterns at it. The nice thing about it is that you can specify the size of the file used for the test, the block size of the IO operations, and the distribution between read and write operations.
This is very interesting because the specs you get for disks (especially for SANs) are often a little blurry. A vendor may claim X amount of IOPS … but not provide you with the throughput. Or maybe the SAN has very good read capabilities via caching and read acceleration but doesn’t reach that “IOPS” limit if you’re doing mostly write operations. Often that “max” IOPS figure doesn’t specify whether it’s mostly read, mostly write, or what the distribution between the two is to achieve it.
An example use of this tool is running the following command:
DiskSpd.exe -c50G -d300 -r -w40 -t8 -o8 -b64K -h -L D:\Diskspd-v2.0.17\testfile.dat > D:\Diskspd-v2.0.17\DT-RW-64k-Result.txt
In this command there are multiple parameters that you need to be aware of.
-c50G -> The size of the test file. The file will be created by the tool, so make sure you have enough disk space for it. In this example we are creating a 50 GB file. For simulating a SQL data file you will want to use a pretty big file, as read/write performance on a large file may differ from that on a small one.
-d300 -> The duration of the test in seconds. In this case, 300 seconds.
-r -> Use random IO (as opposed to -s, sequential IO).
-w40 -> The read/write distribution. In this case 40% writes (and therefore 60% reads). Omitting this defaults to 100% read operations. We will see in part 2 of the series how to choose this parameter for a relevant test.
-t8 -> The number of threads used by the tool. This should not be more than the total number of CPUs available, and if the CPU count is greater than 8 there is little point in going over 8. I recommend using a value equal to the number of CPUs, capped at 8.
-o8 -> The number of outstanding IO requests per thread per target (the queue depth).
-b64K -> The block size of the IO operations. In this case, 64 KB. We will see in part 2 of the series how to choose this size for a relevant test.
-h -> Disables software caching and hardware write caching, so the test actually hits the disks rather than memory.
-L -> Measures and reports latency statistics in the results.
D:\Diskspd-v2.0.17\testfile.dat -> This is where the “test” file used by the tool will be created (or reused if it already exists). Note that the file path is a standalone argument, not part of -L.
> D:\Diskspd-v2.0.17\DT-RW-64k-Result.txt -> This redirects the results of the test to a file for analysis.
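Once a run completes, the results file reports totals such as bytes moved and IO counts per thread and per target. You can turn those raw totals back into headline IOPS and throughput figures yourself, which makes it easy to compare runs with different block sizes against vendor claims. A minimal sketch with invented totals (not from a real run):

```python
# Convert raw totals, as a DiskSpd results file reports them, back into
# IOPS and throughput. The totals below are invented for illustration.

total_ios = 1_500_000              # total read+write IOs over the run (invented)
block_size_bytes = 64 * 1024       # -b64K
total_bytes = total_ios * block_size_bytes
duration_seconds = 300             # -d300

iops = total_ios / duration_seconds
throughput_mb_s = total_bytes / duration_seconds / 1024**2

print(f"IOPS: {iops:.0f}")                        # IOPS: 5000
print(f"Throughput: {throughput_mb_s:.1f} MB/s")  # Throughput: 312.5 MB/s
```

Note how the same run could be marketed as either “5,000 IOPS” or “312 MB/s”; whether you hit an IOPS limit or a throughput limit first depends on your block size, which is exactly why we collected Avg. Disk Bytes/Transfer earlier.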
This concludes the first part of this series. In this part, we’ve seen how to set up perfmon to collect data on your current usage. We’ve also discussed which tool can be used to measure the max performance of your current disks.
Hopefully, this blog has been helpful in providing you with the tools needed to measure the usage of your disks, and in showing how to configure and set them up. In the next part of this series we will discuss how to analyze the data that has been collected.
Premier Field Engineer