Cluster Shared Volume – A Systematic Approach to Finding Bottlenecks

In this post we will discuss how to determine whether the performance you observe on a Cluster Shared Volume (CSV) matches your expectations, and how to identify which layer in your solution may be the bottleneck. This post assumes you have read the previous posts in the CSV series (see the bottom of this post for links to all the posts in the series).

Sometimes someone asks why CSV performance does not match their expectations and how to investigate. The answer is that CSV consists of multiple layers, and the most straightforward troubleshooting approach is a process of elimination: first remove all the layers and test the speed of the disk itself, then add the layers back one by one until you find the one causing the issue.

You might be tempted to use file copy as a quick way to test performance. While file copy is an important workload, it is not the best way to test your storage performance. This blog goes into more detail on why it does not work well: http://blogs.technet.com/b/josebda/archive/2014/08/18/using-file-copy-to-measure-storage-performance-why-it-s-not-a-good-idea-and-what-you-should-do-instead.aspx. It is still important to understand what file copy performance you can expect from your storage, so I would suggest running file copy tests after you are done with the micro-benchmarks, as part of workload testing.

To test performance you can use DiskSpd, which is described in this blog post: http://blogs.technet.com/b/josebda/archive/2014/10/13/diskspd-powershell-and-storage-performance-measuring-iops-throughput-and-latency-for-both-local-disks-and-smb-file-shares.aspx.

When selecting the size of the file you will run the tests on, be aware of the caches and tiers in your storage. For instance, a storage array might have a cache on NVRAM or NVMe. All writes that go to the fast tier might be very fast, but once you have used up all the space in the cache you will be limited to the speed of the next, slower tier. If your intention is to test the cache, create a file that fits into the cache; otherwise create a file that is larger than the cache.

Some LUNs might have some offsets mapped to SSDs while others map to HDDs; a tiered space is an example. When creating a file, be aware of which tier the blocks of the file are located on.

Additionally, when measuring performance do not assume that if you create two LUNs with similar characteristics you will get identical performance. If the LUNs are laid out on the physical spindles in a different way, that alone might be enough to cause completely different performance behavior. To avoid surprises as you run tests through the different layers (described below), ALWAYS use the same LUN. Several times we have seen cases where someone ran tests against one LUN, then ran tests over CSVFS with another LUN believed to be similar, observed worse results in the CSVFS case, and incorrectly concluded that CSVFS was the problem. In the end, removing the disk from CSV and running the test directly on the LUN showed that the two LUNs simply had different performance.

The sample numbers you will see in this post were collected on a 2-node cluster:

CPU: Intel(R) Xeon(R) CPU E5-2450L 0 @ 1.80GHz, Intel64 Family 6 Model 45 Stepping 7, GenuineIntel; 2 NUMA nodes, 8 cores each, with Hyper-Threading disabled.
RAM: 32 GB DDR3.
Network: one RDMA Mellanox ConnectX-3 IPoIB adapter (54 Gbps) and one Intel(R) I350 Gigabit network adapter.
Shared disk: a single HDD connected using SAS; model HP EG0300FBLSE, firmware version HPD6. Disk cache is disabled.

 

With this hardware my expectation is that the disk should be the bottleneck, and going over the network should not have any impact on throughput.

In the samples below I was running a single-threaded test application which at any time kept eight 8K outstanding IOs on the disk. In your tests you might want to add more variations with different queue depths, different IO sizes, and different numbers of threads/CPU cores utilized. To help, the table below outlines some tests to run and data to capture to get a more exhaustive picture of your disk's performance. Running all these variations may take several hours; if you know the IO patterns of your workloads you can significantly reduce the test matrix.

Queue depths: 1, 4, 16, 32, 64, 128, 256

IO mode: unbuffered, write-through

IO sizes: 4K, 8K, 16K, 64K, 128K, 256K, 512K, 1MB

Patterns (for each IO size): sequential read, sequential write, random read, random write, random 70% reads / 30% writes
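Rather than typing out every combination by hand, the matrix above can be generated as a list of DiskSpd command lines. This is a minimal sketch: the flags (-b IO size, -o queue depth, -t threads, -r random, -w write percentage, -Sh unbuffered write-through, -d duration, -L latency stats) are taken from the DiskSpd documentation, and the target file path is a placeholder you should replace.

```python
# Generate one DiskSpd command line per cell of the test matrix above.
# Flags assumed from the DiskSpd documentation; target path is hypothetical.
queue_depths = [1, 4, 16, 32, 64, 128, 256]
io_sizes = ["4K", "8K", "16K", "64K", "128K", "256K", "512K", "1M"]
patterns = {
    "sequential read":  "-w0",
    "sequential write": "-w100",
    "random read":      "-r -w0",
    "random write":     "-r -w100",
    "random 70/30":     "-r -w30",
}

def build_matrix(target="T:\\testfile.dat"):  # placeholder test file
    cmds = []
    for qd in queue_depths:
        for size in io_sizes:
            for name, flags in patterns.items():
                cmds.append(
                    f"diskspd -b{size} -o{qd} -t1 {flags} -Sh -d60 -L {target}")
    return cmds

cmds = build_matrix()
print(len(cmds))   # 7 queue depths x 8 sizes x 5 patterns = 280 runs
print(cmds[0])
```

At roughly a minute per run this is several hours of testing, which is why trimming the matrix to your workload's known IO patterns pays off.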

If you have Storage Spaces then it might be useful to first collect performance numbers for the individual disks the Space will be created from. This will help set expectations for the best and worst case performance you should expect from the Space.

As you test the individual spindles that will be used to build Storage Spaces, pay attention to the different MPIO (Multi-Path IO) modes. For instance, you might expect that round robin over multiple paths would be faster than failover, but for some HDDs you may find they give better throughput with failover than with round robin. When it comes to a SAN, the MPIO considerations are different: with a SAN, MPIO runs between the computer and a controller in the SAN storage box, while with Storage Spaces, MPIO runs between the computer and the HDD, so it comes down to how efficiently the HDD's firmware handles IO arriving on different paths. In production, for a JBOD connected to multiple computers, IO will come from different computers, so the HDD firmware needs to handle IOs from multiple computers/paths efficiently in any case. As with any kind of performance testing, do not jump to the conclusion that a particular MPIO mode is good or bad; always test first.

Another commonly discussed topic is what the file system allocation unit size (a.k.a. cluster size) should be. There is a variety of options between 4K and 64K.

 

For starters, CSVFS has no requirements for the underlying file system cluster size; it is fully compatible with all cluster sizes. The primary influence on cluster size is the workload. For Hyper-V and for SQL Server data and log files it is recommended to use a 64K cluster size with NTFS. Since CSV is most commonly used to host VHDs in one form or another, 64K is the recommended allocation unit size with NTFS and 4K with ReFS. Another influence is your storage array, so it is good to discuss with your storage vendor any optimizations they recommend that are unique to your storage device. There are also a few other considerations, so let's discuss:

  1. File system fragmentation. If, for the moment, we set the storage underneath the file system aside and look only at the file system layer by itself, then:
    1. Smaller blocks mean better space utilization on the disk, because if your file is only 1K then with a 64K cluster size this file will consume 64K on the disk, while with a 4K cluster size it will consume only 4K; you can fit sixteen (64/4) 1K files in 64K. If you have lots of small files, a small cluster size might be a good choice.
    2. On the other hand, if you have large files that keep growing, then a smaller cluster size means more fragmentation. For instance, in the worst case a 1 GB file with 4K clusters might have up to (1024×1024/4) 262,144 fragments (a.k.a. runs), while with 64K clusters it will have at most (1024×1024/64) 16,384 fragments. So why does fragmentation matter?
      1. If you are constrained on RAM you may care more, as more fragments mean more RAM is needed to track all this metadata.
      2. If your workload generates IO larger than the cluster size, and you do not run defrag frequently enough and consequently have lots of fragments, then the workload's IO might need to be split more often when the cluster size is smaller. For instance, if on average the workload generates a 32K IO, then in the worst case on a 4K cluster size this IO might need to be split into (32/4) eight 4K IOs to the volume, while with a 64K cluster size it would never get split. Why does splitting matter? A production workload will usually be close to random IO, and the larger the blocks are, the higher the throughput you will see on average, so ideally we should avoid splitting IO when it is not necessary.
      3. If you are using storage copy offload, some storage boxes support it only at a 64K granularity and will fail if the cluster size is smaller. Check with your storage vendor.
      4. If you anticipate lots of large file-level trim commands (trim is the file system counterpart of the storage block UNMAP). You might care about trim if you are using a thinly provisioned LUN or if you have SSDs; the garbage collection logic in SSD firmware benefits from knowing which blocks are no longer used by the workload. For example, let's assume we have a VHDX with NTFS inside, and this VHDX file itself is very fragmented. When you run defrag on the NTFS inside the VHDX (most likely inside the VM), then among other steps defrag will do free space consolidation and issue a file-level trim to reclaim the free blocks. If there is lots of free space this might be a trim for a very large range. This trim arrives at the NTFS that hosts the VHDX, which then has to translate this large file trim into a block unmap for each fragment of the file. If the file is highly fragmented this may take a significant amount of time. A similar scenario might happen when you delete a large file, or lots of files at once.
      5. The list above is not exhaustive by any means; I am focusing on what I view as the most relevant points.
      6. From the file system perspective, the rule of thumb is to prefer a larger cluster size unless you are planning to have lots of tiny files and the disk space savings from a smaller cluster size are important. No matter what cluster size you choose, you will be better off running defrag periodically. You can monitor how much fragmentation is affecting your workload by looking at the CSV File System Split IO and PhysicalDisk Split IO performance counters.
  2. File system block alignment and storage block alignment. When you create a LUN on a SAN, or a Storage Space, it may be built out of multiple disks with different performance characteristics. For instance, a Mirrored Space (http://blogs.msdn.com/b/b8/archive/2012/01/05/virtualizing-storage-for-scale-resiliency-and-efficiency.aspx) contains slabs on many disks, with some slabs acting as mirrors; the entire space address range is then subdivided into 64K blocks and round-robined across these slabs on different disks in RAID0 fashion to give you the aggregated throughput of multiple spindles.

This means that a 128K IO has to be split into two 64K IOs that go to different spindles. What if your file system is formatted with a cluster size smaller than 64K? Then a contiguous block in the file system might not be 64K aligned. For example, if the file system is formatted with 4K clusters and we have a file that is 128K, the file can start at any 4K-aligned offset. If my application performs a 128K read, it is possible this 128K block will map to up to three 64K blocks on the storage space.

If you format your file system with a 64K cluster size, then file allocations are always 64K aligned and on average you will see fewer IOs on the spindles. The performance difference is even larger when it comes to writes on parity, RAID5- or RAID6-like LUNs. When you overwrite part of a block, the storage has to do a read-modify-write, multiplying the number of IOs hitting your spindles; if you overwrite the entire block it is exactly one IO. If you want to be accurate, evaluate the average block size you expect your workload to produce. If it is larger than 4K, you want the file system cluster size to be at least as large as your average IO size, so that on average it does not get split at the storage layer. A rule of thumb might be to simply use the same cluster size as the block size used by the storage layer. Always consult your storage vendor for advice; modern storage arrays have very sophisticated tiering and load balancing logic, and unless you understand everything about how your storage box works you might end up with unexpected results. Alternatively, you can run a variety of performance tests with different cluster sizes and see which one gives you better results. If you do not have time to do that, I recommend a 64K cluster size.
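The worst-case arithmetic from the fragmentation and alignment discussions above can be sketched as a few helper functions. This is a simplified model: it assumes every cluster can land in a separate fragment and that storage slabs are 64K.

```python
def worst_case_fragments(file_size_kb, cluster_kb):
    """Worst case: every cluster is its own fragment (run)."""
    return file_size_kb // cluster_kb

def worst_case_io_splits(io_kb, cluster_kb):
    """Worst case number of volume IOs one workload IO splits into,
    when every cluster-sized piece lives in a different fragment."""
    return max(1, io_kb // cluster_kb)

def slabs_touched(offset_kb, io_kb, slab_kb=64):
    """Number of 64K storage-space slabs a single IO can touch."""
    first = offset_kb // slab_kb
    last = (offset_kb + io_kb - 1) // slab_kb
    return last - first + 1

KB = 1024
# 1 GB file: 262,144 fragments with 4K clusters vs 16,384 with 64K clusters
print(worst_case_fragments(1024 * KB, 4), worst_case_fragments(1024 * KB, 64))
# A 32K IO may split into 8 IOs on a 4K-cluster volume, but never on 64K
print(worst_case_io_splits(32, 4), worst_case_io_splits(32, 64))
# A 128K read starting at a 4K-aligned offset can touch 3 slabs
print(slabs_touched(4, 128))
```

These are upper bounds; a freshly created, unfragmented file will be far below them, which is exactly why periodic defrag keeps the split counters low.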

Performance of an HDD/SSD might change after a disk or storage box firmware update, so it may save you time to rerun the performance tests after an update.

As you run the tests you can use the performance counters described here http://blogs.msdn.com/b/clustering/archive/2014/06/05/10531462.aspx to get further insight into the behavior of each layer by monitoring average queue depth, latency, throughput and IOPS at the CSV, SMB and physical disk layers. For instance, if your disk is the bottleneck then latency and queue depth at all of these layers will be the same. Once you see queue depth and latency at a higher layer exceed what you see on the disk, that layer might be the bottleneck.
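The comparison logic just described can be expressed as a small sketch. The sample numbers are hypothetical, and the layer names are abbreviations of the counter sets: the layer whose latency first rises noticeably above the layer below it is the suspect.

```python
def find_bottleneck(layer_latency_ms, tolerance_ms=0.5):
    """layer_latency_ms: list of (layer_name, avg_latency_ms) ordered from
    the topmost layer (CSVFS) down to the physical disk. A layer whose
    latency exceeds the layer below it is queueing IO itself."""
    suspects = []
    for (upper, up_ms), (lower, low_ms) in zip(layer_latency_ms,
                                               layer_latency_ms[1:]):
        if up_ms - low_ms > tolerance_ms:
            suspects.append(upper)
    # if every layer tracks the disk, the disk itself is the bottleneck
    return suspects or [layer_latency_ms[-1][0]]

# Hypothetical sample: SMB adds ~9 ms on top of a ~1 ms disk
sample = [("CSVFS", 10.1), ("SMB", 10.0), ("PhysicalDisk", 1.0)]
print(find_bottleneck(sample))   # ['SMB']
```

In real monitoring you would feed this from the CSV File System, SMB Client Share and PhysicalDisk latency counters averaged over the test run.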

Run performance tests only on hardware that is not currently used by any other workloads/tests; otherwise your results may not be valid because of too much variability. You may also want to rerun each variation several times to check for variability.

Baseline 1 – No CSV; Measure Performance of NTFS

In this case IO has to traverse the NTFS file system and the disk stack in the OS, so conceptually we can represent it this way:

For most disks the expectation is that sequential read >= sequential write >= random read >= random write. For an SSD you may observe no difference between random and sequential, while for an HDD the difference may be significant. The difference between read and write will vary from disk to disk.
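That ordering gives you a quick sanity check on any baseline run. A minimal sketch, assuming the HDD rule of thumb above (an SSD may legitimately show random ≈ sequential):

```python
def matches_hdd_expectations(seq_read, seq_write, rand_read, rand_write):
    """Rule of thumb for a typical HDD:
    sequential read >= sequential write >= random read >= random write."""
    return seq_read >= seq_write >= rand_read >= rand_write

# Baseline 1 IOPS from this post (8K IO, queue depth 8)
print(matches_hdd_expectations(19906, 17311, 359, 273))  # True
```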

As you run this test, keep an eye on whether you are saturating the CPU. This can happen when your disk is very fast, for instance if you are using a Simple Space backed by 40 SSDs.

Run the baseline tests multiple times. If you see variance at this level, it is most likely coming from the disk and will affect the other tests as well. Below are the numbers I collected on my hardware; the results match expectations.

Queue depth: 8. Unbuffered, write-through, 8K IO:

sequential read: 19906 IOPS (155 MB/sec)
sequential write: 17311 IOPS (135 MB/sec)
random read: 359 IOPS (2 MB/sec)
random write: 273 IOPS (2 MB/sec)

 
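The IOPS and MB/sec columns in these tables are two views of the same measurement, so one can always be derived from the other. Note the tables truncate rather than round (and "MB/sec" here is MiB/sec):

```python
def mb_per_sec(iops, io_size_kb):
    """Throughput in MiB/sec, truncated as in the tables in this post."""
    return iops * io_size_kb // 1024

# Baseline 1 numbers from the table above (8K IO)
print(mb_per_sec(19906, 8))  # 155
print(mb_per_sec(17311, 8))  # 135
print(mb_per_sec(359, 8))    # 2
```

This is handy for spotting transcription errors: if an IOPS figure and its MB/sec figure disagree, one of them is wrong.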

Baseline 2 – No CSV; Measure SMB Performance between Cluster Nodes

To run this test, online the clustered disk on one cluster node and assign it a drive letter, for example K:. Then run the test from another node over SMB using an admin share; for instance, your path might look like \\Node1\K$. In this case IO has to go over the following layers:

You need to be aware of SMB Multichannel and make sure you are using only the NICs you expect the cluster to use for intra-node traffic. You can read more about SMB Multichannel in a clustered environment in this blog post: http://blogs.msdn.com/b/emberger/archive/2014/09/15/force-network-traffic-through-a-specific-nic-with-smb-multichannel.aspx

If you have an RDMA network, or if your disk is slower than what SMB can pump through all channels and you have a sufficiently large queue depth, you might see Baseline 2 come close or even equal to Baseline 1. That means your bottleneck is the disk, not the network.

Run the baseline test several times. If you see variance at this level, it is most likely coming from the disk or the network and will affect the other tests as well. Assuming you already sorted out the variance coming from the disk while collecting Baseline 1, you should now focus on variance caused by the network.

Here are the numbers I collected on my hardware. To make them easier to compare, I am repeating the Baseline 1 numbers here.

Queue depth: 8. Unbuffered, write-through, 8K IO (Baseline 2 / Baseline 1):

sequential read: 19821 / 19906 IOPS (154 / 155 MB/sec)
sequential write: 810 / 17311 IOPS (6 / 135 MB/sec)
random read: 353 / 359 IOPS (2 / 2 MB/sec)
random write: 272 / 273 IOPS (2 / 2 MB/sec)

 

In my case I verified that IO was going over RDMA, and the network indeed adds almost no latency, but there is a difference in sequential write IOPS compared to Baseline 1, which seems odd. First I looked at the performance counters:

Physical disk performance counters for Baseline 1

Physical disk and SMB Server Share performance counters for Baseline 2

SMB Client Share and SMB Direct Connection performance counters for Baseline 2

Observe that in both cases PhysicalDisk\Avg. Disk Queue Length is the same. That tells us SMB does not queue IO, and the disk has all the pending IOs all the time. Second, observe that PhysicalDisk\Avg. Disk sec/Transfer is 0 in Baseline 1 while it is 10 milliseconds in Baseline 2. Huh! This tells me the disk got slower just because the requests came over SMB!?

The next step was to record a trace using the Windows Performance Toolkit (http://msdn.microsoft.com/en-us/library/windows/hardware/hh162962.aspx) with Disk IO for both Baseline 1 and Baseline 2. Looking at the traces, I noticed the disk service time for some reason got longer in Baseline 2! Then I also noticed that when requests came over SMB they hit the disk from 2 threads, while with my test utility all requests were issued from a single thread. Remember that we are investigating sequential write. Even though the test over SMB issues all writes from one thread in sequential order, SMB on the server dispatches these writes to the disk using 2 threads, and sometimes the writes get reordered. Consequently, the IOPS I am getting for sequential write are close to random write. To verify this I reran the Baseline 1 test with 2 threads, and bingo! I got matching numbers.

Here is what you would see in WPA for IO over SMB.

The average disk service time is about 8.1 milliseconds, and IO time is about 9.6 milliseconds. The green and violet colors correspond to IO issued by different threads. If you look closely, expand the table, remove the thread ID from grouping and sort by Init Time, you can see how the IOs are interleaved and Min Offset is not strictly sequential:

Without SMB, all IOs came on one thread; the disk service time is about 600 microseconds, and IO time is about 4 milliseconds.

If you expand and sort by Init Time, you will see Min Offset is strictly increasing:
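A toy model makes the reordering easy to see: splitting a sequential stream across two dispatch threads yields a submission order whose offsets are no longer strictly increasing. This is an illustration only, not SMB's actual scheduling:

```python
def two_thread_order(offsets):
    """Toy model: a sequential stream is split between 2 server threads,
    and the second thread's IO reaches the disk ahead of the first's."""
    a, b = offsets[0::2], offsets[1::2]   # what each thread submits
    order = []
    for x, y in zip(a, b):
        order.extend([y, x])              # thread 2 sneaks ahead of thread 1
    return order

def strictly_increasing(seq):
    return all(p < q for p, q in zip(seq, seq[1:]))

stream = list(range(0, 512, 8))           # 8K-spaced sequential offsets (KB)
print(strictly_increasing(stream))                    # True: single thread
print(strictly_increasing(two_thread_order(stream)))  # False: interleaved
```

Every offset still gets written exactly once; only the order the disk sees changes, and on an HDD that order is what decides whether the head seeks.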

In production, in most cases your workload will be close to random IO; sequential IO only gives you a theoretical best case scenario.

The next interesting question is why we do not see similar degradation for sequential read. The theory is that for reads the disk might read the entire track and keep it in its cache, so even when reads are rearranged the track is already in the cache and on average reads are not affected. Since I disabled the disk cache, writes always have to hit the spindle and more often pay the seek cost.

Baseline 3 – No CSV; Measure SMB Performance between Compute Nodes and Cluster Nodes

If you are planning to run workload and storage on the same set of nodes then you can skip this step. If you are planning to disaggregate workload and storage and access storage using a Scale-Out File Server (SOFS), then you should run the same test as Baseline 2; in this case select a compute node as the client, and make sure that over the network you are using the NICs that will handle compute-to-storage traffic once you create the cluster.

Remember that for reliability reasons files over SOFS are always opened with write-through, so we suggest always adding write-through to your tests. As an option, you can create a classic singleton (non-SOFS) file server over a clustered disk, create a Continuously Available share on that file server, and run your test there. That ensures traffic goes only over networks marked in the cluster as public, and because it is a CA share all opens will be write-through.

The layer diagram and performance considerations in this case are exactly the same as for Baseline 2.

CSVFS Case 1 – CSV Direct IO

Now add the disk to CSV.

You can run the same test on the coordinating node and on a non-coordinating node, and you should see the same results; the numbers should match Baseline 1. The length of the code path is the same, just with CSVFS instead of NTFS. The following diagram represents the layers IO will be going through:

Here are the numbers I collected on my hardware. To make them easier to compare, I am repeating the Baseline 1 numbers here.

On the coordinating node:

Queue depth: 8. Unbuffered, write-through, 8K IO (this test / Baseline 1):

sequential read: 19808 / 19906 IOPS (154 / 155 MB/sec)
sequential write: 17590 / 17311 IOPS (137 / 135 MB/sec)
random read: 356 / 359 IOPS (2 / 2 MB/sec)
random write: 273 / 273 IOPS (2 / 2 MB/sec)

 

On a non-coordinating node:

Queue depth: 8. Unbuffered, write-through, 8K IO (this test / Baseline 1):

sequential read: 19793 / 19906 IOPS (154 / 155 MB/sec)
sequential write: 17788 / 17311 IOPS (138 / 135 MB/sec)
random read: 359 / 359 IOPS (2 / 2 MB/sec)
random write: 273 / 273 IOPS (2 / 2 MB/sec)

 

CSVFS Case 2 – CSV File System Redirected IO on Coordinating Node

In this case we are not traversing the network, but we do traverse 2 file systems. If you are disk bound you should see numbers matching Baseline 1. If you have very fast storage and you are CPU bound, then you will saturate the CPU a bit sooner and will be about 5-10% below Baseline 1.

Here are the numbers I got on my hardware. To make them easier to compare, I am repeating the Baseline 1 and Baseline 2 numbers here.

Queue depth: 8. Unbuffered, write-through, 8K IO (this test / Baseline 1 / Baseline 2):

sequential read: 19807 / 19906 / 19821 IOPS (154 / 155 / 154 MB/sec)
sequential write: 5670 / 17311 / 810 IOPS (44 / 135 / 6 MB/sec)
random read: 354 / 359 / 353 IOPS (2 / 2 / 2 MB/sec)
random write: 271 / 273 / 272 IOPS (2 / 2 / 2 MB/sec)

It looks like some IO reordering is happening in this case too, so the sequential write numbers land somewhere between Baseline 1 and Baseline 2. All the other numbers line up perfectly with expectations.

CSVFS Case 3 – CSV File System Redirected IO on Non-Coordinating Node

You can put CSV in file system redirected mode using the cluster UI, or using the PowerShell cmdlet Suspend-ClusterResource with the parameter -RedirectedAccess.

This is the longest IO path, where we not only traverse 2 file systems but also go over SMB and the network. If you are network bound you should see numbers close to Baseline 2. If your network is very fast and your bottleneck is storage, then the numbers will be close to Baseline 1. If the storage is also very fast and you are CPU bound, then the numbers should be 10-15% below Baseline 1.

Here are the numbers I’ve got on my hardware. To make it easier for you to compare I am repeating Baseline 1 and Baseline 2 numbers here.

Queue depth: 8. Unbuffered, write-through, 8K IO (this test / Baseline 1 / Baseline 2):

sequential read: 19793 / 19906 / 19821 IOPS (154 / 155 / 154 MB/sec)
sequential write: 835 / 17311 / 810 IOPS (6 / 135 / 6 MB/sec)
random read: 352 / 359 / 353 IOPS (2 / 2 / 2 MB/sec)
random write: 273 / 273 / 272 IOPS (2 / 2 / 2 MB/sec)

 

In my case the numbers match Baseline 2 and, in all cases except sequential write, are close to Baseline 1.

CSVFS Case 4 – CSV Block Redirected IO on Non-Coordinating Node

If you have a SAN, you can play with LUN masking to hide the LUN from the node where you will run this test. If you are using Storage Spaces, then a Mirrored Space is always attached only on the coordinator node, and any non-coordinator node will be in block redirected mode as long as you do not have a tiering heatmap enabled on the volume. See this blog post for more details on how Storage Spaces tiering affects the CSV IO mode: http://blogs.msdn.com/b/clustering/archive/2014/03/13/10507826.aspx

Please note that CSV never uses Block Redirected IO on the coordinator node: since the disk is always attached on the coordinator node, CSV will always use Direct IO there. So remember to run this test on a non-coordinating node. If you are network bound you should see numbers close to Baseline 2. If your network is very fast and your bottleneck is storage, the numbers will be close to Baseline 1. If the storage is also very fast and you are CPU bound, the numbers should be about 10-15% below Baseline 1.

Here are the numbers I’ve got on my hardware. To make it easier for you to compare I am repeating Baseline 1 and Baseline 2 numbers here.

Queue depth: 8. Unbuffered, write-through, 8K IO (this test / Baseline 1 / Baseline 2):

sequential read: 19773 / 19906 / 19821 IOPS (154 / 155 / 154 MB/sec)
sequential write: 820 / 17311 / 810 IOPS (6 / 135 / 6 MB/sec)
random read: 352 / 359 / 353 IOPS (2 / 2 / 2 MB/sec)
random write: 274 / 273 / 272 IOPS (2 / 2 / 2 MB/sec)

 

In my case the numbers match Baseline 2 and are very close to Baseline 1.

Scale-out File Server (SoFS)

To test a Scale-Out File Server you need to create the SOFS resource using Failover Cluster Manager or PowerShell, and add a share that maps to the same CSV volume you have been using for the tests so far. Now your baselines are the CSVFS cases. With SOFS, SMB delivers IO to CSVFS on the coordinating or a non-coordinating node (depending on where the client is connected; use the PowerShell cmdlet Get-SmbWitnessClient to learn client connectivity), and then it is up to CSVFS to deliver the IO to the disk. The path CSVFS will take is predictable, but it depends on the nature of your storage and the current connectivity. You will need to select your baseline from CSVFS Cases 1-4.
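The mode selection spelled out in Cases 1-4 can be summarized in a small sketch. This is a simplification: real CSV also considers details such as tiering heatmaps, as noted in Case 4.

```python
def csv_io_path(on_coordinator, volume_in_fs_redirected_mode,
                disk_attached_locally):
    """Which CSV IO mode a node ends up using (simplified per Cases 1-4)."""
    if volume_in_fs_redirected_mode:
        # Case 2 on the coordinator; Case 3 elsewhere (adds SMB + network)
        return "File System Redirected IO"
    if on_coordinator or disk_attached_locally:
        return "Direct IO"                 # Case 1
    return "Block Redirected IO"           # Case 4

print(csv_io_path(True,  False, True))    # Direct IO
print(csv_io_path(False, False, False))   # Block Redirected IO
print(csv_io_path(False, True,  False))   # File System Redirected IO
```

Once you know which mode a given node will use, the matching CSVFS case from above is the baseline to compare your SOFS numbers against.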

If the numbers are similar to your CSV baseline, you know that SMB above CSV is not adding overhead, and you can look at the numbers collected for the CSV baseline to find where the bottleneck is. If the numbers are lower than the CSV baseline, then your client network is the bottleneck, and you should validate that the difference matches the difference between Baseline 3 and Baseline 1.

Summary

In this blog post we looked at how to tell whether CSVFS read and write performance is at the expected level. You can achieve that by running performance tests before and after adding the disk to CSV. Use the 'before' numbers as your baseline, then add the disk to CSV and test the different IO dispatch modes, comparing the observed numbers to the baselines to learn which layer is your bottleneck.

Thanks!
Vladimir Petter
Principal Software Engineer
High-Availability & Storage
Microsoft

To learn more, here are others in the Cluster Shared Volume (CSV) blog series:

Cluster Shared Volume (CSV) Inside Out
http://blogs.msdn.com/b/clustering/archive/2013/12/02/10473247.aspx

Cluster Shared Volume Diagnostics
http://blogs.msdn.com/b/clustering/archive/2014/03/13/10507826.aspx

Cluster Shared Volume Performance Counters
http://blogs.msdn.com/b/clustering/archive/2014/06/05/10531462.aspx

Cluster Shared Volume Failure Handling
http://blogs.msdn.com/b/clustering/archive/2014/10/27/10567706.aspx

Troubleshooting Cluster Shared Volume Auto-Pauses – Event 5120
http://blogs.msdn.com/b/clustering/archive/2014/12/08/10579131.aspx

Troubleshooting Cluster Shared Volume Recovery Failure – System Event 5142
http://blogs.msdn.com/b/clustering/archive/2015/03/26/10603160.aspx