Hi Cluster Fans,
In Part 1 of this blog series we discussed Performance Counters and their interaction with the Network, Multicast Request Reply, Global Update Manager and Database clustering components. This post will look at monitoring some additional cluster components: the Checkpoint Manager, Resource Control Manager, Resource Types, APIs and Cluster Shared Volumes.
Checkpoint Manager is a component that helps you make sure that the data of a clustered application is available on all the nodes. Failover Clustering supports two kinds of checkpoints.
Crypto Checkpoints allow you to keep your secrets protected and available on all the nodes. A secret is used to create protected containers, generate keys in the containers, and encrypt data using the keys. If your application uses the Crypto API and keeps secrets in a crypto container, you can associate a Crypto Checkpoint with your resource, providing it with the crypto provider and the container name. The cluster will export the keys from this container, export the container with its data, and save all of this information to the cluster database. After the resource is taken offline on a node, the Checkpoint Manager updates the snapshot. Before the resource is brought online on another node, the Checkpoint Manager creates or updates the crypto keys and the container with its data on that node from the snapshot. The Crypto Checkpoint performance counters allow you to monitor how often checkpoint save and restore operations are happening. The Delta counters tell you how many checkpoints have happened since the last sample; the sampling period is configurable through the properties of each performance counter. If the sampling period is 1 second, a Delta counter tells you how many checkpoints happened during the last second.
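To make the relationship between a cumulative counter and its Delta counterpart concrete, here is a minimal Python sketch. It assumes you have already collected cumulative readings at a fixed sampling period; the function name and the sample values are illustrative only and are not part of any cluster API.

```python
def deltas(cumulative_samples):
    """Return the change between consecutive cumulative readings.

    Each result corresponds to what a Delta counter would report for one
    sampling period (assumption: the counter does not reset between samples).
    """
    return [
        current - previous
        for previous, current in zip(cumulative_samples, cumulative_samples[1:])
    ]


# Hypothetical readings of a cumulative "saved" counter, one per second.
saved_total = [10, 10, 12, 15, 15]
print(deltas(saved_total))  # [0, 2, 3, 0]
```

With a 1 second sampling period, each value in the result is the number of checkpoints that happened during that second.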
Registry Checkpoints follow a similar idea. Let’s say your application keeps some data in the HKEY_LOCAL_MACHINE registry hive and you do not want to, or cannot, move this data to the cluster database, but you still need the data to be available on all nodes in the cluster. In that case, you can associate a registry checkpoint with your resource, and the cluster will make sure that the registry data is replicated to the other node before your resource is brought online there. Registry checkpoints have one additional feature: we monitor the registry key for changes, and when we are notified of a change we take a new snapshot of the data from this registry key.
You can monitor registry checkpoints using the Registry Checkpoint Restored, Registry Checkpoint Restored Delta, Registry Checkpoint Saved and Registry Checkpoint Saved Delta performance counters. Remember that a checkpoint is saved every time the registry key changes, but a restore happens only before the resource is brought online on another node.
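The difference between how often the Saved and Restored counters move can be illustrated with a toy model. This is only a sketch of the counting behavior described above, not the cluster's implementation; the class and method names are made up for this example.

```python
class RegistryCheckpointModel:
    """Toy model of when the saved and restored counts would increase."""

    def __init__(self):
        self.saved = 0
        self.restored = 0

    def registry_changed(self):
        # A change notification on the monitored key triggers a new snapshot.
        self.saved += 1

    def resource_coming_online_on_other_node(self):
        # The snapshot is applied before the resource is brought online.
        self.restored += 1


model = RegistryCheckpointModel()
for _ in range(5):
    model.registry_changed()          # five edits to the checkpointed key
model.resource_coming_online_on_other_node()  # one failover
print(model.saved, model.restored)  # 5 1
```

An application that writes to its checkpointed key frequently will therefore show a busy Saved Delta counter even when no failover ever happens.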
Resource Control Manager
Resource Control Manager (RCM) is the component responsible for monitoring resource state and handling resource failures. It also decides whether to place a resource in a separate Resource Host Monitor (RHS) process when that resource is observed to be unstable and to cause RHS crashes.
Groups Online – tells you how many groups are currently online on this node.
RHS Processes – tells you how many Resource Host Monitor processes are running on this node.
RHS Restarts – tells you how many Resource Host Monitor failures have happened on this node. A failure might be caused by one of the resources crashing or taking too long to perform an operation.
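If you log the RHS Restarts value periodically, a few lines of Python can point out when restarts occurred. The sketch below assumes the readings have already been collected as (timestamp, value) pairs in time order; the function name and sample data are hypothetical.

```python
def restart_events(samples):
    """Return (timestamp, increase) pairs for samples where the count grew.

    `samples` is a list of (timestamp, rhs_restarts) pairs in time order.
    """
    events = []
    for (_, previous_count), (timestamp, count) in zip(samples, samples[1:]):
        if count > previous_count:
            events.append((timestamp, count - previous_count))
    return events


# Hypothetical readings taken every five minutes.
readings = [("10:00", 0), ("10:05", 0), ("10:10", 2), ("10:15", 2), ("10:20", 3)]
print(restart_events(readings))  # [('10:10', 2), ('10:20', 1)]
```

The timestamps returned are a reasonable starting point for correlating restarts with entries in the cluster log.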
It would be great if we could expose information about every resource and/or group, but since we support thousands of them this is not practical. However, we do want some visibility into how resources behave. A sensible way to aggregate information about resources is by resource type.
In the picture above, each column represents a resource type. Some of the more interesting counters are described below:
There is one special entry _Total that is an aggregation of all resource types.
Resource Controls and Resource Controls Delta tell you how many resource controls the resources of the given type are handling on this node.
Resource Failure tells you how many times the Resource Host Monitor was terminated because of a failure of a resource of this type.
If you see that RHS is getting restarted often, then looking at these counters can tell you what resource type is having issues.
Resource Type Controls and Resource Type Controls Delta tell you how many resource type controls the resource DLL of the given type is handling on this node.
The Resources Online performance counter tells you how many resources of the given type are online on this node.
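As an illustration of using the per-type view to find a misbehaving resource type, here is a small Python sketch. It assumes you have exported the Resource Failure value for each instance into a dictionary keyed by instance name; the function name and the sample numbers are made up for this example.

```python
def failing_resource_types(failures_by_type):
    """Return (resource type, failure count) pairs, worst first.

    The _Total instance is skipped because it aggregates all other
    instances, and types with no failures are left out.
    """
    per_type = {
        name: count
        for name, count in failures_by_type.items()
        if name != "_Total" and count > 0
    }
    return sorted(per_type.items(), key=lambda item: item[1], reverse=True)


# Hypothetical snapshot of the Resource Failure counter on one node.
sample = {"_Total": 7, "Physical Disk": 0, "Generic Service": 5, "IP Address": 2}
print(failing_resource_types(sample))  # [('Generic Service', 5), ('IP Address', 2)]
```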
APIs are the programmable functions of the cluster, and many of the activities in the cluster are triggered by external API calls. Out of all the clustering counters, you might end up using these the most.
The first broad category of counters is Cluster API Calls. These counters tell you the incoming API call rate to the cluster service. If you run cluadmin, cluster.exe or a PowerShell command, they all call into the API, so you will see one of the counters going up.
All the counters in this group tell you how many calls of the given type have happened during the past sampling interval. If the sampling interval is one second, they show the number of calls per second.
The second group of counters shows how many handles of the given type are open. Please note that cluster handles have nothing to do with operating system handles (kernel handles). They are internal structures that the cluster uses to keep track of which object a handle refers to and how that object was opened. These handles still consume some memory, and if something is leaking them, the cluster service might eventually run out of memory. If you are writing your own resource DLL, these counters are an excellent way to make sure that your resource does not leak any cluster handles. The Delta versions of the counters tell you the rate at which new handles are being opened.
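One simple way to use the handle counters while testing a resource DLL is to sample the open handle count during a repeated workload and check whether it keeps climbing. The following Python sketch shows one possible heuristic under that assumption; the function name, the threshold parameter and the sample values are hypothetical, and a steadily growing count is only a hint of a leak, not proof.

```python
def looks_like_leak(open_handle_samples, min_growth=1):
    """Heuristic: True if the count never decreases and grows overall.

    `open_handle_samples` are readings of an open-handle counter taken in
    time order while the same workload is repeated.
    """
    if len(open_handle_samples) < 2:
        return False
    never_decreases = all(
        later >= earlier
        for earlier, later in zip(open_handle_samples, open_handle_samples[1:])
    )
    growth = open_handle_samples[-1] - open_handle_samples[0]
    return never_decreases and growth >= min_growth


print(looks_like_leak([40, 42, 45, 51, 60]))  # True
print(looks_like_leak([40, 44, 41, 43, 40]))  # False
```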
Cluster Shared Volumes
Cluster Shared Volumes (CSV) is a new storage architecture in Windows Server 2008 R2 Failover Clustering which functions as a distributed-access file system optimized for Hyper-V Virtual Machines. We have added performance counters for this new technology.
This is the only counter set that is not exposed by the cluster service. It is exposed by our CSV kernel component. You can monitor each CSV volume separately using three categories of performance counters.
“IO *” tells you how many read and write IO operations went directly to the disk from this node. If the disk is mounted locally this is expected anyway, but if the disk is mounted on another node this tells you how much SMB traffic CSV has helped us avoid.
“Redirected *” tells you how many read and write IO operations we had to send over SMB.
“Metadata *” tells you how many metadata operations were sent to the node that owns the disk. Please note that metadata operations are always sent to the disk owner.
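A quick way to read the direct and redirected counters together is to compute what share of a volume's IO had to go over SMB. The Python sketch below assumes you have already added up the read and write operations in each category for one volume over the same interval; the function name and the numbers are illustrative only.

```python
def redirected_share(direct_ops, redirected_ops):
    """Return the fraction of IO operations that were redirected over SMB."""
    total = direct_ops + redirected_ops
    if total == 0:
        return 0.0  # no IO observed in this interval
    return redirected_ops / total


# Hypothetical totals for one CSV volume during one sampling interval.
print(redirected_share(direct_ops=900, redirected_ops=100))  # 0.1
```

A share that stays close to zero on a node that does not own the disk suggests that direct IO is working as intended.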
Our next post in the series will walk you through some practical examples of how you can use these tools to troubleshoot your cluster.
Senior Software Development Engineer
Clustering & High-Availability