Hi cluster fans,
This third post in our series about Failover Clustering Performance Counters will give some practical examples of how to use this new Windows Server 2008 R2 feature to help troubleshoot your cluster.
In Part 1 of this blog series we discussed Performance Counters and their interaction with the Network, Multicast Request Response (MRR), Global Update Manager and Database clustering components. In Part 2 we looked at monitoring some additional cluster components: the Checkpoint Manager, Resource Control Manager, Resource Types, APIs and Cluster Shared Volumes. Stay tuned for the fourth part of this series, where we will discuss implementing Performance Counters using PowerShell.
Example 1: Cluster Handle Leaks
On my cluster I’ve observed that the number of resource handles keeps going up, which can indicate a handle leak. I am sure that no external clients are connected to the node, so it must be something running on the node itself. Looking at the Cluster API counters in Performance Monitor, I see 500 resource handles.
All calls to the cluster pass through the Cluster API (ClusAPI) component, and in Windows Server 2008 R2 we have made tracing in this component available to all customers. To enable this tracing, run Event Viewer (or eventvwr from the command line). In Event Viewer, go to the View menu and check “Show Analytic and Debug Logs”.
In the tree, navigate to “Applications and Services Logs\Microsoft\Windows\FailoverClustering-Client\Diagnostic”. Right-click Diagnostic and click Enable Log. You may see a notification reminding you when the data will be collected, which can be ignored.
Now wait for the handle count to increase again, then right-click Diagnostic and select Disable Log. In the right pane you can now see a list of the collected traces. Select an event, switch to the Details view and look for the ID of the process that the event came from.
Now that we know the Process ID, we can go to Task Manager and find this process. In this case it happens to be PowerShell.exe, which was running a script that enumerated the resources from time to time.
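When the Diagnostic log holds many events, clicking through them one by one gets tedious. Here is a minimal sketch of tallying events per process, assuming you saved the log from Event Viewer as XML. The sample events and Process IDs below are made up for illustration; the namespace and the Execution element with its ProcessID attribute follow the standard Windows event schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by Windows event XML.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

def count_events_by_pid(xml_text: str) -> Counter:
    """Count events per ProcessID found in System/Execution elements."""
    root = ET.fromstring(xml_text)
    pids = Counter()
    for execution in root.iterfind(".//e:System/e:Execution", NS):
        pid = execution.get("ProcessID")
        if pid is not None:
            pids[int(pid)] += 1
    return pids

# Hypothetical export with three events from two processes.
SAMPLE = """<Events>
  <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    <System><Execution ProcessID="4812" ThreadID="100"/></System>
  </Event>
  <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    <System><Execution ProcessID="4812" ThreadID="104"/></System>
  </Event>
  <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
    <System><Execution ProcessID="996" ThreadID="20"/></System>
  </Event>
</Events>"""

counts = count_events_by_pid(SAMPLE)
for pid, n in counts.most_common():
    print(pid, n)
```

The Process ID at the top of the list is the first one to look up in Task Manager.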
If you are familiar with PowerShell you probably know that it is .NET based, and if you are familiar with .NET you probably know that it uses garbage collection to lazily collect freed memory. In this case, PowerShell was just taking a while to kick off garbage collection, but once garbage collection begins, all the opened handles no longer used by the process are collected and the number of handles on the server decreases.
In the end it was not a handle leak, but hopefully this gives you some ideas on how to approach this class of issues.
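One way to tell a steady climb from the sawtooth pattern that delayed garbage collection produces is to check whether the handle count ever drops across a series of samples. The sketch below is only an illustration: the function name, the threshold and the sample values are hypothetical, and a count that has not dropped yet may simply not have been collected yet.

```python
from typing import Sequence

def looks_like_leak(samples: Sequence[int], min_growth: int = 50) -> bool:
    """Return True when the count never drops and grows by at least min_growth."""
    if len(samples) < 2:
        return False
    never_drops = all(later >= earlier
                      for earlier, later in zip(samples, samples[1:]))
    return never_drops and (samples[-1] - samples[0]) >= min_growth

# Hypothetical handle counts sampled at regular intervals.
steady_climb = [120, 210, 300, 410, 500]   # keeps growing
gc_sawtooth = [120, 300, 500, 140, 320]    # drops after a collection

print(looks_like_leak(steady_climb))  # True
print(looks_like_leak(gc_sawtooth))   # False
```

A longer sampling window makes the check more trustworthy, since a garbage-collected client can hold handles for quite a while before releasing them.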
Things become harder if the client is running remotely. In this case you might first use NetMon (http://search.microsoft.com/Results.aspx?qsc0=0&q=netmon&mkt=en-US&FORM=QBME1&l=1) or Process Monitor (http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx) to find where traffic to this node is coming from, and then examine the clients to see if they run any apps that might cause the cluster handle leak.
Example 2: Overload of Cluster API Calls
On this particular 4-node cluster with 800 resource groups, we observed the Cluster Service (clussvc.exe) using 95% of the CPU. The puzzling part was that all the resources were offline and there were no known clients connected to this cluster, so we’re going to use performance counters to see if we can find out which component is causing the high CPU consumption.
Looking at the Cluster API calls we have observed that the cluster is getting hit with about 130 Resource API calls per second and about 70 Group API calls per second.
Looking at the Cluster Multicast Request Response (MRR) Messages we have observed that Messages Sent Delta is around 90 MRR messages per second.
Examining the Cluster Global Update Manager Messages showed that there are no GUM updates going on. So most likely all the activity is coming from API calls that hit resources not hosted on this node, which the node then forwards to the owning node using MRR.
Looking at the Cluster Network Messages confirms that there is a lot of traffic passing between the nodes (see Bytes Received Delta and Bytes Sent Delta in the picture above).
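To put the counter values side by side, here is a quick back-of-the-envelope calculation using the rates quoted above. It assumes, purely for illustration, that each MRR message corresponds to one forwarded API call; the real mapping may differ, so treat the result as a rough indication only.

```python
# Rates read from the performance counters (per second).
resource_calls_per_sec = 130
group_calls_per_sec = 70
mrr_sent_per_sec = 90

total_api_calls = resource_calls_per_sec + group_calls_per_sec

# Assumption for illustration: one MRR message per forwarded API call.
forwarded_share = mrr_sent_per_sec / total_api_calls

print(total_api_calls)           # 200
print(f"{forwarded_share:.0%}")  # 45%
```

Under that assumption, a large share of the incoming API calls lands on resources owned by other nodes, which fits the inter-node traffic seen in the network counters.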
This leaves two unanswered questions: What calls are being made, and who is the caller?
We will now run Process Monitor. In Process Monitor we set a filter to show only registry and network events for clussvc.exe. After a minute we stop collecting traces and look at the Network Summary and Registry Summary. The Network Summary shows us that there is no traffic besides the traffic between the nodes, so the caller has to be an application running on one of the nodes. The Registry Summary shows that something repeatedly opens group keys in the cluster database, so it looks like the caller is trying to enumerate all groups on the cluster.
We then started Event Viewer and collected ClusAPI logs as described in the previous example. This immediately pointed us to the process that was making most of the API calls. By stopping this process we confirmed that CPU consumption went down.
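If you prefer to work from a saved trace, the same registry summary can be approximated from a Process Monitor CSV export. The sketch below assumes the default export columns (Time of Day, Process Name, PID, Operation, Path, Result, Detail); the sample rows, key names and PIDs are made up for illustration.

```python
import csv
import io
from collections import Counter

def top_registry_parents(csv_text: str,
                         process: str = "clussvc.exe",
                         operation: str = "RegOpenKey") -> Counter:
    """Count matching registry operations, grouped by parent key."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if (row["Process Name"].lower() == process
                and row["Operation"] == operation):
            # Drop the last path component so sibling keys group together.
            parent = row["Path"].rsplit("\\", 1)[0]
            counts[parent] += 1
    return counts

# Hypothetical excerpt of an exported trace.
SAMPLE = r"""
Time of Day,Process Name,PID,Operation,Path,Result,Detail
10:00:01,clussvc.exe,1520,RegOpenKey,HKLM\Cluster\Groups\group-a,SUCCESS,
10:00:01,clussvc.exe,1520,RegOpenKey,HKLM\Cluster\Groups\group-b,SUCCESS,
10:00:02,clussvc.exe,1520,RegQueryValue,HKLM\Cluster\Groups\group-a\Name,SUCCESS,
10:00:02,clussvc.exe,1520,RegOpenKey,HKLM\Cluster\Resources\res-a,SUCCESS,
10:00:03,svchost.exe,900,RegOpenKey,HKLM\Software\Example,SUCCESS,
""".strip()

parents = top_registry_parents(SAMPLE)
for path, n in parents.most_common():
    print(n, path)
```

A parent key such as the Groups key dominating the list is the same hint the Registry Summary gave us: something is walking through all the groups.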
I hope you find this information helpful in troubleshooting issues using performance counters on your Windows Server 2008 R2 Failover Cluster.
Senior Software Development Engineer
Clustering & High-Availability