In Windows Server 2008 Failover Clustering, the team invested significant time in making clustering easier. In the Windows Server 2008 R2 release we have continued down that path, adding several troubleshooting enhancements.
One of the important aspects of troubleshooting a service outage is diligent postmortem analysis: understanding why you experienced the problem so that you can take corrective action to avoid seeing it again. A common source of problems is third-party resource dlls, which may not have had the same detailed level of testing as the in-box dlls. In previous releases we offered the ability to isolate components into separate processes, and in 2008 R2 we have built in additional isolation logic, so that if a resource dll crashes, little else is affected, offering even higher availability to your mission-critical applications.
The resource dll is a component that is provided by the application being clustered and acts as a proxy between the application and the cluster. If the cluster wants to stop or start the application, it notifies the resource dll, and the resource dll communicates this information to the application. The cluster does not load the resource dlls into the cluster service process; instead it loads them into the Resource Hosting Subsystem (RHS.exe) process, which is recyclable.
Previously, all resources ran in a single RHS process by default. This meant that if one resource crashed, the entire RHS process could fail, and all resources hosted by that RHS process would fail with it. We've improved the default behavior in 2008 R2 by separating the cluster's critical resources from the other resource dlls in RHS. The Cluster Group (including the quorum resource) and the Storage Group (including Available Storage and Cluster Shared Volumes) now all run in a single, isolated RHS process. The other resource dlls run in one or more additional RHS processes.
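As a rough illustration of the default placement described above, the following Python sketch models which RHS process a resource ends up in. It is a toy model under stated assumptions, not the cluster service's actual logic: the function name `assign_hosts`, the host labels, and the tuple layout are all hypothetical, and the list of core groups is taken only from the description in this post.

```python
from collections import defaultdict

# Groups that this post says share one isolated RHS process in 2008 R2.
CORE_GROUPS = {"Cluster Group", "Available Storage", "Cluster Shared Volumes"}


def assign_hosts(resources):
    """Map hypothetical RHS host labels to the resource names they would host.

    `resources` is a list of (name, group, separate_monitor) tuples, where
    `separate_monitor` stands in for the "run in a separate monitor" setting.
    """
    hosts = defaultdict(list)
    for name, group, separate_monitor in resources:
        if group in CORE_GROUPS:
            # Critical cluster and storage resources share one isolated process.
            hosts["rhs-core"].append(name)
        elif separate_monitor:
            # A resource marked for isolation gets a process of its own.
            hosts["rhs-isolated:" + name].append(name)
        else:
            # Everything else shares a general-purpose process.
            hosts["rhs-shared"].append(name)
    return dict(hosts)
```

In this model a crash in a resource hosted by `rhs-shared` would take down only the other resources in that process, while the quorum and storage resources in `rhs-core` keep running.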
There are two common reasons for seeing instability in a resource dll:
1. The resource dll itself may crash. In most cases this is caused by an access violation in the resource dll. In previous releases, RHS handled this event by reporting it to the Resource Control Manager (RCM) and exiting. RCM is a component inside the cluster service which, upon receiving notification that a resource caused a crash, would mark that resource as “run in the separate monitor”. This offered higher availability to the cluster because the resource would then be loaded in its own RHS process, so if it crashed again only that resource would be affected. In R2 we have enhanced this behavior: in addition to reporting the failure and isolating the resource, we now report the access violation by generating a Windows Error Report (WER). WER collects a dump file, creates a problem report, and handles the report according to the policy applied on that computer, which raises awareness of the issue with the system administrator or Microsoft.
2. The resource dll might take too long to perform a requested action; in some cases it might even deadlock. There is no effective way to tell whether a call is just taking a long time or has deadlocked. One way to address this is to limit the amount of time we wait for the resource to complete a request; if it does not complete in that time, we assume that the component handling the call is not in a healthy state. Some activities, such as online and offline, can take some time, so you may see the ‘pending’ state in the UI. If online is taking a long time, the resource might spawn a worker thread and tell RHS that the online call is pending; the worker thread can then notify RHS if it requires more time. Once the resource comes online it notifies RHS. Offline is handled in a similar way. All other activities are simply limited by time and have to complete before RHS decides that they have timed out. In previous releases, when RHS decided that an activity had timed out, it would notify RCM and terminate the process. RCM would then isolate the resource in a separate RHS process. In R2 we refined this logic to take into account whether the event recurs. Many deadlocks are one-time events caused by a race condition, so it may not be appropriate to isolate a resource because of a single occurrence, since having too many individual RHS processes can cause a slight performance impact. As with a crash, a timeout generates a WER report, which is forwarded to the appropriate destination.
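The two cases above can be sketched in Python as a time-limited call plus a small isolation policy. This is only a minimal sketch, not how RHS or RCM are implemented: the names `call_with_deadline` and `IsolationPolicy` are hypothetical, the repeat threshold of 2 is an assumed value (this post does not state how many timeouts trigger isolation), the WER report is modelled as a plain list entry, and the ‘pending’ extension for online and offline is not modelled.

```python
import threading


def call_with_deadline(fn, timeout_s):
    """Run fn on a worker thread; return True if it finished within timeout_s.

    A False result cannot distinguish "slow" from "deadlocked", which is the
    same limitation described in the text. A call that raises never signals
    completion, so it is also reported as False.
    """
    finished = threading.Event()

    def worker():
        fn()
        finished.set()

    threading.Thread(target=worker, daemon=True).start()
    return finished.wait(timeout_s)


class IsolationPolicy:
    """Decides whether a resource should move to its own host process."""

    def __init__(self, timeout_threshold=2):
        self.timeout_threshold = timeout_threshold  # assumed, not from the post
        self.timeouts = {}   # resource name -> number of timeouts seen
        self.reports = []    # stand-in for generated problem reports

    def on_crash(self, resource):
        # A crash is always reported and always isolates the resource.
        self.reports.append(("crash", resource))
        return True

    def on_timeout(self, resource, call):
        # A timeout is always reported, but isolates only when it repeats.
        self.reports.append(("timeout", resource, call))
        count = self.timeouts.get(resource, 0) + 1
        self.timeouts[resource] = count
        return count >= self.timeout_threshold
```

With these assumptions, a first timeout on a resource produces a report but leaves it in the shared process, while a second timeout, or any crash, returns `True` to indicate that the resource should be isolated.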
In Windows Server 2008 R2 the reports can be found under Control Panel > System and Security > Action Center > Problem Reports. All RHS issues will be in the category “Failover Cluster Resource Host Subsystem”. The image below shows two issues. The first shows an access violation, which is sent to WER as a problem report. Opening the dump file in a debugger would provide more details about which resource caused the problem.
The second item is generated when RHS has detected a call is taking too long. In this case, RHS explicitly calls WER to generate a problem report and provides additional information that allows the user to see details about which resource and call caused the issue without looking into the dump file. This example shows the case when the ONLINERESOURCE call to the resource “r1” of the type “FlexRes” took too long.
You can learn more about the benefits and configuration of Windows Error Reporting from the following resources:
Senior Software Development Engineer
Clustering & High-Availability