Registry check-pointing on a Windows Cluster

Over the last few years that I have been in Microsoft, I have come across multiple issues where database administrators have mistakenly made inappropriate changes to the clustered SQL Server registry hives. Later, when SQL Server restarts (either due to manual intervention or during a failover), SQL Server can no longer come online on the cluster. Additionally, the registry changes cannot be reverted back; and SQL Server remains in the failed state despite all efforts to bring it online.

This phenomenon happens due to a process known as "Registry Check-pointing" which happens automatically on a cluster. This is true not only for the SQL Server Resource, but for all clustered resources. Lets first try and understand the "Registry Check-pointing" process, as this will help us understand how to fix a scenario such as this.

To understand the "Registry Check-pointing" process, lets take up a cluster. My clustered environment has 2 instances of SQL Server installed; however, I would not make any changes to these instances. What I would do, for demonstration purpose, is to create a new resource and play with that.

Lets go ahead and create a new cluster group, and a new resource in the group. We will not bring the resource online just yet. Also, for this demonstration, the resource type is not important.

NewGroup

Now, in the registry, we will create a sample registry hive, that we will later associate to this resource.

NewRegistry

Now, to associate the Registry hive we just created to the new resource, we go to HKLM\Cluster\Resources. We navigate through the GUIDs to locate the GUID that corresponds to our Resource. Under the GUID, we create a new Key and name it RegSync. Under RegSync, we create a new String Value, name it 00000001, and enter Software\MyTestResource as its value. This will actually point to the Registry Hive HKLM\Software\MyTestResource. Refer snapshot below:

RegistryAssociation

Now, let us bring the group MyTestGroup online. This will bring the resource MyTestResource also online. Now, we will edit the registry entry HKLM\Software\MyTestResource\ChildKey and modify the value of Param_2. Within a few seconds, we can see a folder being created in the Quorum_Drive\MSCS folder, having its name same as the GUID. This will contain a file 00000001.CPT. We will now, revert back the value of Param_2. At this point, we have the following:

Quorum1

Now, we will run the following tests:

  1. While the resource MyTestResource is online, change the value of Param_2 from "This is parameter 2" to "This is parameter 2 modified". Now, take the resource offline and bring it back online. Refresh the registry and you will find the value of Param_2 to remain "This is parameter 2 modified".
  2. Take MyTestResource offline, and modify the value of Param_2 from "This is parameter 2 modified" to "This is parameter 2 modified again". Bring the resource MyTestResource online. Refresh the registry and you will find the value of Param_2 revert back to "This is parameter 2 modified". So, essentially the modification that we had done while the resource was offline is lost.

So, what's happening in the background?

Here's what – The registry hives that are mentioned under the RegSync key of any clustered resource go in for automatic check-pointing when the resource is online. This means that whenever a change is made to any entry in this hive, when the corresponding resource is online, the value is automatically replicated to the other nodes of the cluster. Also, the entire registry hive is exported to a folder in the Quorum Drive. The folder name corresponds to the GUID for the resource in the Registry and the filename corresponds to the Hive Number under RegSync.

Additionally, when the resource goes offline, the check-point process is fired and the same activity is completed once again.

Now, when the resource tries to come online, here is what it does:

  1. The Cluster Service checks the registry entries on the Node where the resource tries to come online. It matches the registry entries with those on the other nodes. If there is a mismatch, the registry hive is replaced from the CPT file in the Quorum drive.
  2. If a mismatch is not detected, then the registry entries are matched with those in the CPT file. If there is a mismatch, again, the registry hive is replaced from the CPT file.

Hence, we can clearly understand that any changes made to any of the check-pointed registry hives will be automatically over-written by the contents in the CPT file.

How do we recover in case such a mistake has been done?

It is not impossible to recover the resource back from this scenario. However, care must be taken while following the steps that I will mention below; so that even by mistake, we do not commit any mistakes. Please be extra cautious, as any mistake here can lead to a complete reinstall / reconfiguration.

  1. Make sure that the resource is in the offline / failed state. In many situations, it is not uncommon for the resource to try to come online, and might be seen failing over and back between the nodes. If such a scenario occurs, right-click on the resource, and choose Properties. In the properties dialog box, choose the Advanced tab and then select Do not restart.

    Make sure that once all the steps are completed and the resource is back online, you reset the option back to Restart.

    DoNotRestart
     

  2. Now, edit the registry directly, and make sure you correct the mistakes that you had made. You will need to make the changes on ALL the nodes. Also, make sure that the information on all the nodes is exactly the same. This is a very important step and can cause issues if not done very carefully. Do NOT attempt to bring the resource just yet.

  3. Now, in the registry, navigate to HKLM\Cluster\Resources and locate the GUID that corresponds to the failed resource. Look under RegSync and locate the Hive Number that corresponds to the registry hive that you have just corrected. 
    Here's an example: Lets assume that I had corrected the value of SQLArg2 under
    HKLM\Software\Microsoft\Microsoft SQL Server\MSSQL.1\MSSQLServer\Parameters.
     
    CurrentRegistry 
     
    As can be seen from the screenshot above, the GUID for the resource is
    a6fdaf0e-b064-4f8e-aa90-fd3c711a11ee, and the Hive Number is 00000004.

  4. Now, in the Quorum Drive, navigate to the MSCS folder. Now, locate the folder that has a name the same as the GUID we found in the earlier step.
    In my example, I will navigate to Q:\MSCS\a6fdaf0e-b064-4f8e-aa90-fd3c711a11ee.
    In this folder, locate and rename the HiveNumber.CPT file to HiveNumber.CPT.OLD. Please rename the file instead of deleting it, so that in case of issues, we can revert back.
    In my example, I will rename 00000004.CPT to 00000004.CPT.OLD.

  5. Now, attempt to bring the resource online. If you have been able to complete the steps mentioned above carefully, and you have not made any mistakes, the resource should come online fine. If the resource comes online, a new HiveNumber.CPT file will be created, and now you can safely delete the HiveNumber.CPT.OLD file.

  6. However, if the resource does not come online, review the changes that you have made and try to figure out any mistakes. If none can be found, review the Windows Event Logs for further troubleshooting.

 

I hope this post will be useful to my readers. Please post your comments in case you find this post helpful.

Disclaimer: All information provided here is my personal opinion and is neither verified nor approved by Microsoft before it is published. All information, and code samples, if any, is provided "AS IS" with no warranties and confers no rights.