SQL SERVER 2005 RESOURCE FAILS TO COME ONLINE ON CLUSTER AFTER CHANGING THE SAN DRIVE TO ADD MORE DISK SPACE

We have observed a number of issue now where in clustered instance of sql server 2005 fails to come online after performing changing on shared disk on a cluster.

As mentioned in my previous posts if the sql server fails to come online we need to start troubleshooting by looking at the sql server error log. So we checked the sql server error log and observed the following error

2009-01-11 21:10:06.96 Server Error: 26054, Severity: 16, State: 1.
2009-01-11 21:10:06.96 Server Could not find any IP address that this SQL Server instance depends upon. Make sure that the cluster service is running, that the dependency relationship between SQL Server and Network Name resources is correct, and that the IP addresses on which this SQL Server instance depends are available. Error code: 0x103.
2009-01-11 21:10:06.96 Server Error: 17182, Severity: 16, State: 1.
2009-01-11 21:10:06.96 Server TDSSNIClient initialization failed with error 0x103, status code 0xa.
2009-01-11 21:10:06.96 Server Error: 17182, Severity: 16, State: 1.
2009-01-11 21:10:06.96 Server TDSSNIClient initialization failed with error 0x103, status code 0x1.
2009-01-11 21:10:06.96 Server Error: 17826, Severity: 18, State: 3.
2009-01-11 21:10:06.96 Server Could not start the network library because of an internal error in the network library. To determine the cause, review the errors immediately preceding this one in the error log.
2009-01-11 21:10:06.96 Server Error: 17120, Severity: 16, State: 1.
2009-01-11 21:10:06.96 Server SQL Server could not spawn FRunCM thread. Check the SQL Server error log and the Windows event logs for information about possible related problems.

As seen from the above Error 26054 message which says that cluster service is unable to find the dependency between the sql server instance and the sql network resource name.

We checked the Cluster Administrator for the sql server resource and found that the dependency exists between the sql server resource and sql network name but still we receive the error.

Cause
=====

We checked the Registry Location HKLM\Cluster\Resources\<SQL Server GUID>\Parameters

The values InstanceName and VirtualServerName were missing from the keys below:

HKLM\Cluster\<SQL Server GUID>\Parameters
HKLM\Cluster\<SQL Agent GUID>\Parameters

The above registry keys are used by the cluster administrator to connect and start the sql server service. So the Cluster Administrator was not able to connect and start the sql server resource and hence we receive the above error

Resolution
========

IMPORTANT : This resolution contains information about modifying the registry. Before you modify the registry, make sure to back it up and make sure that you understand how to restore the registry if a problem occurs. For information about how to back up, restore, and edit the registry, click the following article number to view the article in the Microsoft Knowledge Base:

256986 Description of the Microsoft Windows Registry

We added the following registry values in the Parameters registry key:

For a named instance of SQL Server:

• InstanceName
Value Name: InstanceName
Value Type: REG_SZ
Value Data: <instance name of the named instance>

• VirtualServerName
Value Name: VirtualServerName
Value Type: REG_SZ
Value Data: <Virtual server name of the instance>

Once we added the registry keys we were able to bring the sql server online on Node1. 

We failed the instance over to Node2 we checked the above registry keys and found they were missing, so we added missing keys and rebooted both the servers of the clusters.

After rebooting the server the sql server resources came online fine without any errors

 

Note:

This post should not be treated as the Microsoft’s Recommendation or Resolution to the problem, it is only a workaround which worked in our environment and so we would like to share it.

 

Parikshit Savjani,
SE, Microsoft SQL Server