An invalid SQL Server network configuration that cannot be corrected causes a clustered SQL Server to fail to start “permanently”

Recently, a customer reported an issue on a clustered SQL Server. Here is what happens.

 

The customer needs to restrict the IP addresses that SQL Server 2005 listens on. He uses the following procedure: open “SQL Server Configuration Manager” (SSCM) -> “SQL Server 2005 Network Configuration” -> “Protocols for XX”, right-click TCP/IP, choose Properties, and set “Listen All” to “No”. After that, he intends to configure the individual IP addresses. This works fine on a non-clustered system. On a clustered system, however, SQL Server no longer comes online when he restarts it, and the SQL Server error log reports the following.

SQL Server 2005 errorlog

2005-11-28 10:53:29.02 Server Error: 17182, Severity: 16, State: 1.

2005-11-28 10:53:29.02 Server TDSSNIClient initialization failed with error 0x32, status code 0x1c.

2005-11-28 10:53:29.02 Server Error: 17182, Severity: 16, State: 1.

2005-11-28 10:53:29.02 Server TDSSNIClient initialization failed with error 0x32, status code 0x1.

2005-11-28 10:53:29.03 Server Error: 17826, Severity: 18, State: 3.

2005-11-28 10:53:29.03 Server Could not start the network library because of an internal error in the network library. To determine the cause, review the errors immediately preceding this one in the error log.

2005-11-28 10:53:29.03 Server Error: 17120, Severity: 16, State: 1.

2005-11-28 10:53:29.03 Server SQL Server could not spawn FRunCM thread. Check the SQL Server error log and the Windows event logs for information about possible related problems.

Configuring the server to listen on individual IP addresses is not supported on a clustered SQL Server, so the failure indicated by the error log is accurate. However, the customer can NOT simply set “Listen All” back to “Yes” to bring the server online. To be more precise, even though the customer can set “Listen All” to “Yes” using SSCM on each physical node, each time he tries to bring SQL Server online, the value is overwritten with “No” and the clustered SQL Server fails again. This is a serious issue that frustrates our customers.
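To see why the log is consistent with an unsupported configuration, the error field of the TDSSNIClient line can be decoded. The following is a minimal sketch, assuming the error field is a Windows system error code; 0x32 is decimal 50, which Windows defines as ERROR_NOT_SUPPORTED. The function name `decode_sni_error` and the one-entry lookup table are hypothetical helpers for illustration only, and the status code is extracted by the pattern but not interpreted.

```python
import re

# Deliberately tiny table: Windows system error 50 (0x32) is ERROR_NOT_SUPPORTED.
WIN32_ERRORS = {50: "ERROR_NOT_SUPPORTED (The request is not supported.)"}

# Line copied from the error log above.
LOG_LINE = ("2005-11-28 10:53:29.02 Server TDSSNIClient initialization "
            "failed with error 0x32, status code 0x1c.")

PATTERN = re.compile(
    r"failed with error (0x[0-9a-fA-F]+), status code (0x[0-9a-fA-F]+)"
)

def decode_sni_error(line):
    """Return (error_number, description) for a TDSSNIClient line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)  # "0x32" -> 50
    return code, WIN32_ERRORS.get(code, "unknown (not in this sketch's table)")

print(decode_sni_error(LOG_LINE))
```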

The root cause of this issue is the behavior of the cluster checkpoint service. If a setting is changed while the resource is online, that change is checkpointed to the CPT hive file on the cluster quorum disk. If the resource is offline when you change the parameter, the change is never checkpointed. Each time you bring the resource online, the checkpointed value overwrites the local value. The SQL Server network configuration is one of the settings that is checkpointed. So, if you take the SQL Server resource offline, change “Listen All” from “No” to “Yes”, and then try to bring the resource back online, it fails because the "local" change is overwritten (during resource startup) with what was persisted in the checkpoint file.

Because of this checkpointing behavior, any time the SQL Server network configuration is changed to invalid values while the server is online, restarting the clustered SQL Server causes the server to fail to start “permanently”.
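The overwrite-on-startup behavior can be illustrated with a toy model. This is only a sketch under stated assumptions, not how the cluster service is implemented: the class `ToyClusteredResource` and all of its methods are hypothetical, a single string stands in for the whole registry key, and the model assumes that re-adding a checkpoint snapshots the current local value.

```python
class ToyClusteredResource:
    """Toy model: one node's local value plus the value in the checkpoint file."""

    def __init__(self, listen_all="Yes"):
        self.local = listen_all        # value in the node's local registry
        self.checkpoint = listen_all   # value persisted on the quorum disk
        self.checkpoint_enabled = True
        self.online = True

    def set_listen_all(self, value):
        self.local = value
        # Only changes made while the resource is online are checkpointed.
        if self.online and self.checkpoint_enabled:
            self.checkpoint = value

    def take_offline(self):
        self.online = False

    def remove_checkpoint(self):
        self.checkpoint_enabled = False

    def add_checkpoint(self):
        # Assumption of this model: re-adding snapshots the current local value.
        self.checkpoint_enabled = True
        self.checkpoint = self.local

    def bring_online(self):
        # On startup the checkpointed value overwrites the local value.
        if self.checkpoint_enabled:
            self.local = self.checkpoint
        # A clustered SQL Server only starts when "Listen All" is "Yes".
        self.online = self.local == "Yes"
        return self.online


res = ToyClusteredResource()
res.set_listen_all("No")       # invalid change made while online: checkpointed
res.take_offline()
res.set_listen_all("Yes")      # fix made while offline: local only
first = res.bring_online()     # False, the checkpoint restores "No"

res.remove_checkpoint()        # take the checkpoint out of the picture
res.set_listen_all("Yes")
second = res.bring_online()    # True, the local fix survives startup
res.add_checkpoint()           # checkpoint now holds the corrected value
res.take_offline()
third = res.bring_online()     # True, later restarts keep working
print(first, second, third)
```

The first attempt fails because the corrected local value never reaches the checkpoint; once the checkpoint is removed, the local fix is no longer overwritten.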

To get out of this bad state, one workaround is to temporarily disable checkpointing for the SQL Server network configuration, as follows.

1. While the SQL Server instance is in the offline/failed state, disable cluster checkpointing for the network configuration by running:

      cluster res "SQL Server" /removecheck:"Software\Microsoft\Microsoft SQL Server\MSSQL.XXX\MSSQLSERVER"

2. Correct the configuration by using SSCM. Verify that the key is corrected on both nodes.

3. Bring the SQL Server cluster back online.

4. Re-enable cluster checkpointing for the network configuration by running:

      cluster res "SQL Server" /addcheck:"Software\Microsoft\Microsoft SQL Server\MSSQL.XXX\MSSQLSERVER"

Note that for a named instance, the resource display name "SQL Server" should be replaced with "SQL Server (<instance name>)".
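To avoid quoting mistakes when adapting the two commands to a named instance, the command lines can be assembled programmatically. This is a small sketch that only builds the strings and does not run anything; the helper names `resource_name` and `checkpoint_commands` and the instance name `INST1` are hypothetical. The registry key is written with backslash separators on the assumption that it is an ordinary registry path, and `MSSQL.XXX` is left as the placeholder used in this post, so substitute the value for your instance.

```python
def resource_name(instance_name=None):
    """Cluster resource display name for a default or named instance."""
    if instance_name:
        return f"SQL Server ({instance_name})"
    return "SQL Server"


def checkpoint_commands(registry_key, instance_name=None):
    """Build the /removecheck and /addcheck command lines from the steps above."""
    name = resource_name(instance_name)
    return {
        "remove": f'cluster res "{name}" /removecheck:"{registry_key}"',
        "add": f'cluster res "{name}" /addcheck:"{registry_key}"',
    }


# Placeholder key from this post; MSSQL.XXX must be replaced for a real instance.
KEY = r"Software\Microsoft\Microsoft SQL Server\MSSQL.XXX\MSSQLSERVER"

for label, command in checkpoint_commands(KEY, instance_name="INST1").items():
    print(label, "->", command)
```

Run the "remove" command before correcting the configuration in SSCM, and the "add" command only after the resource is back online, in the order of the numbered steps.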

 

If this workaround does not resolve the issue described above in your case, please let us know.

 

Did you know that you can post questions about SQL Server data access and connectivity issues at https://forums.microsoft.com/MSDN/ShowForum.aspx?ForumID=87&SiteID=1?

 

Nan Tu, Software Design Engineer, SQL Protocols

Disclaimer: This posting is provided "AS IS" with no warranties, and confers no rights.