“OnlineThread: SQL Cluster shared data upgrade failed” error (ID 19019) installing Cumulative Update in SQL Server cluster

Last week I was working with a customer on this obscure SQL Server error right after installing Cumulative Update 9 (CU9) on a SQL Server 2008 two-node cluster. As you will see, the root-cause of the problem was easy to understand, but getting to the point of finding this root-cause was not.

In this case the customer was running a virtual instance of SQL Server 2008 SP1; this instance was running with no errors and moving back and forth the clustered resources was not causing any problem. The customer was installing CU9 to avoid a database issue that was affecting the performance of a CRM system. Following best practices in SQL Server 2008/R2, she was installing the CU in the passive node first, in order to minimize the amount of downtime in the production CRM system. The problem was that trying to start the SQL Server service on the updated cluster node was failing immediately.

We looked into the Windows Event log and Cluster log but the information from these two sources did not provide any clue:

Application Event log:

10/09/2010 08:08:15 PM Error NODE1 19019 MSSQL$SQL2008 Failover N/A [sqsrvres] OnlineThread: SQL Cluster shared data upgrade failed (status 0, Worker retval = 3)
10/09/2010 08:08:11 PM Error NODE1 19019 MSSQL$SQL2008 Failover N/A [sqsrvres] OnlineThread: SQL Cluster shared data upgrade failed (status 0, Worker retval = 3)
10/09/2010 08:08:10 PM Error NODE1 19019 MSSQL$SQL2008 Failover N/A [sqsrvres] OnlineThread: SQL Cluster shared data upgrade failed (status 0, Worker retval = 3)
10/09/2010 07:56:26 PM Error NODE1 19019 MSSQL$SQL2008 Failover N/A [sqsrvres] OnlineThread: SQL Cluster shared data upgrade failed (status 0, Worker retval = 3)
10/09/2010 07:55:26 PM Error NODE1 19019 MSSQL$SQL2008 Failover N/A [sqsrvres] OnlineThread: SQL Cluster shared data upgrade failed (status 0, Worker retval = 3)

Cluster log:

00000d78.000010dc::2010/09/14-16:50:34.060 ERR [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] OnlineThread: SQL Cluster shared data upgrade failed (status 0, Worker retval = 3)
00000d78.000010dc::2010/09/14-16:50:34.060 ERR [RHS] Online for resource SQL Server (SQL2008) failed.
000009dc.000011cc::2010/09/14-16:50:34.060 INFO [RCM] HandleMonitorReply: ONLINERESOURCE for 'SQL Server (SQL2008)', gen(0) result 5018.
000009dc.000011cc::2010/09/14-16:50:34.060 INFO [RCM] TransitionToState(SQL Server (SQL2008)) OnlinePending-->ProcessingFailure.
000009dc.00001168::2010/09/14-16:50:34.060 ERR [RCM] rcm::RcmResource::HandleFailure: (SQL Server (SQL2008))
000009dc.00001168::2010/09/14-16:50:34.060 INFO [RCM] resource SQL Server (SQL2008): failure count: 1, restartAction: 2.

The error was preventing SQL Server to create and write any ERRORLOG file so we couldn’t rely in this log either. In these kind of situations Sysinternals Process Monitor (Procmon) is usually a good option so we run this utility while reproducing the service start failure. After capturing Procmon, we filtered the trace by the rhs.exe process which is the Resource Health Check process responsible of loading SQSRVRES.DLL running the IsAlive check. Through this test we could see this DLL was detecting that the SQL Server component required an updated because the PatchLevel and SharedDataPatchLevel on Windows Registry were different. The error was taking place during this update:

image

So Procmon was useful telling us what was failing but not telling us why it was failing. Working with my colleague João Loureiro I learnt that by modifying the following registry key to “1” (one), the Windows cluster.log file could provide a more verbose output:

HKLM\Cluster\Resources\<ID for SQL Server Instance>\Parameters\VerboseLogging

After modifying this key, we failed the updated node again and check one more time to the cluster.log file, this time the information in the file was the following:

000014c8.00001eb0::2010/10/06-16:34:54.991 INFO [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] OnlineThread: enter; VirtualServerName=NETNAME; InstanceName=SQL2008
000014c8.00001eb0::2010/10/06-16:34:54.991 INFO [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] OnlineThread: ServiceName=MSSQL$SQL2008
000014c8.00001eb0::2010/10/06-16:34:55.022 INFO [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] ClusterResourceControl, found the network name: NETNAME.
000014c8.00001eb0::2010/10/06-16:34:55.022 INFO [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] OnlineThread: ServerName=NETNAME\SQL2008
000014c8.00001eb0::2010/10/06-16:34:55.054 INFO [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Calling SQLClusterResourceWorker::WaitForCompletion (200)
000014c8.000017f0::2010/10/06-16:34:55.054 INFO [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Worker thread starting ...
000014c8.000017f0::2010/10/06-16:34:55.054 INFO [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Entering SQLClusterSharedDataUpgradeWorker thread.
000014c8.000017f0::2010/10/06-16:34:55.054 INFO [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): SqlDataRoot = 'R:\MSSQL10.SQL2008\MSSQL', CurrentPatchLevel = '10.1.2789.0', SharedDataPatchLevel = '10.1.2531.0'
000014c8.000017f0::2010/10/06-16:34:55.054 INFO [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): FTDataPath = 'R:\MSSQL10.SQL2008\MSSQL\FTData', SQLGroup = 'S-1-5-80-2586167408-2258694634-1203600018-1002233561-722790735'
000014c8.000017f0::2010/10/06-16:34:55.054 INFO [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Entering DoSQLDataRootApplyACL (R:\MSSQL10.SQL2008\MSSQL\Data).
000014c8.000017f0::2010/10/06-16:34:55.054 WARN [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): DoSQLDataRootApplyACL : Failed to create directory tree at SQLDataRoot.
000014c8.000017f0::2010/10/06-16:34:55.069 INFO [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Exiting DoSQLDataRootApplyACL (3).
000014c8.000017f0::2010/10/06-16:34:55.069 INFO [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Exiting SQLClusterSharedDataUpgradeWorker thread (3).
000014c8.000017f0::2010/10/06-16:34:55.069 INFO [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Worker thread exiting (retval = 3) ...
000014c8.00001eb0::2010/10/06-16:34:55.069 INFO [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] Worker Thread (14FB7D0): Calling SQLClusterResourceWorker::WaitForCompletion (4294967295)
000014c8.00001eb0::2010/10/06-16:34:55.069 ERR [RES] SQL Server <SQL Server (SQL2008)>: [sqsrvres] OnlineThread: SQL Cluster shared data upgrade failed (status 0, Worker retval = 3)
000014c8.00001eb0::2010/10/06-16:34:55.069 ERR [RHS] Online for resource SQL Server (SQL2008) failed.

Here we could see much more detailed information. After reviewing this log with the help of the customer we did find that the R: drive was indeed non-existent on the cluster. Only then we were able to understand why the SQLClusterSharedDataUpgradeWorker thread was failing. Checking the following registry key:

HKLM\Software\Microsoft\Microsoft SQL Server\MSSQL10.SQL2008\Setup\SQLDataRoot

we found drive R: listed instead of the actual SQL Server Data root drive. From this point fixing the problem was a matter of changing this registry key one cluster node at a time right after installing the required CU. It is clear than the problem was not related to this CU9 specifically, using any other CU or SP would have probably lead to a similar situation.

Neither the customer nor us were able to understand how this path could have been changed. We can only imagine some ALTER DATABASE statement was run at certain point, leaving a wrong registry entry for the SQLDataRoot key.