Failed to bring your Mission Critical SQL Server Always On Failover Cluster Online after Patching!?

If you have successfully patched SQL Server (Service Pack or CU) but the SQL Server resource group can't be brought online, I recommend opening a PowerShell console and generating the cluster log (simply run Get-ClusterLog). This creates the file in <Windows Directory>\Cluster\Reports\Cluster.Log
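Once the log has been generated, a small script can pull out the error lines so you don't have to scroll through the whole file. The following is only a minimal sketch in Python. The line layout it expects (process and thread IDs, `::`, a timestamp, then a level such as ERR) and the file encoding are assumptions, and the names `find_errors` and `scan_file` are hypothetical helpers of mine, so adjust them to what your Cluster.Log actually looks like.

```python
import re
from pathlib import Path

# Assumed line shape: "<pid>.<tid>::<timestamp> <LEVEL> <message>".
# Adjust the pattern if your Cluster.Log is laid out differently.
LINE_RE = re.compile(
    r"^\S+::(?P<ts>\S+)\s+(?P<level>INFO|WARN|ERR|DBG)\s+(?P<msg>.*)$"
)


def find_errors(lines, keyword=None):
    """Return (timestamp, message) pairs for ERR lines.

    If keyword is given, keep only messages containing it (case-insensitive).
    """
    hits = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\r\n"))
        if not match or match.group("level") != "ERR":
            continue
        message = match.group("msg")
        if keyword and keyword.lower() not in message.lower():
            continue
        hits.append((match.group("ts"), message))
    return hits


def scan_file(path, keyword=None, encoding="utf-8"):
    """Scan a log file on disk; the encoding is an assumption, change if needed."""
    text = Path(path).read_text(encoding=encoding, errors="replace")
    return find_errors(text.splitlines(), keyword)
```

Calling `scan_file(path_to_log, keyword="REPL")` would list only the error lines that mention replication, together with their timestamps, which is handy later when narrowing down the time window in other tools.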

In the meantime, I have also raised a Connect entry at https://connect.microsoft.com/SQLServer/feedback/details/2633114 .
This type of validation could easily be built into the product and would increase the reliability of rolling upgrades.

Back to the root cause analysis. I relied on one of my favorite tools for seeing what registry calls (among other things…) are being made in the OS: Process Monitor (Sysinternals tools), aka ProcMon. I started by analyzing the events from RHS.exe from the time the resource couldn't come online. The tricky part is that the call that reveals the root cause does not necessarily have a result other than SUCCESS, which means we need to invest 5 minutes of our time and use our hawk eyes to find the issue around the time the resource failed to come online!

Below is the error output from Cluster.Log. The failure string strongly suggests that the problem is related to replication (REPL): DoREPLSharedDataUpgrade.

[Image 1: error output from Cluster.Log]

Next action:

Start ProcMon, reproduce the issue (bring the cluster resource online), stop ProcMon, copy the file (*.PML) to another box, and start analyzing it by filtering on the process RHS.EXE to cut out the noise from all the other activity on the server.
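If you prefer to do part of this filtering outside the ProcMon GUI, the trace can also be saved as CSV (File > Save, CSV format) and searched with a script. The sketch below is only an illustration. It assumes the default ProcMon column names ("Process Name", "Path", "Result", "Detail") and a UTF-8 export with a byte order mark, and the function names are hypothetical. Because the interesting call may well have returned SUCCESS, it does not filter on the result. It only narrows the rows to RHS.exe and lets you search both the Path and Detail columns for a string.

```python
import csv


def filter_events(rows, process="RHS.exe"):
    """Keep only the rows that belong to the given process (case-insensitive)."""
    wanted = process.lower()
    return [r for r in rows if (r.get("Process Name") or "").lower() == wanted]


def mentioning(rows, needle):
    """Keep rows whose Path or Detail column contains the search string."""
    needle = needle.lower()
    return [
        r for r in rows
        if needle in (r.get("Path") or "").lower()
        or needle in (r.get("Detail") or "").lower()
    ]


def load_events(path, process="RHS.exe"):
    """Read a ProcMon CSV export; utf-8-sig is an assumed encoding."""
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return filter_events(csv.DictReader(handle), process)
```

For example, `mentioning(load_events(path_to_csv), "repl")` would list every RHS.exe event that touches something replication related, whatever its result was. Searching the Detail column matters because registry reads show the value that was returned there, not in Path.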

Hmm… The repldata folder is pointing to a path that no longer exists! Fix this so that it points to the proper directory, and bring the cluster resource back online. Hopefully there are no further issues and the instance is successfully patched: online, up and running!

Regards,
Paulo Condeça.