Today a lab setup of SQL 2012 SP1 + CU7 (I was using the /UpdateSource flag to have a patched setup + roll in the updated bits in one go) failed sometime before creating the SQL resource in cluster. The error message from the detail.txt file showed:
Could not add cluster access for SID ‘S-1-5-80-2358576813-3362504073-1364548504-537691503-200193738’. Error: There was a failure to call cluster code from a provider. Exception message: The specified path is invalid.
The GUID (S-1-5-80-2358576813-3362504073-1364548504-537691503-200193738) in this case happened to be the SQL Service SID. How did I know that? You can also do it, if you use the SC SHOWSID command:
C:\Windows\system32>sc showsid mssql$inst1
SERVICE SID: S-1-5-80-2358576813-3362504073-1364548504-537691503-200193738
This issue occurs when the setup routine fails to add the service account’s SID to the list of accounts allowed to access the Windows cluster itself. For example, in my test cluster (in a healthy state, of course) you can see the Service SID is added to the list of authorized accounts in the security descriptor of the Windows cluster itself:
But of course in my failed setup attempt, the SID was not added correctly. The big question is why?
(Another side question some of you may have is why should the service account have this permission on the cluster itself. Well, that’s another topic and another blog post – stay tuned for that!)
Digging deep: Cluster Log
Carrying on from the previous section, one helpful hint is to know that when such changes are made to the cluster’s ACL configuration, the changes are persisted to the cluster database, a copy of which is also persisted to the quorum disk resource (if one was configured.)
Now in my case, deeper troubleshooting was required to find out why the security descriptor could not be written. To find out, I dumped the cluster log (using the PowerShell cmdlet Get-ClusterLog) and found the following messages at the same time that the SQL setup failed:
00001004.00000fe0::2014/01/23-07:20:05.486 INFO [RES] Physical Disk <RealWitness>: Path W:\Cluster\ is not on the disk
00001004.00000fe0::2014/01/23-07:20:05.486 ERR [RHS] Error 161 from ResourceControl for resource RealWitness.
00000fb8.000011fc::2014/01/23-07:20:05.486 WARN [RCM] ResourceControl(STORAGE_IS_PATH_VALID) to RealWitness returned 161.
00000fb8.000011fc::2014/01/23-07:20:05.585 ERR [RCM] rcm::RcmApi::SetQuorumResource: ERROR_BAD_PATHNAME(161)' because of 'ValidateQuorumPath( pRes, quorumPath )'
W:\Cluster did not make sense initially, because this drive was supposed to be unused by this instance of SQL. I had to think of sequence of operations I had done during this setup…
Root cause found
Looking back, I remembered that I had swapped the disk quorum resource somewhere halfway in between the setup. It was earlier on the W: drive, but now on the Q: drive, which has the resource name RealWitness. So in short, the resource name for the quorum disk was correct, but the drive letter and path on that disk was incorrect. Stale information was used when the setup program tried to update the cluster configuration (which in turn would write to the quorum disk).
Important: Please note that this is the specific root cause for my specific situation. There may be other types of problems when the setup program tries to set the cluster access for the SID. One has to look at the inner exception message (which in my case was ‘specified path is invalid’) to be sure. The true root cause for other cases can typically be found by correlating to the cluster log file.
The solution, in this case was to remove the SQL bits (from the Control Panel) and nothing else but to re-run the setup program without any changes. It was a cleaner way, and the ‘moral of the story’ is never tweak cluster quorum in the middle of a SQL Setup procedure!