Hello Cluster fans, I am back for another go at it. For this one, you will need to hark back to the days of Windows 2003 Server Clusters. Many of you still have these running and may have seen Access Denied issues with node joins, with opening Cluster Administrator, and with Cluster Service failures. We seem to be getting these types of problems in bunches, so we decided it was time to blog about it. There are three different errors I will cover, but they all have the same cause.
Before beginning, I have to add the sales pitch to look into upgrading to Windows 2008 R2 Failover Clustering. There are numerous updates, fixes, and enhancements over Windows 2003 Server Clustering.
The first issue is simply opening Cluster Administrator and getting an access denied error.
When you open Cluster Administrator using the name of the Cluster (the default), it must authenticate because it is connecting via an RPC call to the Cluster Name. When you open Cluster Administrator using the period (.), it makes a local LPC call to itself, so the authentication method is different and the connection succeeds.
The second issue you can see is a failure of a node joining the Cluster. In the System Event Log, you will see these events:
Event ID: 1009
Description: Cluster service could not join an existing server cluster and could not form a new server cluster. Cluster service has terminated.
Event ID: 7031
Source: Service Control Manager
Description: The Cluster Service service terminated unexpectedly. It has done this x time(s). The following corrective action will be taken in 960000 milliseconds: Restart the service.
If you look in the Cluster Log of the node failing to join, you will see this:
INFO [CS] Cluster Service started – Cluster Node Version 4.3790
INFO OS Version 5.2.3790 – Service Pack 2 (ADS 03000112L)
INFO [CS] Service Starting…
*** lines removed ***
INFO [INIT] Attempting to join cluster CLUSTER-2003
INFO [JOIN] Spawning thread to connect to sponsor 188.8.131.52
INFO [JOIN] Spawning thread to connect to sponsor CLUSTERNODE01
INFO [JOIN] Asking 184.108.40.206 to sponsor us after delay of 0 milliseconds.
INFO [JOIN] Spawning thread to connect to sponsor 220.127.116.11
WARN [JOIN] Unable to get join version data from sponsor 18.104.22.168 using NTLM package, status 5.
WARN [JOIN] JoinVersion data for sponsor 22.214.171.124 is invalid, status 5.
INFO [JOIN] Asking CLUSTERNODE01 to sponsor us after delay of 1000 milliseconds.
WARN [JOIN] Unable to get join version data from sponsor CLUSTERNODE01 using NTLM package, status 5.
WARN [JOIN] JoinVersion data for sponsor CLUSTERNODE01 is invalid, status 5.
INFO [JOIN] Asking 10.27.101.175 to sponsor us after delay of 2000 milliseconds.
WARN [JOIN] Unable to get join version data from sponsor 126.96.36.199 using NTLM package, status 5.
WARN [JOIN] JoinVersion data for sponsor 188.8.131.52 is invalid, status 5.
INFO [JOIN] Got out of the join wait, CsJoinThreadCount = 1.
ERR [JOIN] Unable to connect to any sponsor node.
WARN [INIT] Failed to join cluster, status 53
You would see this for all IP addresses and node names it tries to connect to. Status 5 is an Access Denied error. If you were to review the Cluster Log on the running node it is trying to join, you would not see any entries about a join taking place. The reason is that the request is not getting past NTLM security. If it cannot get past security, it never reaches the Cluster Service, so nothing is logged.
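When a Cluster Log is long, it can help to pull out just the join attempts that failed with status 5. The following is a minimal sketch in Python, not a supported tool: the function name and the regular expression are my own, and it assumes the log lines look like the excerpt above, with one entry per line.

```python
import re

# Assumption: entries from the join component contain "[JOIN]" and the
# failing ones end with "status 5" (optionally followed by a period).
JOIN_ACCESS_DENIED = re.compile(r"\[JOIN\].*\bstatus 5\.?\s*$")


def find_join_access_denied(log_text):
    """Return the [JOIN] log lines that report status 5 (Access Denied)."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if JOIN_ACCESS_DENIED.search(line)
    ]


# Sample lines copied from the log excerpt in this post.
sample = """\
INFO [JOIN] Asking CLUSTERNODE01 to sponsor us after delay of 1000 milliseconds.
WARN [JOIN] Unable to get join version data from sponsor CLUSTERNODE01 using NTLM package, status 5.
WARN [JOIN] JoinVersion data for sponsor CLUSTERNODE01 is invalid, status 5.
ERR [JOIN] Unable to connect to any sponsor node.
WARN [INIT] Failed to join cluster, status 53
"""

for hit in find_join_access_denied(sample):
    print(hit)
```

The final "status 53" line is deliberately not matched, since it comes from the [INIT] component and is the result of the failed join rather than the Access Denied error itself.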
The third issue can appear when both nodes are already in the Cluster and functioning properly. If you set a Group Policy for one of the settings (described further down) and it gets applied to the nodes, you could see the Cluster Service on a node terminate unexpectedly. In the System Event Log, you would see the same Event 7031 as shown above. In the Cluster Log of this node, you could see something similar to:
WARN [EVT] EvtBroadcaster: EvPropEvents for node 1 failed. status 5
WARN [NM] RpcExtErrorInfo: Error info not found.
WARN [EVT] EvtBroadcaster: EvPropEvents for node 1 failed. status 5
WARN [NM] RpcExtErrorInfo: Error info not found.
WARN [EVT] EvtBroadcaster: EvPropEvents for node 1 failed. status 5
WARN [NM] RpcExtErrorInfo: Error info not found.
ERR [GUM] Update routine of type 1, context 0 failed with status 5
ERR [GUM] GumSendUpdate: Update on non-locker node(self) failed with 5 when it must succeed
ERR [CS] Halting this node to prevent an inconsistency within the cluster. Error status = 5
In the Cluster Log of the node that stays running, you could see this:
INFO [GUM] GumSendUpdate: Dispatching seq 7222 type 1 context 4098 to node 2
INFO [GUM] GumSendUpdate: Locker updating seq 7222 type 1 context 4098
ERR [GUM] GumUpdateRemoteNode: Failed to get completion status for async RPC call,status 5
ERR [GUM] GumSendUpdate: Update on node 2 failed with 5 when it must succeed
WARN [NM] RpcExtErrorInfo: Error info not found.
ERR [GUM] GumpCommFailure 5 communicating with node 2
WARN [NM] RpcExtErrorInfo: Error info not found.
INFO [NM] Received advice that node 2 has failed with error 5.
INFO [NM] Received advice that node 2 has failed with error 5.
ERR [NM] Banishing node 2 from active cluster membership.
All of the above things can happen when all three of the following are true.
- The account used to start the Cluster Service has a password of fewer than 15 characters.
- The "Network security: LAN Manager authentication level" policy is set to “Send LM & NTLM responses” or “Send LM & NTLM – use NTLMv2 session security if negotiated”:
LmCompatibilityLevel: REG_DWORD: 0 or 1
- The "Network security: Do not store LAN Manager hash value on next password change" policy is enabled:
NoLMHash: REG_DWORD: 1
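The three conditions above can be expressed as a simple check. This is a rough sketch in Python rather than anything shipped with Windows: the function names are hypothetical, the password length has to be supplied by you (it cannot be read from the system), and I am assuming both values live under HKLM\SYSTEM\CurrentControlSet\Control\Lsa on the machine you run it on. Which machine each value should be read from is not covered here.

```python
# Assumed registry location of both values (under HKEY_LOCAL_MACHINE).
LSA_KEY = r"SYSTEM\CurrentControlSet\Control\Lsa"


def is_at_risk(password_length, lm_compat_level, no_lm_hash):
    """True only when all three conditions from the list above hold."""
    return (
        password_length < 15            # short Cluster Service account password
        and lm_compat_level in (0, 1)   # LM & NTLM responses are sent
        and no_lm_hash == 1             # LM hash is not stored
    )


def read_lsa_settings():
    """Read (LmCompatibilityLevel, NoLMHash) from the local registry.

    Windows only. A missing value is returned as None, which is_at_risk
    treats as "condition not met"; that is an assumption of this sketch,
    because the effective default depends on the OS version.
    """
    import winreg  # imported here so the rest of the module loads anywhere

    def value_or_none(key, name):
        try:
            return winreg.QueryValueEx(key, name)[0]
        except FileNotFoundError:
            return None

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, LSA_KEY) as key:
        return (
            value_or_none(key, "LmCompatibilityLevel"),
            value_or_none(key, "NoLMHash"),
        )


# Example with hand-entered values: a 12-character password,
# LmCompatibilityLevel 1, and NoLMHash enabled.
print(is_at_risk(12, 1, 1))
```

On a cluster node you could call read_lsa_settings() and pass the two results to is_at_risk() along with the known password length. Changing any one of the three inputs makes the check return False.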
Instead of storing your user account password in clear-text, Microsoft Windows generates and stores user account passwords by using two different password representations, generally known as "hashes." When you set or you change the password for a user account to a password that contains fewer than 15 characters, Windows generates both a LAN Manager Hash (LMHash) and a Microsoft Windows NT hash (NT hash) of the password. These hashes are stored in the local Security Accounts Manager (SAM) database or in Active Directory.
If the "Network security: Do not store LAN Manager hash value on next password change" policy is set, no LMHash is stored for the Cluster Service account (CSA) in Active Directory.
When a password of fewer than 15 characters is used for the CSA, and you join the second node, open Cluster Administrator, or updates occur between the nodes, the process generates the LMHash to build a session key for authentication. Because no LMHash is stored in Active Directory, the Domain Controller cannot build a matching session key, so access is denied. When you use a password that has 15 or more characters for the CSA, an LMHash cannot be generated by the setup process. Instead, the Windows NT password hash is used to derive the session key. The Domain Controller is able to generate a matching session key, and the authentication succeeds.
So to resolve this, you simply need to change any one of the three conditions above. Once this is done, you should see no more access denied errors.
One thing to note: if "Network security: Do not store LAN Manager hash value on next password change" is enabled, then you must set "Network security: LAN Manager authentication level" to “Send NTLM response only” or above.
If you decide to go with the Cluster Service password change, you can use the /CHANGEPASS: command so that the Cluster Service does not need to be taken down in production.
KB305813 - How to change the Cluster service account password
We have two articles that talk about this in a little more detail that you can also refer to:
KB823659 - Client, service, and program incompatibilities that may occur when you modify security settings and user rights assignments
KB828861 - Cluster service account password must be set to 15 or more characters if the NoLMHash policy is enabled
This information should resolve most if not all of the access denied problems you could be receiving.
Happy Clustering !!