“Failed to connect to hosts in the cluster” – SharePoint 2013

In my previous blog post, I introduced my SharePoint 2013 farm. It has four virtual machines in total: litdc, litsp1, litsp3 and litsql1.

Note: this is written for SharePoint 2013 Preview. Things might change over the next few months up until RTM.

I installed and configured SharePoint 2013 on litsp3 first, as that was the server I had chosen to host the Central Administration site.

image

While adding my second server, litsp1, I ran into an error that is brand new in SharePoint 2013. It occurred while the “Distributed Cache” (AppFabric cache) service was being configured on that server. The same issue is mentioned in this blog post as well.

I got the same error message whether I ran PSConfig through the UI or used Connect-SPConfigurationDatabase. The more detailed PSConfig log is given below.

08/04/2012 17:29:03 8 INF Resource id to be retrieved is ConfigurationDatabaseTaskConnectFailConfigDisplayLabel for language English (United States)
08/04/2012 17:29:03 8 INF Resource retrieved id ConfigurationDatabaseTaskConnectFailConfigDisplayLabel is Failed to connect to the configuration database.
08/04/2012 17:29:03 8 INF Leaving function StringResourceManager.GetResourceString
08/04/2012 17:29:03 8 ERR Failed to connect to the configuration database.
An exception of type System.Management.Automation.CmdletInvocationException was thrown. Additional exception information: ErrorCode<ERRCAdmin040>:SubStatus<ES0001>:Failed to connect to hosts in the cluster
System.Management.Automation.CmdletInvocationException: ErrorCode<ERRCAdmin040>:SubStatus<ES0001>:Failed to connect to hosts in the cluster ---> Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCAdmin040>:SubStatus<ES0001>:Failed to connect to hosts in the cluster
at Microsoft.ApplicationServer.Caching.AdminApi.CacheAdmin.GetAllowedServerVersionRange()
at Microsoft.ApplicationServer.Caching.AdminApi.CacheAdmin.IsClusterUpgradeInProgress()
at Microsoft.ApplicationServer.Caching.AdminApi.AdminUpgradeController.SetRollingUpgradeStatus()
at Microsoft.ApplicationServer.Caching.Commands.ConnectAFCacheClusterConfigurationCommand.BeginProcessing()
at System.Management.Automation.Cmdlet.DoBeginProcessing()
at System.Management.Automation.CommandProcessorBase.DoBegin()
--- End of inner exception stack trace ---
at System.Management.Automation.Runspaces.PipelineBase.Invoke(IEnumerable input)
at Microsoft.SharePoint.DistributedCaching.Utilities.SPVelocityPowerShellWrapper.GetCacheClusterInfo(String provider, String connectionString)
at Microsoft.SharePoint.DistributedCaching.Utilities.SPVelocityPowerShellWrapper.IsCacheClusterIntialized(String provider, String connectionString)
at Microsoft.SharePoint.DistributedCaching.Utilities.SPDistributedCacheClusterConfigHelper.IsCacheClusterInitialized(SPDistributedCacheClusterConfigStorageLocation cacheConfigStorageLocation)
at Microsoft.SharePoint.DistributedCaching.Utilities.SPDistributedCacheClusterInfoManager.IsSPDistributedCacheClusterInitialized(SPDistributedCacheClusterConfigStorageLocation clusterConfigStorageLocation)
at Microsoft.SharePoint.DistributedCaching.Utilities.SPDistributedCacheService.<EnsureSPDistributedCacheHost>b__0()
at Microsoft.SharePoint.DistributedCaching.Utilities.SPDistributedCacheService.RunWithRetries[T](Int32 maxAttempts, CodeToRunWithRetries codeToRunWithRetries)
at Microsoft.SharePoint.DistributedCaching.Utilities.SPDistributedCacheService.EnsureSPDistributedCacheHost()
at Microsoft.SharePoint.Administration.SPFarm.Join(Boolean skipRegisterAsDistributedCacheHost)
at Microsoft.SharePoint.PostSetupConfiguration.ConfigurationDatabaseTask.CreateOrConnectConfigDb()
at Microsoft.SharePoint.PostSetupConfiguration.ConfigurationDatabaseTask.Run()
at Microsoft.SharePoint.PostSetupConfiguration.TaskThread.ExecuteTask()

Looking at the call stack, it was clear that the failure happens while configuring the “Distributed Cache”. There are two possible workarounds. The first is to join litsp1 to the farm while specifying the following parameter.

Connect-SPConfigurationDatabase -SkipRegisterAsDistributedCacheHost

This skips AppFabric Caching activation on the new server during the farm join. Then, after the join is complete, start “Distributed Cache” on that server using the PowerShell cmdlet below.

Add-SPDistributedCacheServiceInstanceOnLocalServer

Alternatively, use the SQL Server instance name instead of the SQL client alias, as mentioned in the blog post above (I haven’t tested this option, though).
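
As a rough sketch, the first workaround might look like the following when run on litsp1. The database server (litsql1) and database name (SharePoint_Config) are placeholders I have assumed for illustration, so substitute your own farm details. The remaining farm-join steps are left out, and the cmdlet names are the ones from the Preview build.

```powershell
# Load the SharePoint snap-in if this is a plain PowerShell session.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Assumed placeholder values; replace with your own farm details.
$dbServer   = "litsql1"
$dbName     = "SharePoint_Config"
$passphrase = Read-Host -Prompt "Farm passphrase" -AsSecureString

# Join the farm without registering this server as a Distributed Cache host.
Connect-SPConfigurationDatabase -DatabaseServer $dbServer -DatabaseName $dbName -Passphrase $passphrase -SkipRegisterAsDistributedCacheHost

# Once the join has completed, start Distributed Cache on this server.
Add-SPDistributedCacheServiceInstanceOnLocalServer
```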

But I was curious about the real root cause and how to resolve it. A thread on TechNet helped me find the actual root cause and fix it.

litsp1 was trying to join the server farm and connect to the AppFabric host on litsp3, but it failed to find the host. I checked the DistributedCacheService.exe.config file in the C:\Windows\System32\AppFabric directory on litsp3, noted the ports used by AppFabric and confirmed that they were open in the firewall. The default port numbers were 22233, 22234, 22235 and 22236. The screenshot below shows the inbound rule; an outbound rule was also already defined by default.

image
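
Besides looking at the firewall rules, the ports can be probed from litsp1 directly. The following is a minimal sketch using the .NET TcpClient class; the variable names are my own. A failed connection does not necessarily mean the firewall is blocking the port, as it may simply be that nothing is listening on it at that moment.

```powershell
# Server hosting the cache and the default AppFabric ports from the config file.
$cacheHost = "litsp3"
$ports = 22233, 22234, 22235, 22236

foreach ($port in $ports) {
    $client = New-Object System.Net.Sockets.TcpClient
    try {
        # Throws if the TCP connection cannot be established.
        $client.Connect($cacheHost, $port)
        Write-Host "Port $port on $cacheHost is reachable"
    } catch {
        Write-Host "Port $port on $cacheHost is NOT reachable: $($_.Exception.Message)"
    } finally {
        $client.Close()
    }
}
```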

As mentioned in the TechNet forum, during the AppFabric cache configuration litsp1 was trying to read the registry configuration of this service on litsp3. I was able to confirm this with the following steps.

  1. On my litsp1 server, opened the “Run” dialog from the Start menu.
  2. Launched RegEdit.exe.
  3. In the Registry Editor, went to the File menu.
  4. Clicked Connect Network Registry.
  5. Typed in the name of the other server (litsp3) and clicked “Connect”.
  6. Got an error message saying that it could not connect or make any modifications.
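
The same check can be scripted instead of clicking through RegEdit. This is only a sketch, run from litsp1, that tries to open the HKEY_LOCAL_MACHINE hive on litsp3 through the .NET RegistryKey class. It tests remote registry connectivity in general and does not look at any specific AppFabric key.

```powershell
# Server whose registry we want to reach remotely.
$remoteServer = "litsp3"

try {
    # Throws if the remote registry cannot be reached or access is denied.
    $hklm = [Microsoft.Win32.RegistryKey]::OpenRemoteBaseKey([Microsoft.Win32.RegistryHive]::LocalMachine, $remoteServer)
    Write-Host "Connected to the registry on $remoteServer"
    $hklm.Close()
} catch {
    Write-Host "Could not connect to the registry on ${remoteServer}: $($_.Exception.Message)"
}
```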

To get it working, I first had to start the “Remote Registry” service on the first server, litsp3.

image
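
If you prefer PowerShell to the Services console, something along these lines should do the same thing when run in an elevated session on litsp3. Setting the startup type to Automatic is my own assumption, so that the service also survives a reboot; I only started the service by hand.

```powershell
# Assumption: make the service start automatically (it cannot be started while disabled).
Set-Service -Name RemoteRegistry -StartupType Automatic

# Start the Remote Registry service and show its current state.
Start-Service -Name RemoteRegistry
Get-Service -Name RemoteRegistry
```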

But it still could not connect to the registry from the litsp1 server. I figured out that the problem was now the firewall: I had to enable the inbound rule “Remote Service Management (NP-In)”.

image
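
The rule can also be enabled from an elevated PowerShell session. The sketch below assumes Windows Server 2012, where the NetSecurity cmdlets are available, and assumes the rule is enabled on litsp3, the server being connected to. On an older operating system the netsh command in the comment is the equivalent.

```powershell
# Display name of the built-in inbound rule.
$ruleName = "Remote Service Management (NP-In)"

# Enable the rule, then show its state to confirm the change.
Enable-NetFirewallRule -DisplayName $ruleName
Get-NetFirewallRule -DisplayName $ruleName | Select-Object DisplayName, Enabled, Direction, Action

# Equivalent where the NetSecurity module is not available:
# netsh advfirewall firewall set rule name="Remote Service Management (NP-In)" new enable=yes
```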

You may experience this issue if you have a firewall enabled and remote registry access between the SharePoint servers is blocked. This means the “Remote Registry” service needs to be started on all SharePoint servers before they connect to the SharePoint farm to configure the AppFabric cache. If you have a firewall, you may need further configuration, as it may block the connection.