SharePoint 2013 + Distributed Cache (AppFabric) Troubleshooting

Two caching messages you may have seen if you’ve administered SharePoint 2013 in any way are “This Distributed Cache host may cause cache reliability problems” and/or “cacheHostInfo is null” from PowerShell. This article is about how to fix those errors, and caching reliability problems in general, for SharePoint 2013.

Update: see a simplified version of this article here if you're not sure how AppFabric works with SharePoint.

Cache reliability warnings are fairly common in SharePoint 2013 installations of any complexity. They come from how SharePoint interacts with the distributed cache cluster, which is powered by AppFabric and is used for all sorts of caching needs in 2013: caching user tokens (with a fall-back option if it fails), security trimming search results (also with fall-back on failure), and the social news-feed (with no fall-back; social just doesn’t work without a healthy cache cluster). For the most part a cache failure just means less than optimal performance, but not always.

Therefore if you see this message in SharePoint you should pay attention to it. Here’s an example health error:

image

This message can come up for several reasons but in short, one or more servers that SharePoint thinks should be hosting the cache cluster aren’t, for one reason or another. This guide will hopefully show how to fix this rather broad issue, but the fix depends on what the problem is, so to start you need to pick the scenario that describes your own…

Scenario 1 – SharePoint and AppFabric Don’t Agree Which Servers are in the Cluster

As already mentioned, SharePoint uses AppFabric for caching under the hood, which is an entirely standalone product in its own right. This means that AppFabric has its own ideas about what machines should make up the cluster, in parallel to SharePoint. Normally the two lists of servers coincide perfectly, so nobody notices AppFabric is even a thing until there’s a problem. Any mismatch in server-info between the two products can cause some pretty ugly problems, though, and is often the root cause of the infamous “cacheHostInfo is null” error. The two server-lists need to be identical (and healthy) so let’s check both…

Query AppFabric for Caching Servers/Statuses

To get the list of servers AppFabric thinks there should be, run “Get-CacheHost” (run “Use-CacheCluster” first if necessary). This command gives us not just the servers but also each server’s service status as far as AppFabric’s concerned.
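As a rough sketch of that check, something like the following lists every host and then only the ones that aren’t up. It assumes the AppFabric caching cmdlets are available in your session (as they normally are in the SharePoint 2013 Management Shell) and that the objects Get-CacheHost returns expose a Status property; treat that property name as an assumption and check it against your own output.

```powershell
# Point this PowerShell session at the cache cluster configured on this machine.
Use-CacheCluster

# Every host AppFabric believes is in the cluster, with its service status.
Get-CacheHost

# Assumption: each returned object has a Status property that reads "Up" when healthy.
# Anything returned here is a host AppFabric knows about but isn't happy with.
Get-CacheHost | Where-Object { $_.Status -ne 'Up' }
```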

Query SharePoint for Caching Servers/Statuses

To do the same for SharePoint, run:

Get-SPServiceInstance | ? {($_.service.tostring()) -eq "SPDistributedCacheService Name=AppFabricCachingService"} | select Server, Status

This will give you the same kind of data but from SharePoint’s POV instead. Make sure all statuses say “Online” but, more importantly, that SP & AF both list exactly the same server names. As mentioned before, if you’re seeing “cacheHostInfo is null” somewhere then it’s quite likely there’s a mismatch here.
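Comparing the two lists by eye gets tedious with more than a couple of servers, so here is a minimal sketch of a helper that does the comparison. Compare-CacheHostList is a hypothetical name of my own, not a SharePoint or AppFabric cmdlet, and “sfb-sp15-app1” is a made-up second server for illustration. The comparison is deliberately literal (case-insensitive, but a NetBIOS name and an FQDN count as different), since that is exactly the kind of mismatch we’re hunting.

```powershell
# Hypothetical helper: reports names that appear in one list but not the other.
function Compare-CacheHostList {
    param(
        [string[]]$AppFabricHosts = @(),
        [string[]]$SharePointHosts = @()
    )
    # -notcontains is case-insensitive, so only genuinely different names are reported.
    $onlyAF = @($AppFabricHosts | Where-Object { $SharePointHosts -notcontains $_ })
    $onlySP = @($SharePointHosts | Where-Object { $AppFabricHosts -notcontains $_ })

    [pscustomobject]@{
        OnlyInAppFabric  = $onlyAF
        OnlyInSharePoint = $onlySP
        Match            = ($onlyAF.Count -eq 0 -and $onlySP.Count -eq 0)
    }
}

# Example with literal names (sfb-sp15-app1 is invented for illustration).
Compare-CacheHostList `
    -AppFabricHosts  'sfb-sp15-wfe1.sfb-testnet.local', 'sfb-sp15-app1' `
    -SharePointHosts 'sfb-sp15-wfe1', 'sfb-sp15-app1'

# On a real farm the inputs could come from the two queries in this article, e.g.
# (assumption: Get-CacheHost objects expose a HostName property):
#   $af = Get-CacheHost | ForEach-Object { $_.HostName }
#   $sp = Get-SPServiceInstance |
#         ? { ($_.service.tostring()) -eq "SPDistributedCacheService Name=AppFabricCachingService" } |
#         ForEach-Object { $_.server.name }
#   Compare-CacheHostList -AppFabricHosts $af -SharePointHosts $sp
```

If Match comes back false, the two Only… properties tell you which side has the extra or misnamed server.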

Oh No! AppFabric and SharePoint Server Lists Don’t Match!

Maybe AF thinks there are more servers caching than SharePoint does; maybe the server names don’t coincide. Here’s an example of a server-name mismatch:

image

Even if the names matched by the way, this particular example would also fail because the service-instance is disabled but for now let’s just focus on the name mismatch, which will indeed cause all sorts of cache reliability problems too.

It’s probably going to be AppFabric that’s got a server that SharePoint doesn’t think is caching anything, possibly because said server isn’t in the farm anymore, or at least the name of the server isn’t. At the time of writing, renaming a server with Rename-SPServer won’t update the name in AppFabric too, which causes exactly this type of mismatch (a small “feature” if you will).

In any case, AppFabric and SharePoint need a coinciding list and SharePoint needs the service-instance to be “online” (not “disabled”).

How to Remove Zombie AppFabric Service Instances from SharePoint Topology

If, as is more common, you also need to remove AppFabric service-instances from SharePoint, say because the service-instance is disabled, you can do it with these commands:

$instanceName = "SPDistributedCacheService Name=AppFabricCachingService"
$serviceInstance = Get-SPServiceInstance | ? {($_.service.tostring()) -eq $instanceName -and ($_.server.name) -eq $env:computername}

$serviceInstance.Unprovision()
$serviceInstance.Delete()

This PowerShell snippet tries to un-provision the service on the server (which might fail), then removes the service-instance from the SharePoint configuration database. The query only picks out the service-instance whose server name matches this machine’s name, so there’s no danger of it removing the wrong one as long as it’s run from a PowerShell console on the right machine.

You can do this from any machine for any other machine if you change the last where clause that passes in “this computer name”. For my example above I’ll change the computer-name to “sfb-sp15-wfe1” as that’s the server that has the bad service-endpoint.
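For that remote case, a minimal sketch might look like this. It is the same query as before with the server name passed in explicitly; the $targetServer variable, the null check and the try/catch are my own additions, so adapt them to taste.

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$instanceName = "SPDistributedCacheService Name=AppFabricCachingService"
# The server whose service-instance should go (the bad one from my example).
$targetServer = "sfb-sp15-wfe1"

$serviceInstance = Get-SPServiceInstance | ? {
    ($_.service.tostring()) -eq $instanceName -and ($_.server.name) -eq $targetServer
}

if ($serviceInstance) {
    # Un-provisioning can fail on a broken instance; carry on to the delete regardless.
    try { $serviceInstance.Unprovision() }
    catch { Write-Warning "Unprovision failed on ${targetServer}: $_" }

    # Removes the service-instance from the SharePoint configuration database.
    $serviceInstance.Delete()
}
else {
    Write-Warning "No distributed cache service-instance found for $targetServer"
}
```

Double-check $targetServer before running it; unlike the $env:computername version there is nothing stopping you from deleting a perfectly healthy instance on another box.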

How to Remove Ghost Servers from AppFabric

We need to remove any server that just doesn’t exist in the farm in any way. However, if there’s a server in AppFabric that is in the farm but just shouldn’t be hosting AF, do not use this method; try running “Remove-SPDistributedCacheServiceInstance” on the farm server in question first.

If, on the other hand, manually ripping the host out of the AppFabric cluster is the last resort, this is how. From a machine that is still working in the cluster (if possible), run Unregister-CacheHost, passing in the name of the server to remove, the SharePoint provider, and the “connection-string”, like so:

Unregister-CacheHost -HostName [machine] -ProviderType SPDistributedCacheClusterProvider -ConnectionString \\[machine]

Replace [machine] with the name of the machine you want to evict, exactly as Get-CacheHost reports it. That is normally the NetBIOS name, but in my example it is the fully-qualified name:

Unregister-CacheHost -HostName sfb-sp15-wfe1.sfb-testnet.local -ProviderType SPDistributedCacheClusterProvider -ConnectionString \\sfb-sp15-wfe1.sfb-testnet.local

Once all phantom hosts have been eliminated from AppFabric, all being well we should have a healthy-if-slimmed-down cluster we can re-add other nodes to in the normal way with Add-SPDistributedCacheServiceInstance – which adds to AppFabric and SharePoint both, as the good SPLord intended. Before doing so, verify one more time that both SharePoint and AppFabric have the same server-list and that AppFabric says the server is “up” and SharePoint says the service-instance is “online”.
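As a sketch of that re-add and verify step, assuming you are logged on to the server that should (re)join the cache cluster and are running an elevated SharePoint 2013 Management Shell:

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Adds THIS server to both the AppFabric cluster and the SharePoint topology.
Add-SPDistributedCacheServiceInstance

# SharePoint's view: every distributed cache service-instance should be "Online".
Get-SPServiceInstance |
    ? { ($_.service.tostring()) -eq "SPDistributedCacheService Name=AppFabricCachingService" } |
    select Server, Status

# AppFabric's view: the same server names should be listed here.
Use-CacheCluster
Get-CacheHost
```

Run it on one server at a time and check both lists after each one, so that any new mismatch is easy to pin on the server you just added.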

One More Time: Verify Service End-Points and AppFabric cluster Agree on Servers

All servers need to be in the AppFabric cluster and host an AppFabric service-instance in the farm, and be online:

image

Having cleaned out the rogue entries, I’ve gone back and added the other servers too with Add-SPDistributedCacheServiceInstance which sorts out both the SP and AF configuration at once.

Until you achieve this exact parity do not continue. The AppFabric hosts don’t necessarily need to be “up” at this time but the names have to coincide and SharePoint needs to have the service-instances online.

At this point your caching woes may even be over! In Central Administration get SharePoint to recheck any health-warnings about distributed cache.

Scenario 2 – No Server Mismatch but One or More AppFabric Service Instances are Disabled

At this stage we’ve verified that the server lists in SP and AF match up. Run this PowerShell command to find out if we have zombie endpoints in SharePoint:

Get-SPServiceInstance | ? {($_.service.tostring()) -eq "SPDistributedCacheService Name=AppFabricCachingService"} | select Server, Status

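If any status says “Disabled” then you have a problem. You need to remove the disabled service-instance from SharePoint as described in “How to Remove Zombie AppFabric Service Instances from SharePoint Topology” above, then re-add it by running Add-SPDistributedCacheServiceInstance on that server.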

If for some reason Add-SPDistributedCacheServiceInstance doesn’t give you a healthy endpoint, try running Remove-SPDistributedCacheServiceInstance then Add-SPDistributedCacheServiceInstance on the server in question. If you still can’t get a healthy endpoint after all that you’ll probably need to contact Premier support.
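The remove-then-add fallback can be sketched as follows. Run it locally on the server with the unhealthy endpoint; the final query is just the status check from earlier, narrowed to this machine.

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Take this server out of the cache cluster and the SharePoint topology...
Remove-SPDistributedCacheServiceInstance

# ...then put it back in both.
Add-SPDistributedCacheServiceInstance

# Confirm the local service-instance now reports "Online".
Get-SPServiceInstance |
    ? { ($_.service.tostring()) -eq "SPDistributedCacheService Name=AppFabricCachingService" -and
        ($_.server.name) -eq $env:computername } |
    select Server, Status
```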

Scenario 3 – AppFabric & SharePoint Agree on Cache Servers but Some Servers are Down

In this scenario both products are on the same page about who should be caching but one or more nodes just aren’t for some reason or other.

Problem: Servers use Dynamic or Shared Memory

AppFabric is particularly sensitive to dynamic/shared memory. It can work on it but Microsoft doesn’t support it and if you wanted our help with an AppFabric cluster we wouldn’t do much unless each server had a fixed amount of memory, always.

Now the disclaimer’s done: I’ve had it working just fine on test VMs with dynamic memory, using around 16 GB. I tend to find that if memory usage expands suddenly and the host OS can’t provide the guest OS with memory quickly enough, AppFabric will just give up and you’ll have to re-provision it all over again. The moral of the story is: don’t be cheap on memory and expect AppFabric to work. Really, don’t, especially for anything that’s not your dev-box.

Problem: AppFabric Server Configuration State is Corrupt

First of all let’s see if the failing node even knows about the cluster. I’ve had a couple of occasions where the configuration has just died for various reasons and has just had to be reset. Run a check by getting the local cluster status with “Get-CacheHost” (use “Use-CacheCluster” if necessary).

image

This would suggest the cluster configuration on this failing node has died for reasons we don’t know, nor particularly care about, assuming it’s not a regular occurrence. Cache clusters are trivial to set up so let’s just jump to the solution…

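The solution is to run Remove-SPDistributedCacheServiceInstance and then Add-SPDistributedCacheServiceInstance on the failing node. If you see the “cacheHostInfo is null” message during either of those, remove the service-instance from SharePoint and the host from the AppFabric cluster as shown above, then repeat the remove/add commands.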

Problem: AppFabric Service not Started

You’ll get reliability problems if the service isn’t started.

image

This is bad. This however is good:

image

If the service won’t start for some reason then I’d try removing & re-adding the server with Remove-SPDistributedCacheServiceInstance and Add-SPDistributedCacheServiceInstance.
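If you’d rather check the service from PowerShell than from the Services console, a small sketch like this reports its state on the local machine. It only reads the status; it assumes the Windows service is registered under the name AppFabricCachingService, and it leaves the actual fixing to the SharePoint cmdlets above.

```powershell
# Look up the AppFabric Caching Service on this machine without erroring if it's absent.
$svc = Get-Service -Name AppFabricCachingService -ErrorAction SilentlyContinue

if (-not $svc) {
    Write-Warning "AppFabric Caching Service is not installed on $env:COMPUTERNAME"
}
elseif ($svc.Status -ne 'Running') {
    Write-Warning "AppFabric Caching Service is $($svc.Status) on $env:COMPUTERNAME"
}
else {
    Write-Output "AppFabric Caching Service is running on $env:COMPUTERNAME"
}
```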

Problem: Firewall Interference

Firewalls are a consideration for AppFabric. You should be able to see lots of chatter on port 22234, which is the internal cluster-chatter port. You should also see some activity on port 22233, which is how SharePoint talks to the cluster; just make sure you don’t see TCP retransmissions being sent consistently.

image

Each cluster node needs these ports open to every other node, and network tracing skills come in pretty handy here if you’re not sure whether the ports are open.
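If a full network trace feels like overkill, a quick first pass is simply to try a TCP connection to each port from each node. The sketch below does that with the .NET TcpClient class; Test-TcpPort is a hypothetical helper name of my own, and the server list in the commented usage is illustrative. It only proves that something accepts a connection on the port from the machine you run it on, so it complements a trace rather than replacing it.

```powershell
# Hypothetical helper: returns $true if a TCP connection to the port succeeds in time.
function Test-TcpPort {
    param(
        [string]$ComputerName,
        [int]$Port,
        [int]$TimeoutMs = 2000
    )
    $client = New-Object System.Net.Sockets.TcpClient
    try {
        $async = $client.BeginConnect($ComputerName, $Port, $null, $null)
        # Give up if nothing answers in time (typical of a firewall silently dropping packets).
        if (-not $async.AsyncWaitHandle.WaitOne($TimeoutMs, $false)) { return $false }
        # Throws if the connection was actively refused.
        $client.EndConnect($async)
        return $true
    }
    catch {
        return $false
    }
    finally {
        $client.Close()
    }
}

# Example usage from one cache node towards the others (server names are illustrative):
#   foreach ($node in 'sfb-sp15-wfe1', 'sfb-sp15-app1') {
#       foreach ($port in 22233, 22234) {
#           "{0}:{1} open = {2}" -f $node, $port, (Test-TcpPort -ComputerName $node -Port $port)
#       }
#   }
```

Run it from every cache node in turn, since a port can be open in one direction and blocked in the other.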

More information about ports needed @ https://msdn.microsoft.com/en-us/library/ee790914(v=azure.10).aspx

Edit: my good colleague Filip Bosmans has written a nice script which checks the general health of the cache-cluster all round, so can make some of these checks more automatic. If you're having AppFabric issues, try his script out here.

Wrapping Up

Getting this right isn’t as easy as you might think. For the most part the caching and AppFabric just take care of themselves, but there’s clearly a need to get your hands dirty now & then. Many people don’t even realize that SharePoint just drives AppFabric, or how that setup works; fixing these issues is mainly about understanding how to troubleshoot these two products as one.

Let me know in the comments if there are any scenarios I haven’t covered; this is something I’d like to add to over time if needed. Thanks to my colleagues Filip Bosmans, Vlad Mihat, and others for helping out with this.

Cheers,

// Sam Betts