Introduction to Troubleshooting AppFabric Reliability Issues for SharePoint


AppFabric & distributed cache issues in SharePoint come up with reasonable regularity, often because SharePoint admins have never been introduced to this extra layer of SharePoint before. Read this if you want to get to know how SharePoint & AppFabric/distributed-cache hang together, or you want to know the basics of troubleshooting them.

I’ve touched on this before but never done a n00bie introduction version with pretty graphics, so here it is.

First up though, the basics of how SharePoint hangs together with AppFabric.

Parallel Universes: SharePoint + AppFabric

Something that you should learn upfront is that, despite being packaged as one product, SharePoint & AppFabric really are separate universes. SharePoint uses AppFabric like it uses SQL Server; SharePoint is just a consuming client of AppFabric but unlike SQL Server, SharePoint needs AppFabric’s configuration to mirror SharePoint’s own configuration. Herein lies the root of most problems that we see when things go wrong; read on…

SharePoint has a list of servers it thinks are running the distributed cache service, just like any other service in the farm – user-profiles for example. The complication is that AppFabric has its own, very separate list of servers, and the two lists can occasionally get out of sync.

Confused SharePoint Cat

That’s right, Mr Cat; SharePoint caching isn’t done by SharePoint. Here’s what we should logically have, on a good day:

image

When we add a “cache instance”, on a normal day we’ll add to both lists:

image

Running Add-SPDistributedCacheServiceInstance on SharePoint server “server3” would give us (strangely enough) this:

image

Not exactly rocket-science, but a lot of SPAdmins aren’t aware of this parallel adding going on under the hood.

Obviously, removing a cache server does the reverse:

image

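So far so good. Servers are normally added & removed with these SharePoint cmdlets, run on the server in question:

  • Add-SPDistributedCacheServiceInstance
  • Remove-SPDistributedCacheServiceInstance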

Normally these commands work too, but not always…

What if Something Bad Happens When You Add/Remove a Cache Server?

Good question. The PowerShell cmdlets aren’t overly elegant in that if either part of the add/remove fails, the cmdlet in question just powers-on to the end anyway, albeit while reporting the error at least.

For example, we try and add a server but one of AppFabric’s ports (22233 to 22236 by default) is already in use, so AppFabric refuses to add the machine to the cluster; this happens:

image

Now we have an imperfect match, so AppFabric & SharePoint won’t agree on which servers are in the cluster.

image

Bad times in the caching world. The world won’t end but you have an unhealthy cluster there.

The reverse can happen too, of course: SharePoint provisioning fails, normally with SharePoint complaining “cacheHostInfo is null”, while AppFabric adds the server just fine.

The point is we have a mismatch, and it’ll need to be fixed.

How to Find Out if I Have a Server Mismatch?

This is pretty easy; you need to query each “side” separately. To see which cache-servers SharePoint thinks there are, run this:

  • Get-SPServiceInstance | ? {($_.service.tostring()) -eq "SPDistributedCacheService Name=AppFabricCachingService"} | select Server, Status

For AppFabric run this from a machine already in the cache-cluster:

  • Use-CacheCluster
  • Get-CacheHost

Run both SharePoint & AppFabric snippets. You’ll see something like what’s below:

clip_image014

Study this output well, as it tells you all you need to know about whether your distributed cache is healthy or not.

The names of the machines need to match (SharePoint uses the NetBIOS names instead of the FQDN, but they’re still obviously the same).
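If you’d rather not compare the two lists by eye, something like the following could do it for you. This is only a sketch: Compare-CacheHostNames is a made-up helper name (not a SharePoint or AppFabric cmdlet), and it assumes that trimming each name to its first label is enough to make the NetBIOS names and FQDNs line up in your environment.

```powershell
# Hypothetical helper: compares the server names reported by each "side".
function Compare-CacheHostNames {
    param(
        [string[]]$SharePointServers,   # NetBIOS names, as SharePoint reports them
        [string[]]$AppFabricHosts       # FQDNs (or short names), as AppFabric reports them
    )

    # Reduce every name to its first label, upper-cased,
    # so "server1.contoso.local" lines up with "SERVER1".
    $sp = @($SharePointServers | Where-Object { $_ } |
            ForEach-Object { ($_ -split '\.')[0].ToUpperInvariant() })
    $af = @($AppFabricHosts | Where-Object { $_ } |
            ForEach-Object { ($_ -split '\.')[0].ToUpperInvariant() })

    [pscustomobject]@{
        OnlyInSharePoint = @($sp | Where-Object { $af -notcontains $_ })
        OnlyInAppFabric  = @($af | Where-Object { $sp -notcontains $_ })
    }
}

# Example with literal names (no farm needed):
# Compare-CacheHostNames -SharePointServers 'SERVER1','SERVER2' `
#                        -AppFabricHosts 'server1.contoso.local','server3.contoso.local'
#
# On a real farm the inputs could come from the two snippets above, e.g.
#   $spNames = Get-SPServiceInstance | ? {...} | % { $_.Server.Name }
#   $afNames = Get-CacheHost | % { $_.HostName }   # assumes the host objects expose HostName
```

Both properties coming back empty means the two sides agree on the server list; anything listed in either one is a server worth a closer look.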

Servers Aren’t “Up” in AppFabric

Finally, SharePoint needs to think the service-instance is “online” and AppFabric needs to think the service is “up”, on each server. If either isn’t the case, find out why. When AppFabric says “up” it means “I am having constant communication with this server”, whereas SharePoint’s “online” just means “in theory, this server could be used for this service (but who knows if it’ll work)”.

In other words, AppFabric not being “up” is normally a case of firewall issues preventing communication.
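A quick way to rule a firewall in or out is to try opening a TCP connection to the cache port from another server in the cluster. Here’s a rough sketch using plain .NET sockets; Test-TcpPort is a hypothetical helper name, and 22233 is assumed to be the cache port because that’s the AppFabric default (check yours if it was changed).

```powershell
# Hypothetical helper: returns $true if a TCP connection can be opened, otherwise $false.
function Test-TcpPort {
    param(
        [Parameter(Mandatory = $true)][string]$ComputerName,
        [int]$Port = 22233,        # assumed default AppFabric cache port
        [int]$TimeoutMs = 3000
    )

    $client = New-Object System.Net.Sockets.TcpClient
    try {
        # Start the connection attempt and wait up to $TimeoutMs for it to finish.
        $async = $client.BeginConnect($ComputerName, $Port, $null, $null)
        if (-not $async.AsyncWaitHandle.WaitOne($TimeoutMs)) { return $false }

        # EndConnect throws if the connection was refused or the name didn't resolve.
        $client.EndConnect($async)
        return $true
    }
    catch {
        return $false
    }
    finally {
        $client.Close()
    }
}

# Example: Test-TcpPort -ComputerName 'server3' -Port 22233
```

A $false here only tells you the port couldn’t be reached from where you ran it; it doesn’t say whether a firewall, a stopped service or a wrong name is to blame.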

Fixing Problematic Distributed Cache Servers

So one server doesn’t look right in one of the outputs or the AppFabric service on that machine is having problems or something – what should you do?

  1. Try to remove then add the server the nice way.
  2. If that didn’t work, get your hands dirty removing the server, then add it again the nice way.

The idea is to flush out the breaking server from both SharePoint & AppFabric, then add it again. It’s not beautiful but it has fixed 95% of AppFabric issues in my experience. So, on the offending server run:

  • Remove-SPDistributedCacheServiceInstance
  • Add-SPDistributedCacheServiceInstance

Did you get any error adding the server again? If not, you probably now have a good configuration in SharePoint & AppFabric for the server. Compare the outputs from both sides again to check.
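Because the cmdlets tend to carry on after an error, it can be handy to capture what each step complained about. The sketch below wraps the remove-then-add cycle; Reset-LocalDistributedCacheInstance is a hypothetical name, and it assumes you run it in the SharePoint Management Shell on the offending server, since both cmdlets act on the local machine.

```powershell
# Hypothetical wrapper: removes then re-adds the local cache instance,
# recording the error message (if any) from each step.
function Reset-LocalDistributedCacheInstance {
    $result = [pscustomobject]@{ RemoveError = $null; AddError = $null }

    try {
        Remove-SPDistributedCacheServiceInstance -ErrorAction Stop | Out-Null
    }
    catch {
        # A failed remove is not fatal here; the add is still attempted.
        $result.RemoveError = $_.Exception.Message
    }

    try {
        Add-SPDistributedCacheServiceInstance -ErrorAction Stop | Out-Null
    }
    catch {
        $result.AddError = $_.Exception.Message
    }

    $result
}

# Example: Reset-LocalDistributedCacheInstance | Format-List
```

An empty AddError suggests the add went through; it’s still worth comparing the SharePoint and AppFabric outputs afterwards rather than trusting this alone.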

Did you see an error when you added it again? Which part failed when you added it? Read on…

AppFabric Didn’t Like the New Server

This is normally for one of two reasons: the server “exists” already in AppFabric, or the port was still in use.

If the server is “already” in AppFabric, run:

  • Unregister-CacheHost -HostName [machine] -ProviderType SPDistributedCacheClusterProvider -ConnectionString \\[machine]

Change [machine] for the FQDN of the breaking server in both locations in that line (don’t remove the “\\” part – you’ll need that).

Now run both remove/add commands again (see above).

SharePoint Didn’t Like the New Server

SharePoint will fail the “Add-SPDis…” if it too thinks the server has an instance running this service, even if it’s disabled. Strange but true.

If SharePoint gave an error adding the server, run:

  • $instanceName = "SPDistributedCacheService Name=AppFabricCachingService"
  • $serviceInstance = Get-SPServiceInstance | ? {($_.service.tostring()) -eq $instanceName -and ($_.server.name) -eq "[machine]"}
  • $serviceInstance.Unprovision()
  • $serviceInstance.Delete()

Change [machine] for the NetBIOS name of the breaking server. The “Unprovision” command will probably fail if the service-instance is broken but don’t worry if it does.

This should clean SharePoint out. Now run both remove/add commands again (see above).
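If you find yourself doing this clean-up more than once, the same steps could be wrapped so that a failing “Unprovision” doesn’t stop the delete. Treat this as a sketch only: Remove-BrokenCacheServiceInstance is a made-up name, and it simply repeats the lines above with a try/catch around the unprovision step.

```powershell
# Hypothetical helper: deletes the distributed cache service instance(s)
# SharePoint holds for the given server, even if Unprovision fails.
function Remove-BrokenCacheServiceInstance {
    param(
        [Parameter(Mandatory = $true)][string]$ServerName   # NetBIOS name of the breaking server
    )

    $instanceName = "SPDistributedCacheService Name=AppFabricCachingService"
    $instances = @(Get-SPServiceInstance | Where-Object {
        $_.Service.ToString() -eq $instanceName -and $_.Server.Name -eq $ServerName
    })

    foreach ($instance in $instances) {
        try {
            [void]$instance.Unprovision()
        }
        catch {
            # Expected when the instance is already broken; carry on to the delete.
            Write-Warning "Unprovision failed on ${ServerName}: $($_.Exception.Message)"
        }
        [void]$instance.Delete()
    }

    # Return how many instances were deleted.
    $instances.Count
}

# Example: Remove-BrokenCacheServiceInstance -ServerName 'SERVER3'
```

A return value of 0 means SharePoint had no matching instance for that server name, in which case double-check you passed the NetBIOS name.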

Still No Joy?

Well, you might need to get your hands dirtier still. See my other article, which looks at this issue in slightly more depth. Alternatively, my good colleague Filip Bosmans has written a nice script which checks the general health of the cache-cluster all round, so it can make some of these checks more automatic. If you’re having AppFabric issues, try his script out here.

Wrap-Up

Hopefully this has helped explain how these two products work together, and shown some basic tricks for handling most of the problems that come up.

Cheers,

Sam Betts

Comments (4)

  1. Jason Vickers says:

    We ran into both sides of this issue at Gulfstream.  This is a really good explanation of something that's not that obvious.  

  2. Thanks for the feedback! 🙂

  3. Hi,

    Issue with "port already in use" error is fixed in AppFabric CU7 support.microsoft.com/…/3092423

  4. Morris says:

    Thank you for this post. I was looking for a solution concerning a Feed Cache Repopulation Job error, which lead to a non working Newsfeed. Everything work’s fine again!
