Graceful SharePoint AppFabric Restarts


Many people have asked how to cleanly restart an AppFabric server so that data in the cache isn’t lost, and some have found they couldn’t get it to work themselves. It’s a good question, and I hope to answer it here to some extent, partly because the official commands don’t actually work so well by default.

Update: the TechNet documentation has since been updated with a script that should work just as well as the guide below. The rest of this post explains what’s happening in that script.

First, a quick test to demonstrate the cache working, so we can easily see when it breaks. Here I’ve got a small app I made that makes a new post to my own social feed every 2 seconds.

Pretty simple concept. The key part of this test is that social feeds only show up in SharePoint when AppFabric is working nicely.

If AppFabric dies unexpectedly and cache data is lost, then quite simply social feeds won’t appear correctly.

Side point: incidentally, you’ll see lots of people complaining about social data not appearing. It’s because there’s nothing in the cache, probably because the cache is broken and needs to be repaired.

But I digress; the newsfeed makes for a nice visual test for when we’ve broken our test AppFabric cluster.

The AppFabric Server Restart Tests

So what we’re going to do is run two tests to show the right and wrong way of restarting an AppFabric machine, what happens when AppFabric breaks, and why. My environment has two AppFabric servers. We’ll reboot one and see what happens; then we’ll do the same again, but with a graceful shutdown first this time.

Breaking AppFabric Test – The Norm

So first let’s run a test that’ll do horrible things to our AppFabric cluster, otherwise known as just rebooting an AppFabric machine like you would normally. In the healthy cluster state, all servers are online and servicing caching requests.

Now to reboot our victim server “search-idx” just like you would on any other day.

Bang! Adios, AppFabric (until it can get its knickers untwisted again, which it eventually will). Let’s look at the damage.

Get-CacheHost reports the machine offline and generally gets quite confused.

…and lo, the social feed is apparently “collecting”.


Interestingly, you’ll see lots of cache fractions suddenly unallocated if you run Get-CacheClusterHealth:

Unallocated named cache fractions
---------------------------------
NamedCache = blah
Unallocated fraction = 5.12

…and so on.

Unallocated fractions are basically cache segments that don’t have a server (because the server they did have dropped off the cluster suddenly), so our social feed and anything else that needed that cache will have to load it again. Eventually, of course, the cache will rebuild and there’ll be no more unallocated fractions, but it takes a while, depending on the amount of data in there, the servers to copy to, speed, system load, etc.
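
If you’d rather not eyeball that output, here’s a minimal sketch that pulls the unallocated fractions out of the text. Get-UnallocatedFraction is a hypothetical helper name of my own, and it assumes the lines look like the output quoted above, so treat it as a starting point rather than anything official.

```powershell
# Hypothetical helper (not part of AppFabric): extracts NamedCache / fraction pairs
# from the text printed by Get-CacheClusterHealth. The line format is assumed to
# match the output quoted in this post; adjust the patterns if yours differs.
function Get-UnallocatedFraction {
    param(
        [Parameter(Mandatory)]
        [string]$HealthText
    )

    $currentCache = $null
    foreach ($line in $HealthText -split "`r?`n") {
        if ($line -match '^\s*NamedCache\s*=\s*(.+?)\s*$') {
            # Remember the most recent cache name; fraction lines follow it.
            $currentCache = $Matches[1]
        }
        elseif ($line -match '^\s*Unallocated fraction\s*=\s*([\d.]+)') {
            [pscustomobject]@{
                NamedCache = $currentCache
                Fraction   = [double]::Parse($Matches[1], [cultureinfo]::InvariantCulture)
            }
        }
    }
}

# Possible usage on a cache host (assumes Use-CacheCluster has been run and that
# piping the health result through Out-String gives the text shown above):
# Get-UnallocatedFraction -HealthText (Get-CacheClusterHealth | Out-String)
```

An empty result would suggest every fraction currently has a server, which is what you want to see before and after a restart.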

Now let’s see how to restart an AppFabric server without causing any hiccups.

A Graceful AppFabric Restart

This time we’ll shut down gracefully before rebooting. First, run this command (changing the hostname, of course):

Stop-CacheHost -HostName sp15-search-idx.sfb-testnet.local -CachePort 22233 -Graceful
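
If you want to script that step, a small wrapper along these lines might do. Stop-CacheHostGracefully is a hypothetical name of my own; the only real cmdlets used are Use-CacheCluster (which connects the session to the cache cluster) and the Stop-CacheHost call shown above, and the default port of 22233 is simply taken from that command.

```powershell
# Hypothetical wrapper around the command above. Supports -WhatIf so you can
# check which host would be stopped before actually doing it.
function Stop-CacheHostGracefully {
    [CmdletBinding(SupportsShouldProcess)]
    param(
        [Parameter(Mandatory)]
        [string]$HostName,

        # Assumed default, taken from the command in this post.
        [int]$CachePort = 22233
    )

    if ($PSCmdlet.ShouldProcess("${HostName}:$CachePort", 'Graceful cache host stop')) {
        # Connect this PowerShell session to the cache cluster first.
        Use-CacheCluster
        Stop-CacheHost -HostName $HostName -CachePort $CachePort -Graceful
    }
}

# Dry run first, then for real:
# Stop-CacheHostGracefully -HostName 'sp15-search-idx.sfb-testnet.local' -WhatIf
# Stop-CacheHostGracefully -HostName 'sp15-search-idx.sfb-testnet.local'
```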

Now I bet that surprised you: in the official SharePoint/AppFabric documentation we’re told to run “Stop-SPDistributedCacheServiceInstance -Graceful”. However, for reasons too complicated to go into here, let’s just say for now that the official stop command is far from graceful: the service is in fact dropped like a hot potato, and anything on that host in AppFabric goes with it.

More on that another day (it’s not a simple subject). For now, though, running Stop-CacheHost instead works as expected, and once it’s executed “Get-CacheHost” will list the host with a “SHUTTING DOWN” service status.

Edit: the updated documentation at https://technet.microsoft.com/en-us/library/jj219613.aspx#graceful has a nice script to automate this graceful shutdown.

The “SHUTTING DOWN” service status against our server means it’s offloading cache fragments to every other host in the cluster that is still “UP” and not accepting any new fragments, while still being effectively online for cache queries. Very graceful indeed.

Monitoring AppFabric Graceful Shutdowns

As the node is shutting down, Get-CacheClusterHealth will show that node’s “healthy” object count shrink while the counts on the other server(s) increase. It can take a while (15-20 mins) and is quite boring to watch, in fact, but if you must wait until it’s done, then re-running the command will show the numbers slowly shift towards the nodes that aren’t shutting down.

You’ll know it’s ready when Get-CacheHost eventually shows the shutting-down host as just “DOWN”. Once the server status is “DOWN”, you can finally run the other command in the documented process to complete the server shutdown prep:

Remove-SPDistributedCacheServiceInstance

This will remove the server from the farm topology for any AppFabric requests. In reality that just keeps SharePoint on the same page as AppFabric about which machines should be servicing cache requests, and does little else.
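
Since watching the host drain is so dull, the wait-until-“DOWN” step can be sketched as a polling loop. Wait-CacheHostDown and its parameters are hypothetical names of my own; the status check is passed in as a script block so the loop itself doesn’t depend on AppFabric, and the 30-minute default timeout is just a guess based on the 15-20 minutes mentioned above.

```powershell
# Hypothetical helper: polls a status check until it reports 'Down' or a timeout passes.
# Returns $true if the host reached 'Down', $false if the timeout expired first.
function Wait-CacheHostDown {
    param(
        # Script block that returns the current host status (e.g. Up, ShuttingDown, Down).
        [Parameter(Mandatory)]
        [scriptblock]$GetStatus,

        [int]$TimeoutSeconds = 1800,   # assumed default, not an official figure
        [int]$PollSeconds = 15
    )

    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    do {
        $status = [string](& $GetStatus)
        Write-Verbose "Cache host status: $status"
        if ($status -eq 'Down') { return $true }   # -eq is case-insensitive
        Start-Sleep -Seconds $PollSeconds
    } while ((Get-Date) -lt $deadline)

    return $false
}

# Possible usage on the cache host (assumes Stop-CacheHost -Graceful was already run):
# Use-CacheCluster
# $done = Wait-CacheHostDown -Verbose -GetStatus {
#     (Get-CacheHost -HostName 'sp15-search-idx.sfb-testnet.local' -CachePort 22233).Status
# }
# if ($done) { Remove-SPDistributedCacheServiceInstance }
```

Only running Remove-SPDistributedCacheServiceInstance when the loop returns true keeps the order described above: drain first, then remove.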

Now you’re finally ready to reboot your server with the peace of mind that your AppFabric cache won’t be impacted!

Once the machine has rebooted, re-add the server to the AppFabric cache cluster with:

Add-SPDistributedCacheServiceInstance

What Happens If AppFabric Cache Doesn’t Shut Down Gracefully?

Not much, as it happens. In short, no data is lost beyond any immediately uncommitted “likes” or “follows” or whatever’s going on in the baked-in social functionality (as opposed to Yammer, which frankly is better). Logon tokens may be lost, which may result in one or two users being politely asked to log in again, but SPLife will basically go on without much drama.

As mentioned before though, the biggest victim tends to be the social feeds if you use them, which may die out depending on what cache chunks were lost. Social can be repopulated pretty much instantly with:

Update-SPRepopulateMicroblogLMTCache

Update-SPRepopulateMicroblogFeedCache
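
Both cmdlets need some parameters to actually run. Here’s a rough sketch, assuming (as far as I know) that each takes a -ProfileServiceApplicationProxy and that the feed cache one also takes an -AccountName for the user whose feed should be rebuilt. Update-SocialFeedCache is a hypothetical wrapper name, and the TypeName filter used to find the proxy is an assumption to check against your own farm.

```powershell
# Hypothetical wrapper: repopulates the microblog caches for one account.
# Assumes the SharePoint cmdlets are loaded (e.g. SharePoint Management Shell).
function Update-SocialFeedCache {
    [CmdletBinding(SupportsShouldProcess)]
    param(
        # Account whose feed cache should be rebuilt, e.g. 'CONTOSO\someuser' (made-up example).
        [Parameter(Mandatory)]
        [string]$AccountName
    )

    if (-not $PSCmdlet.ShouldProcess($AccountName, 'Repopulate microblog caches')) {
        return
    }

    # Assumption: the User Profile proxy can be found by its TypeName.
    $proxy = Get-SPServiceApplicationProxy |
        Where-Object { $_.TypeName -eq 'User Profile Service Application Proxy' } |
        Select-Object -First 1

    Update-SPRepopulateMicroblogLMTCache -ProfileServiceApplicationProxy $proxy
    Update-SPRepopulateMicroblogFeedCache -ProfileServiceApplicationProxy $proxy -AccountName $AccountName
}

# Dry run:
# Update-SocialFeedCache -AccountName 'CONTOSO\someuser' -WhatIf
```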

Conclusion: temporarily losing cache data isn’t, by definition, a big drama in SharePoint-land normally, assuming AppFabric is working under normal conditions at least.

That’s it for now. I know some may be asking why the official guidelines on graceful shutdowns don’t work by default; let’s just say we’re looking at it.

In the meantime though, this workaround will work nicely if you absolutely need AppFabric running smoothly for whatever reason. I hope this has helped; feedback is always welcome!

Cheers,

// Sam Betts

Comments (9)

  1. Mafra says:

    Thanks for you great post!

    One of our customer's Solution is heavily based on those newsfeed on their personal sites, and every time we had to restart one of the cache host for any reason, it was a big problem, specially when some users "lost" entries there.

    Once again, great post and thanks a lot!

  2. Andy says:

    So, in fact, does that imply that SharePoint Server Farms consisting of solely 1 SQL, 1 SP Server, are no more viable if SP-Social Features are needed?

    I really Need a definit answer here.

    Kind regards.

  3. Hi Andy,

    Single servers with SQL + all SP roles aren't supported for production 2013 farms, it's now a minimum of SQL + App + WFE setup (if you want support from us anyway). Testing environments are fine with all-in-one servers – technet.microsoft.com/…/cc262485.aspx.

    Aside from that; I'd recommend running AppFabric even if you don't need the social functionality. There's lots of other stuff in there that the core product uses throughout.

    // Sam

  4. Kapil says:

    Sam, this is extremely helpful post. I have been using "graceful" switch with remove-spdistributedcacheserviceinstance  command but was never sure if it completely transferred social feed cache to another host or not. I simply waited for 15 minutes then bounced the server. We do not have a lot of users on social feed but knowing we will not loose their social activity feed on every maintenance window, gives confidence to encourage users to use it.

    Big thanks again!

  5. VersatileMike says:

    Sam,

    What if you only have one cache host? Everything I have read says not to run distributed cache on a server running search. I have SQL + App + WFE and currently only the WFE is running distributed cache because the App server is running search.

    Thank You,

    VersatileMike

  6. You can run AppFabric/DistCache on the search machine but obviously there'll be more contention for resources. If you need AF/DC to stay up then you'll have to add it there or on another new server in the farm.

  7. VersatileMike says:

    Thanks for the reply.

    Will I lose data if I just stop the AF/DC on the WFE and reboot?

    I guess I could add the APP server to the farm. Stop the WFE service, reboot. Re-add the WFE then remove the APP server from the farm.

  8. tbh I'd just start AF on both machines or get another if you're really worried about it. As mentioned in the post, unless you're using SP social you'll probably not notice much has happened assuming one machine is still up; clean shutdown or not.

  9. Bogdan Grozoiu says:

    Hi Stefan,

    I wrote this PS for checking how to gracefully shutdown a dist cache machine and what to do before removing it.

    blogs.technet.com/…/how-to-run-a-graceful-shutdown-of-distributed-cache-before-restarting-the-server-with-admin-rights.aspx

    MfG,

    BogdanG
