AppFabric Distributed Logon Token/ViewState Cache is Timing-Out - Continued

In a previous post I described how to troubleshoot AppFabric timeouts, a problem which many customers have seen, most acutely when SharePoint is under heavy load. One thing consistent across almost every case is that the main victims/culprits for timeouts are the view-state and logon token caches. Since writing the previous post I’ve refined which values suit these best if you’re getting these errors and don’t have time to work out more exact values for your environment. This post focuses on just these two caching areas.

In short, if you need this to work and don’t have time to tune the values incrementally, you should set some values much higher than the SharePoint defaults for these two cache containers. Normally I’d recommend increasing the values & measuring the impact before tuning again, but often that’s just not a realistic possibility, hence this post.

There are four settings that I’ve found have the most impact on timeouts, and most of these can be cranked way past their SharePoint defaults.

Why Are SharePoint/AppFabric Default Values So Low?

This question is often asked, quite legitimately in my opinion. There’s no quick answer but here we go anyway.

Normally AppFabric is used as a standalone product for custom-coded applications which connect to a remote AppFabric cluster on separate machines. SharePoint, on the other hand, (mostly) sits side-by-side with AppFabric, so my guess (and it’s never been confirmed) is that the thinking was that such high defaults would never be needed, as most requests wouldn’t even hit the network.

Also, in a perfect world, SharePoint servers are perfectly dimensioned with enough CPU, memory, and network bandwidth for the users & code being run in all circumstances. In the real world, however, said servers are often working much harder than they should be, for more users than they were sized for, and one of the first things to start dying is the distributed (non-critical) cache.

What Are The Risks of Increasing the Values?

Very few. Normally AppFabric is set up as a shared service, often between completely unrelated applications, and these values are designed to make sure no single “client” can cause performance issues by flooding AppFabric with too much work, too quickly. Also, and to a lesser extent, AppFabric is a cache, meaning “fast access” – if the cache becomes slower than the normal way of grabbing the data then the entire point of using it is lost.

SharePoint has its very own AppFabric service instance so the worst thing that could happen is SharePoint’s AppFabric is overloaded by SharePoint itself & other parts of SharePoint functionality begin to suffer.

But that said, 95% of timeouts & load issues are because 95% of the usage is by view-state and/or logon cache. Therefore these containers should be configured with the most leeway for when things are under stress.

View-State & Logon Cache Client Defaults

Here are said key configuration values: the documented descriptions, the SharePoint defaults, and what I’d recommend setting the values to if you’re getting timeouts, based on our very own AppFabric (standalone) documentation:

MaxConnectionsToServer

“Specifies the maximum number of channels to open to the cache cluster”.

SharePoint default is 8 max connections.

As for what you set it to, make it no more than 100 (the AppFabric default is just 1), but this really depends on how far your hardware & network can scale. I’d happily go to 20 connections without too much worry.

Note: there are apparently circumstances where this can exhaust the thread-pool in AF – avoid setting this too high, and change this setting last. For this reason I’ve removed this change from the scripts below.
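If you do decide to raise MaxConnectionsToServer as a last step, a minimal sketch follows. It assumes the SharePoint Management Shell and uses the logon token container purely as an example; the value of 20 is the conservative figure mentioned above, not a tested recommendation for your farm.

```powershell
# Sketch only: change this setting last, and only if the timeout changes alone were not enough.
$logonCacheSettings = Get-SPDistributedCacheClientSetting -ContainerType DistributedLogonTokenCache

# 20 is the cautious value discussed above; stay well below 100.
$logonCacheSettings.MaxConnectionsToServer = 20

Set-SPDistributedCacheClientSetting -ContainerType DistributedLogonTokenCache $logonCacheSettings
```

The same pattern applies to DistributedViewStateCache by swapping the container type in both cmdlet calls.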

ChannelOpenTimeout

“The length of time (milliseconds) that the cache client waits to establish a network connection with the server”.

SharePoint default is 100.

AppFabric documentation says 3000 is the recommended value (3 seconds). That sounds reasonable – a system under load can certainly take longer than 1/10th of a second, so 3000 is a good starting point.

ReceiveTimeout

“The length of time (milliseconds) to wait for a request before aborting the channel (milliseconds)”.

SharePoint default is a mere 20 milliseconds.

Interestingly, from just testing, it seems SharePoint web-front-ends will reset/re-open TCP connections at the same interval configured here, which can stack up extra traffic just from the extra TCP handshakes. Given that, plus the fact that this tends to be the limit that gets hit, I’d recommend going up to 60000 (60 seconds) for this setting.

RequestTimeout

“The length of time (milliseconds) that the cache client waits for a response from the server for each request”.

SharePoint default is again a mere 20 milliseconds.

I’d recommend upping this value to the AppFabric minimum recommendation of 1000 (1 second).
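Before changing anything, it can be useful to record what the two containers are currently set to, so you have something to roll back to. A minimal sketch, assuming it runs in the SharePoint Management Shell (or a session where the SharePoint snap-in can be loaded) and PowerShell 3.0 or later:

```powershell
# Load the SharePoint cmdlets if this is a plain PowerShell session (harmless if already loaded).
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$containers = 'DistributedViewStateCache', 'DistributedLogonTokenCache'

foreach ($container in $containers) {
    $settings = Get-SPDistributedCacheClientSetting -ContainerType $container

    # Emit one object per container with the four settings discussed above.
    [pscustomobject]@{
        Container              = $container
        MaxConnectionsToServer = $settings.MaxConnectionsToServer
        ChannelOpenTimeOut     = $settings.ChannelOpenTimeOut
        ReceiveTimeout         = $settings.ReceiveTimeout
        RequestTimeout         = $settings.RequestTimeout
    }
}
```

Running the same loop again after applying the changes is a quick way to confirm the new values were saved.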

View-State Timeout Script

This should take care of any view-state cache timeouts:

# Raise the view-state cache client timeouts (all values in milliseconds)
$viewStateCacheSettings = Get-SPDistributedCacheClientSetting -ContainerType DistributedViewStateCache
$viewStateCacheSettings.requestTimeout = 1000
$viewStateCacheSettings.receiveTimeout = 60000
$viewStateCacheSettings.channelOpenTimeOut = 3000
Set-SPDistributedCacheClientSetting -ContainerType DistributedViewStateCache $viewStateCacheSettings

Login Token Timeout Script

This should wrap up any timeouts with the logon token cache:

# Raise the logon token cache client timeouts (all values in milliseconds)
$logonCacheSettings = Get-SPDistributedCacheClientSetting -ContainerType DistributedLogonTokenCache
$logonCacheSettings.requestTimeout = 1000
$logonCacheSettings.receiveTimeout = 60000
$logonCacheSettings.channelOpenTimeOut = 3000
Set-SPDistributedCacheClientSetting -ContainerType DistributedLogonTokenCache $logonCacheSettings

Does AppFabric Need to be Restarted for Configuration Changes?

Yes and no; the short version is yes, to be sure. Restart the service on each node, taking care to do it cleanly of course.
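As a rough sketch of what a clean restart on one node could look like, assuming the SharePoint Management Shell is run locally on the cache host and that you work through the hosts one at a time. The graceful stop moves cached items to other hosts and can take a while under load, so treat this as an outline rather than a tested procedure:

```powershell
# Run on ONE cache host at a time, and let it finish before moving to the next.

# Gracefully stop the local Distributed Cache instance so cached items can move to other hosts.
Stop-SPDistributedCacheServiceInstance -Graceful

# Find the local Distributed Cache service instance and start it again.
$instanceName = 'SPDistributedCacheService Name=AppFabricCachingService'
$serviceInstance = Get-SPServiceInstance | Where-Object {
    $_.Service.ToString() -eq $instanceName -and $_.Server.Name -eq $env:COMPUTERNAME
}
$serviceInstance.Provision()
```

On a single-server farm there is nowhere for the cached items to go, so expect users to lose cached view-state and logon tokens for the duration of the restart.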

I hope this has helped clear up the timeout errors with AppFabric! Feedback on what’s missing is always welcome.

 

Cheers,

Sam Betts