AppFabric Caching (and SharePoint): Configuration and Deployment (Part 2)

Part 1: AppFabric Caching and SharePoint: Concepts and Examples
Part 2: AppFabric Caching (and SharePoint): Configuration and Deployment

The Distributed Cache Service

SharePoint’s distributed caches are hosted and maintained by the Distributed Cache Service, itself a thin wrapper over a Windows Server AppFabric cluster. Understanding SharePoint’s Distributed Cache Service requires an understanding of AppFabric together with a few details on SharePoint’s implementation. Most of the details here apply equally to non-SharePoint AppFabric clusters; just view SharePoint as a particular implementation example.

AppFabric Physical Architecture

Key details of AppFabric’s physical architecture are described here and represented in this diagram (from the same location).

In SharePoint’s implementation, web and service applications are the “Cache-enabled application servers (cache clients).” The “Cache Servers” are SharePoint servers where the Distributed Cache Service Instance has been installed and enabled, and the “Cluster configuration storage location” is the CacheClusterConfig table in the SharePoint Configuration Database.

The CacheClusterConfig table in the Configuration Database stores configuration items as typed Key/Value pairs using a custom ICustomProvider implementation. The original values are XML snippets describing caches, cache hosts, and other properties. They are serialized and converted into byte arrays for storage, but most data can be deserialized and viewed using the Export-AFCacheClusterConfiguration cmdlet.

AppFabric Cache Hosts in SharePoint

Management of Distributed Cache Service Instances (AppFabric Cache Hosts) in SharePoint is different from management of most SharePoint service instances. Most service instances always remain installed on servers in the farm, whether online or not. These service instances are like Windows services, which are always installed on a server whether they’ve been enabled or not. For example, the User Profile Sync Service Instance is typically only online and running on one server in the farm, but it’s installed (and disabled) on all servers. To see a list of all service instances installed on a given server, both online and disabled, run the Get-SPServiceInstance cmdlet, using the -Server parameter to limit results to a particular server. The Services on Server page in Central Administration displays the same information.

Unlike other service instances, though, the Distributed Cache Service Instance should either be installed *and* online on a SharePoint server, or not installed at all. If the service instance is stopped (disabled) but not uninstalled, details about the associated Cache Host stay in the Cache Cluster Config store, which can cause problems.

For this reason, the Distributed Cache Service Instance should never be stopped via the Services on Server page in Central Administration or via Stop-SPServiceInstance in PowerShell. A special cmdlet, Remove-SPDistributedCacheServiceInstance, is available to stop *and* uninstall the local Distributed Cache Service Instance from a SharePoint server. This cmdlet, and its complement Add-SPDistributedCacheServiceInstance, should be used instead of Stop- and Start-SPServiceInstance for managing the local Distributed Cache Service Instance.

By default, the Distributed Cache Service Instance is installed on every SharePoint server when it’s joined to a farm. If you prefer to not install the Distributed Cache Service Instance at join time, you can specify -SkipRegisterAsDistributedCacheHost when running the Connect-SPConfigurationDatabase or New-SPConfigurationDatabase cmdlets. Note that at least one server must be running the Distributed Cache Service Instance for the farm to function properly.

For all AppFabric clusters, a simple command for listing all known cache hosts in the cluster is Get-AFCacheHostStatus. Don’t forget to run Connect-AFCacheClusterConfiguration before running other AppFabric cmdlets. To retrieve a list of SharePoint servers running the Distributed Cache Service Instance, you can run the following PowerShell command:

PS:> Get-SPServer | ? {($_.ServiceInstances | % TypeName) -contains 'Distributed Cache'} | % Address

We’ll discuss more details about cache hosts soon, but first let’s discuss AppFabric’s logical infrastructure.

AppFabric Logical Infrastructure

Details regarding the logical infrastructure of an AppFabric cluster are provided here and in this diagram (from the same location):

The basic logical entity in an AppFabric Cache Cluster is a Named Cache (frequently just called a Cache). A Named Cache is a container for cached objects. The ten SharePoint caches listed before are each AppFabric Named Caches. As illustrated here, Named Caches span all hosts in the cluster, distributing items for storage across allocated memory on all servers; however, cached items within a Named Cache are stored only once (by default). This is an important consideration when planning cache infrastructure, so let’s spell it out again: by default (and in SharePoint), cached items in an AppFabric Named Cache are stored only once across the entire cluster. If the cache host storing that cached item crashes or is shut down ungracefully, that item is no longer available in the cache.
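To make the single-copy behavior concrete, here is a small, self-contained PowerShell model. It is only a sketch: the host names, item keys, and the placement rule (sum of character codes modulo the host count) are invented for illustration and are not AppFabric's actual partitioning scheme.

```powershell
# Toy model of a partitioned Named Cache with Secondaries = 0.
# Host names, keys, and the placement rule are hypothetical.
$cacheHosts = 'HOST-A', 'HOST-B', 'HOST-C'

function Get-OwningHost {
    param([string]$Key, [string[]]$CacheHosts)
    # Deterministic stand-in for partitioning: sum of character codes mod host count.
    $sum = 0
    foreach ($ch in $Key.ToCharArray()) { $sum += [int]$ch }
    $CacheHosts[$sum % $CacheHosts.Count]
}

# Place nine items; each key maps to exactly one host (no replicas).
$placement = foreach ($i in 1..9) {
    $key = "item$i"
    [pscustomobject]@{
        Key       = $key
        CacheHost = Get-OwningHost -Key $key -CacheHosts $cacheHosts
    }
}

# Simulate an ungraceful loss of one host: its items are simply gone.
$failedHost = 'HOST-B'
$lost      = @($placement | Where-Object { $_.CacheHost -eq $failedHost })
$surviving = @($placement | Where-Object { $_.CacheHost -ne $failedHost })

"Lost {0} of {1} items when {2} failed" -f $lost.Count, @($placement).Count, $failedHost
```

With these made-up keys, three of the nine items live on HOST-B and disappear with it. Nothing on the surviving hosts can restore them, which is the behavior to plan around.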

In the previous paragraph we introduced cached items. Items are often collected and stored in a Region, which is a sub-collection of cached items within a specific Named Cache. Storing items in a shared region can make retrieval of the entire related collection easier. However, a region and all of the items in it are stored together on a single cache host and, like individual cached items, exist only once in the cluster by default (and in SharePoint). So if the server hosting the region is lost, all items in the region are lost with it.

Both individual cached items and regions can co-exist in the same named cache. I believe that in SharePoint all cached items are contained within regions.

To list all named caches and regions in a cluster, run Get-AFCache | Format-Table -AutoSize. To list only the cache names, run Get-AFCache | Format-Table CacheName.

With the infrastructure described, let’s dive more deeply into configuration of caches and cache hosts.

Cache Configuration Details

As we consider configuration details for caches and cache hosts, our conversation will be dominated by memory management, resiliency, and availability issues. We’ll discuss expiration and eviction of items, throttling of requests, and redundant storage. As before, we’ll discuss concepts in general and use SharePoint as a specific example.

Later, we’ll discuss configuration of cache hosts, but first, let’s begin by discussing configuration of individual caches.

To view common configuration details for individual caches, run the following command:

PS:> Get-AFCache | % {Get-AFCacheConfiguration -CacheName $_.CacheName}

Output for the ViewState cache is displayed here:

CacheName : DistributedViewStateCache_f3bd4763-f482-4bb8-a5a5-f40806460bdd
TimeToLive : 10 mins
CacheType : Partitioned
Secondaries : 0
MinSecondaries : 0
IsExpirable : True
EvictionType : LRU
NotificationsEnabled : False
WriteBehindEnabled : False
WriteBehindInterval : 300
WriteBehindRetryInterval : 60
WriteBehindRetryCount : -1
ReadThroughEnabled : False
ProviderType :
ProviderSettings : {}

The meaning of each of these properties is as follows.

  • CacheName: The internal name of the cache.
  • TimeToLive: The default time span until expiry for cached items. Note that this can be overridden for any individual item, and in SharePoint different standard TTLs are used for items in each named cache, as shown in the table below. Time until expiry has an important impact on eviction and memory management, which is discussed in the section on cache host configuration.
    Cache Name                     TTL          Configuration Location
    ActivityFeed                   168 hours    UserProfileApplication.FeedCacheTTLHours
    ActivityFeedLMT                168 hours    UserProfileApplication.FeedCacheLastModifiedTimeTtlDeltaHours
    LogonToken                     10 hours     SPSecurityTokenServiceManager.WindowsTokenLifetime
    ServerToAppServerAccessToken   24 hours     (hard coded)
    ViewState                      31 minutes   SPWebApplication.FormDigestSettings.Timeout + 1
    Search
    SecurityTrimming
    Default
    Access                         1 hour       (hard coded)
    Bouncer                        1 hour       (hard coded)
  • CacheType: How data is stored in the cache’s storage medium. Partitioned is the only option.
  • Secondaries: How many additional replicas of cached data are to be stored. Additional replicas provide redundancy and resiliency for cached items, as replicas are always stored on a different storage node. This is always 0 in SharePoint, where high availability is not currently supported.
  • MinSecondaries: The minimum number of secondaries which must be online to allow writing to the cache. By default, it’s the same as the number of secondaries configured for the cache. Always 0 in SharePoint, where there are no secondaries.
  • IsExpirable: Whether items in the cache are to be evicted after their TTL passes. Always True for SharePoint caches.
  • EvictionType: The algorithm used to evict non-expired items when a cache’s high watermark is passed. Can be set to LRU (Least Recently Used) or None. For most caches in SharePoint, this is set to LRU. For the ActivityFeedLMT cache, this is set to None. See the section on cache host configuration for details on eviction.
  • NotificationsEnabled: Whether the cache will notify subscribers when cached items are changed or deleted. Always False in SharePoint caches.
  • Read-Through and Write-Behind properties: The remaining properties specify details on Read-Through and Write-Behind for the cache. For details on RTWB concepts, see this MSDN article. SharePoint caches don’t utilize RTWB.

Most SharePoint caches have the same configuration. The exception is the ActivityFeedLMT cache, which has an EvictionType of None. To confirm this, run the following command:

PS:> Get-AFCache | % {Get-AFCacheConfiguration -CacheName $_.CacheName} | Format-Table CacheName, EvictionType

Note that there are no quota-related properties specified at the cache level for SharePoint caches, or by default for any caches.

You may want to know how many items are stored and how much memory is in use for individual caches. For specific stats about each cache, run this command:

PS:> Get-AFCache | % {
    $CacheName = $_.CacheName
    # Attach the cache name to its statistics so each record is self-describing
    Get-AFCacheStatistics -CacheName $CacheName | Add-Member -MemberType NoteProperty -Name 'CacheName' -Value $CacheName -PassThru
}

I’ve added a little formatting and cleanup to get detailed information about each cache together with its name. You could pipe the output from this command to Export-Csv to create a short report. Typical output is shown here:

CacheName : DistributedLogonTokenCache_f3bd4763-f482-4bb8-a5a5-f40806460bdd
Size : 36864
ItemCount : 6
RegionCount : 6
RequestCount : 55
ReadRequestCount : 26
WriteRequestCount : 14
MissCount : 32
IncomingBandwidth : 96924
OutgoingBandwidth : 880

Having illustrated key elements on configuring individual caches, let’s move on to configuring cache hosts.

Cache Host Configuration Details

To retrieve configuration details for cache hosts, run the following command, which retrieves information about each host currently in the cluster:

PS:> Get-AFCacheHostStatus | % {
    $Status = $_.Status
    # Look up each host's configuration and attach the status reported above
    Get-AFCacheHostConfiguration -ComputerName $_.HostName -CachePort $_.PortNo |
        Add-Member -MemberType NoteProperty -Name 'Status' -Value $Status -PassThru
} | Format-List -Property *

I start this command with Get-AFCacheHostStatus since it returns all hosts in the cluster without further parameters necessary, unlike Get-AFCacheHostConfiguration. Note however that Get-AFCacheHostStatus attempts to ping each host in the cluster in order to report on status, and the timeout for this ping is 10 seconds. For a faster version of this command, at least for SharePoint servers, try this:

PS:> $SPDCServers = Get-SPServer | ? {($_.ServiceInstances | % TypeName) -contains 'Distributed Cache'} | % Address
PS:> $SPDCServers | % {Get-AFCacheHostConfiguration -ComputerName $_ -CachePort 22233}

Of course, you won’t get a Status without Get-AFCacheHostStatus.

Output from the first command will look like the following:

Status : Up
HostName : SERVER09.gavant.local
ClusterPort : 22234
CachePort : 22233
ArbitrationPort : 22235
ReplicationPort : 22236
Size : 600
ServiceName : AppFabricCachingService
HighWatermark : 99
LowWatermark : 90
IsLeadHost : True

Let’s describe the meaning of each of these properties:

  • Status: If the host responds to a standard ICMP ping within 10 seconds, this reports the status of the AppFabric service on that server. May be: {Up, Down, Starting, Stopping, ShuttingDown, Unknown}. This is the output from Get-AFCacheHostStatus.
  • HostName and ServiceName: Host and service name.
  • CachePort: The main port for public (external) communication with the cache host and cluster. Must be open to incoming client traffic.
  • ClusterPort, ArbitrationPort, and ReplicationPort: Used for internal data management communication amongst the hosts in the cluster. Must be open between servers.
  • Size: The amount of memory in MB to be allocated for live cached items. Note that actual memory used by the process will be significantly greater than this amount, as discussed later.
  • LowWatermark: Percentage of memory usage (relative to Size) at which *expired* items are removed (evicted) from cache if expiration is enabled.
  • HighWatermark: Percentage of memory usage (relative to Size) at which *all* items may be removed (evicted) from cache if eviction is enabled.
  • IsLeadHost: Whether this host is a lead host for cluster management. Not used in SharePoint.

Now that we’ve briefly discussed each of these properties, let’s dive deeper into their implications.

Eviction, Expiration, and Watermarks

The time has come to explain eviction, expiration, and watermarks. In a nutshell, the goal of the AppFabric service is to keep the memory a cache host uses to store cached items between the low watermark and high watermark configured for that cache host. Once per second (by default), current memory usage for the host is checked against the Size and the computed high and low watermark values for the host; based on the results, the following memory management algorithm is applied.

  • Low Watermark not yet reached. No items are removed from the cache, even if expired.
  • Low Watermark reached, High Watermark not reached. Expired items are evicted, but non-expired items are not.
  • High Watermark reached. Expired and non-expired items are evicted until low watermark is reached.

This is shown graphically here:

If and when less than 15% (by default) of server memory remains, an eviction run is initiated regardless of the local cache host’s watermark and size settings. That is, even though the host is not using all of its allowed memory, if available memory on the server is below 15% of the total physical memory, a full eviction run will begin, as if the high watermark had been passed.

Note that caches specify whether they will be subject to expiration and eviction via the IsExpirable and EvictionType properties, as discussed in the previous section. If either of these excludes the cache from eviction, corresponding cached items will not be removed. For example, if IsExpirable is True and EvictionType is None (as it is for the Activity Feed LMT cache in SharePoint), expired items will be removed once the cache reaches its low watermark, but non-expired items will never be removed. As a result, if no items in the cache are expired, nothing will be removed.

If throttling is enabled (see below), the cache would eventually be write-throttled and no further items would be added until some items expired or were removed. However, throttling is not enabled by default in AppFabric for Windows Server and *I believe* that in this case the cache will continue to grow beyond its allotted size, governed only by the algorithm described above. Test and consider this when planning your own cache and host configurations.
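The rules in this section can be sketched as a small decision function. This is only a model of the behavior described above, not AppFabric's internal implementation: the function name, parameter names, and the choice to treat "reached" as "greater than or equal to" are assumptions. The default watermarks (90 and 99) are taken from the sample host output shown earlier.

```powershell
# Illustrative model only; function and parameter names are hypothetical,
# and this is not AppFabric's internal implementation.
function Get-EvictionDecision {
    param(
        [double]$UsedMB,                         # memory currently used by cached items
        [double]$SizeMB,                         # cache host Size setting
        [double]$LowWatermark  = 90,             # percent, as in the sample host output
        [double]$HighWatermark = 99,             # percent, as in the sample host output
        [double]$ServerFreeMemoryPercent = 100,  # free physical memory on the server
        [bool]$IsExpirable = $true,              # cache-level setting
        [string]$EvictionType = 'LRU'            # cache-level setting: LRU or None
    )

    $usedPercent    = 100 * $UsedMB / $SizeMB
    $memoryPressure = $ServerFreeMemoryPercent -lt 15   # default threshold from the text

    $removeExpired    = $false
    $removeNonExpired = $false

    if ($usedPercent -ge $HighWatermark -or $memoryPressure) {
        # Full eviction run: expired items go, and non-expired ones too if the cache allows it.
        $removeExpired    = $IsExpirable
        $removeNonExpired = ($EvictionType -eq 'LRU')
    }
    elseif ($usedPercent -ge $LowWatermark) {
        # Between the watermarks: only expired items are removed.
        $removeExpired = $IsExpirable
    }

    [pscustomobject]@{
        UsedPercent      = [math]::Round($usedPercent, 1)
        RemoveExpired    = $removeExpired
        RemoveNonExpired = $removeNonExpired
    }
}

# A 600 MB host with 560 MB used (about 93%) sits between the watermarks.
Get-EvictionDecision -UsedMB 560 -SizeMB 600

# An ActivityFeedLMT-style cache (EvictionType None) above the high watermark.
Get-EvictionDecision -UsedMB 598 -SizeMB 600 -EvictionType 'None'
```

In the second call only expired items are candidates for removal, which matches the caveat above: if nothing in such a cache has expired, nothing is removed.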

Cache Statistics

Since the current amount of memory in use by the cache is so important, you’ll be interested in the commands which return a snapshot of current usage. As before, you have a couple options for retrieving information about all servers in the cluster: one uses Get-AFCacheHostStatus to iterate through all hosts, and the other finds all configured service instances in SharePoint. They are as follows:

PS:> Get-AFCacheHostStatus | % {
    $ServerName = $_.HostName
    Get-AFCacheStatistics -ComputerName $_.HostName -CachePort $_.PortNo | Add-Member -MemberType NoteProperty -Name 'ServerName' -Value $ServerName -PassThru
} | Format-List -Property *

PS:> $SPDCServers = Get-SPServer | ? {($_.ServiceInstances | % TypeName) -contains 'Distributed Cache'} | % Address
PS:> $SPDCServers | % {
    $ServerName = $_
    Get-AFCacheStatistics -ComputerName $_ -CachePort 22233 | Add-Member -MemberType NoteProperty -Name 'ServerName' -Value $ServerName -PassThru
}

And typical output looks like this:

ServerName : SERVER09
Size : 51200
ItemCount : 18
RegionCount : 13
NamedCacheCount : 10
RequestCount : 393
MissCount : 66

Here, the SERVER09 cache host has allocated 51200 bytes (50KB) to cached items. Since the allowed size for this cache host is much higher than 50KB, neither expiration nor eviction is necessary on this server, assuming at least 15% of the server’s physical memory is free.
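As a quick sanity check on those numbers, the used size (reported in bytes) can be compared with the host's allowed size (configured in MB). The helper below is a hypothetical sketch, not an AppFabric cmdlet; it assumes the 600 MB Size shown in the earlier sample host configuration.

```powershell
# Hypothetical helper: compares the Size reported by Get-AFCacheStatistics (bytes)
# with the Size from the cache host configuration (MB).
function Get-CacheHostUsagePercent {
    param([long]$UsedBytes, [long]$HostSizeMB)
    $hostSizeBytes = $HostSizeMB * 1MB      # 1MB is PowerShell's literal for 1,048,576
    [math]::Round(100 * $UsedBytes / $hostSizeBytes, 4)
}

$usedKB  = 51200 / 1KB                      # 50
$percent = Get-CacheHostUsagePercent -UsedBytes 51200 -HostSizeMB 600
"{0} KB used, {1}% of the host's allowed size" -f $usedKB, $percent
```

The result is a tiny fraction of one percent, far below the 90% low watermark in the sample configuration, so neither expiration nor eviction would be triggered by cache usage alone.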

Throttling

According to this MSDN article, AppFabric for Windows Server is subject to throttling based on percentage of server memory in use and percentage of AppFabric service memory in use. However, in my investigations I’ve found that throttling is disabled by default in AppFabric for Windows Server, and SharePoint does not change the defaults. This means that even if no memory from the cache host’s allowed size remains, or even if no memory on the entire server remains, the system will continue to try to serve requests and allocate memory for cached items.

I’ll continue to investigate Throttling and update this section if I find other information. Please let me know if your testing reveals different behavior than I’ve described.

With all this discussion of memory algorithms and calculations, you won’t be surprised to learn that memory overcommitment schemes employed by virtualization hypervisors, such as Hyper-V’s dynamic memory, are not supported with SharePoint, and not recommended with AppFabric in general.

SharePoint Cache Service Details

Now that we’ve explained most aspects of AppFabric cache and cache host configuration, let’s explore some of the defaults used for SharePoint cache hosts, and some recommendations.

Cache Host Size for SharePoint Hosts

There are two points during setup of a Distributed Cache Service Instance on a SharePoint server when the local cache host configuration comes into play: at service installation (e.g. during Farm Join or when running Add-SPDistributedCacheServiceInstance) and at service provisioning (e.g. immediately following service installation, or when calling Start-SPServiceInstance).

At service installation time, the Size property for the local cache host is set to 5% of the total physical memory of the host. For example, if 16GB of physical RAM are installed on the host at installation time, the size of the local cache host will be set to roughly 800MB (5% of 16,384MB is about 819MB). Note that this value could be different on each SharePoint server if they have different amounts of physical RAM at installation time. Note also that this value won’t automatically change if the amount of physical RAM allocated to the server changes.

At service provisioning time, SharePoint checks that the amount of available physical memory in the server is at least 100MB more than the allowed size for the cache host. So if, as in the above example, the cache host size is set to 800MB, at least 900MB of physical RAM must be available at provisioning time, or the service will fail to start.
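The two checks described above can be modeled with a little arithmetic. The helpers below are hypothetical sketches, not SharePoint's own code, and rounding down to a whole number of MB is an assumption. Note that 5% of 16GB works out to about 819MB rather than a round 800MB, so in this model the provisioning threshold for such a server lands near 919MB.

```powershell
# Hypothetical helpers modeling the defaults described above; not SharePoint's own code.
function Get-DefaultCacheHostSizeMB {
    param([double]$PhysicalMemoryGB)
    # 5% of total physical memory, expressed in MB (rounding down is an assumption).
    [math]::Floor($PhysicalMemoryGB * 1024 * 0.05)
}

function Test-CacheHostCanProvision {
    param([double]$CacheSizeMB, [double]$AvailableMemoryMB)
    # The text describes a requirement of at least 100 MB beyond the cache host size.
    $AvailableMemoryMB -ge ($CacheSizeMB + 100)
}

$size = Get-DefaultCacheHostSizeMB -PhysicalMemoryGB 16
"Default cache host size: $size MB"

Test-CacheHostCanProvision -CacheSizeMB $size -AvailableMemoryMB 850    # too little free memory
Test-CacheHostCanProvision -CacheSizeMB $size -AvailableMemoryMB 2048   # comfortably enough
```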

These default values may not be appropriate for your environment. In the next section we’ll discuss options for changing them.

Changing Cache Host Size for SharePoint

There are two options for modifying cache host size for SharePoint AppFabric servers. Both require shutting down the entire cluster (all cache hosts).

The first utilizes AppFabric’s own PowerShell cmdlets. First, run Stop-AFCacheCluster to shut down all hosts in the cluster, then on each cache host run Set-AFCacheHostConfiguration -CacheSize <NewSizeInMB> to specify a new cache size. Don’t forget to run Start-AFCacheCluster to restart all hosts in the cluster. You can also specify different high and low watermarks with the Set-AFCacheHostConfiguration cmdlet.

Advantages of the native AppFabric cmdlet approach are that 1) all hosts in the cluster are stopped immediately (not gracefully), 2) you can specify different cache sizes for each host, and 3) you can configure low and high watermarks if necessary.

The second approach is to use SharePoint’s Update-SPDistributedCacheSize cmdlet. This takes only one parameter, -CacheSizeInMB. It shuts down all hosts in the cache cluster, updates the cache size for each of them, then restarts them all.

Disadvantages of the SharePoint cmdlet are that it 1) shuts down all but the last host in the cluster gracefully and 2) sets all cache hosts to the same size. Graceful shutdown takes much longer than immediate shutdown, since all cached items must first be moved to a different host. Yet since the entire cluster is to be shut down, graceful shutdown is not helpful here.

Hopefully, Microsoft will address the graceful shutdown issue in the future, and setting all servers to the same memory size may be appropriate in your farm. If these items aren’t concerns for you, I’d recommend using the SharePoint-specific cmdlet, as this will always command more respect if and when you need support.

Whether you use AppFabric or SharePoint cmdlets to modify cache host size, note that if you uninstall and reinstall the Distributed Cache Service Instance on a server (i.e. by running Remove-SPDistributedCacheServiceInstance and then Add-SPDistributedCacheServiceInstance) the cache host size will be reset to the default (5% of physical memory at time of installation). If removing and adding the cache service instance is part of your maintenance cycles, make sure to also modify the cache size afterwards if needed.

We’ve discussed many concepts relevant to AppFabric infrastructure and planning. Now let’s focus on a couple specific applied details.

Planning Server Memory and Cache Host Size

A key detail in AppFabric Cache Service planning is that the actual memory usage of the DistributedCacheService process will be significantly larger than the size allocated in the cache host configuration. This MSDN article states that at least twice the amount of memory specified for the cache host will be used by the process due to memory management algorithms. Run this command for a quick review of how much memory the process is actually consuming:

PS:> Get-Process DistributedCacheService | fl *64*

At the time I ran this command, Get-AFCacheStatistics reported my cache host size as 9216 bytes. The process’s Working Set, however, was 584118272 bytes (almost 600MB). This is of course much more than twice the used amount of RAM; my assumption is that there is always a base level of memory overhead no matter how small the actual caches may be.

The key takeaway here is that if you have a certain amount of memory to allocate for AppFabric, the size specified for the host configuration should be half of that. For example, if I intend to allocate 16GB of physical memory for AppFabric, the size specified in the host configuration should be 8GB.

Also in that article, it’s recommended not to allocate more than 16GB for the AppFabric server (and corresponding 8GB for the cache host configuration). If the cache host’s size is larger than 16GB/8GB, garbage collection could take long enough to cause a noticeable interruption for clients.

A common recommendation is to spec AppFabric servers with 16GB of physical RAM, and set the cache host size to 7GB. With this arrangement, you can expect about 14GB to be used by the AppFabric process, leaving 2GB for other server processes on the host.
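These rules of thumb reduce to simple arithmetic: subtract what other processes need, halve the remainder, and cap the result. The function below is a hypothetical planning aid under those assumptions; its name, parameters, and defaults (2GB reserved, 8GB ceiling) are drawn from the figures in this section rather than from any Microsoft tool.

```powershell
# Hypothetical planning helper based on the rules of thumb above.
function Get-PlannedCacheHostSizeMB {
    param(
        [double]$ServerMemoryGB,
        [double]$ReservedForOtherProcessesGB = 2,   # left for the OS and other services
        [double]$MaxCacheSizeGB = 8                 # the 16GB process / 8GB cache ceiling
    )
    $budgetGB = $ServerMemoryGB - $ReservedForOtherProcessesGB
    if ($budgetGB -le 0) { throw 'No memory left for the cache service.' }

    # The process is expected to use about twice the configured cache size.
    $cacheGB = [math]::Min($budgetGB / 2, $MaxCacheSizeGB)
    [int]($cacheGB * 1024)
}

Get-PlannedCacheHostSizeMB -ServerMemoryGB 16   # 7168 MB, the 7GB figure above
Get-PlannedCacheHostSizeMB -ServerMemoryGB 32   # capped at 8192 MB
```

The value returned is in MB, the same unit the cache host Size setting uses.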

Let’s wrap up by discussing a couple additional considerations specifically relevant to SharePoint’s AppFabric implementation.

Other Considerations for AppFabric in SharePoint

High Availability

This one is easy - SharePoint (as of March 2013) does not provide any high availability for its caches. As briefly discussed above, this means that each item and region in SharePoint’s named caches exists only once across all the memory in the cluster. If the server where that item has been stored in memory is lost or shut down ungracefully, that cached item will be lost. As discussed at the very beginning of this post, this is generally not a problem for cached items because they are authoritatively stored elsewhere. Nevertheless, there are a couple things to keep in mind.

First, retrieving cached items all over again involves a performance hit, the very hit the caches are intended to help avoid. There could be interruptions and delays while the caches are being refilled. For example, if the ActivityFeed cache is lost, users may not see all recent updates in their Newsfeed, or may see the “We’re still gathering the news” message as the cache is repopulated.

For the ActivityFeed and ActivityFeedLMT caches, there are two PowerShell cmdlets to manually begin repopulation of the caches before users actually request data. These are Update-SPRepopulateMicroblogLMTCache and Update-SPRepopulateMicroblogFeedCache. In situations where maintenance leads to loss of these caches, plan to run these cmdlets immediately afterwards to repopulate data manually.

A second concern when cached data in SharePoint is lost is that some items in SharePoint are *only* stored in the cache; specifically, updates regarding followed documents are only stored in the cache (as of March 2013). If these cached items are lost, they cannot be regenerated and will no longer appear in users’ feeds.

To avoid losing items from the cache and/or having to retrieve them again, you can use the Stop-SPDistributedCacheServiceInstance cmdlet with the -Graceful switch. This will move all cached items from the local cache host to other cache hosts in the cluster. For this to be effective, there must be space on the other servers to accommodate these items. Also note that if shutting down the entire cluster, such as to change the cache host size, there’s no way to avoid losing all of the caches and items. Plan accordingly.
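Before a graceful shutdown, it can help to estimate whether the remaining hosts have room for the items being moved. The sketch below is a rough, hypothetical check: the function name and input shape are invented, the sample figures are made up, and it ignores watermarks and how items are actually distributed. You would fill the input objects yourself from per-host statistics (used bytes) and host configuration (Size in MB).

```powershell
# Hypothetical capacity check; the input objects mimic per-host figures gathered by hand.
function Test-GracefulShutdownCapacity {
    param(
        [object[]]$CacheHosts,     # each with HostName, UsedBytes, SizeMB
        [string]$HostToStop
    )
    $leaving = $CacheHosts | Where-Object { $_.HostName -eq $HostToStop }
    if (-not $leaving) { throw "Host '$HostToStop' not found." }
    $remaining = @($CacheHosts | Where-Object { $_.HostName -ne $HostToStop })

    # Sum the unused portion of each remaining host's allowed size.
    $freeBytes = 0
    foreach ($h in $remaining) { $freeBytes += ($h.SizeMB * 1MB) - $h.UsedBytes }

    [pscustomobject]@{
        BytesToMove           = $leaving.UsedBytes
        FreeBytesOnOtherHosts = $freeBytes
        Fits                  = ($remaining.Count -gt 0) -and ($leaving.UsedBytes -le $freeBytes)
    }
}

# Made-up cluster: SERVER02 is large and nearly full.
$clusterHosts = @(
    [pscustomobject]@{ HostName = 'SERVER01'; UsedBytes = 100MB;  SizeMB = 600  }
    [pscustomobject]@{ HostName = 'SERVER02'; UsedBytes = 1000MB; SizeMB = 1200 }
    [pscustomobject]@{ HostName = 'SERVER03'; UsedBytes = 550MB;  SizeMB = 600  }
)

Test-GracefulShutdownCapacity -CacheHosts $clusterHosts -HostToStop 'SERVER01'   # items fit elsewhere
Test-GracefulShutdownCapacity -CacheHosts $clusterHosts -HostToStop 'SERVER02'   # not enough room
```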

Other Caches in SharePoint’s Deployment

One last detail is that Microsoft has stated that additional named caches should not be deployed to the SharePoint AppFabric cluster (i.e. by using the New-AFCache cmdlet). If you need a cache for a custom solution, you’ll need to deploy a separate AppFabric cluster (or server) and create the cache there. Then point your solution at the external AppFabric cluster. There’s also no supported way to add your own cached items to SharePoint’s named caches.

Conclusion

This concludes our presentation on AppFabric and SharePoint’s Distributed Cache Service. I hope it provides both SharePoint and AppFabric administrators with a deeper understanding and greater ability to manage distributed cache clusters.