Exception 'Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0017>:SubStatus<ES0006>'

As the title suggests, this post is about a specific issue I came across recently at a customer site in a new deployment of SharePoint 2013, relating to the Distributed Cache service. It was definitely one of the more challenging ones I have had to troubleshoot, so I figured I should capture the result here in case it helps someone else.

So here is the situation. We had a SharePoint farm with a number of web front end servers, and we had chosen to run the Distributed Cache service on 2 dedicated servers in the farm. In our initial testing we saw that the performance of SharePoint was well below what we would have expected for a farm of that scale, and we needed to get to the bottom of it. So in typical fashion I started with the developer dashboard to identify the slow-loading part of the page, and was able to see the problem was related to the SharePoint claims provider in the authentication validation part of the page. Turning to the ULS logs for more detail, we were seeing a lot of the below error (I've truncated the stack trace because it goes on a fair way).

 Unexpected Exception in SPDistributedCachePointerWrapper::InitializeDataCacheFactory for usage 'DistributedLogonTokenCache'
 - Exception 'Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0017>:SubStatus<ES0006>:There is a 
temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy 
network or servers. For on-premises cache clusters, also verify the following conditions. Ensure that security permission has 
been granted for this client account, and check that the AppFabric Caching Service is allowed through the firewall on all cache 
hosts. Also the MaxBufferSize on the server must be greater than or equal to the serialized object size sent from the client.) 
---> System.ServiceModel.CommunicationException: The socket connection was aborted. This could be caused by an error processing 
your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket timeout 
was '10675199.02:48:05.4775807'. ---> System.IO.IOException: The read operation failed, see inner exception. 
---> System.ServiceModel.CommunicationException: The socket connection was aborted. This could be caused by an error processing 
your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket 
timeout was '10675199.02:48:05.4775807'. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by 
the remote host 
 at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags) 
 at System.ServiceModel.Channels.SocketConnection.ReadCore(Byte[] buffer, Int32 offset, Int32 size, TimeSpan timeout, Boolean closing) 
 --- End of inner exception stack trace --- 
 at System.ServiceModel.Channels.SocketConnection.ReadCore(Byte[] buffer, Int32 offset, Int32 size, TimeSpan timeout, Boolean closing) 
 at System.ServiceModel.Channels.SocketConnection.Read(Byte[] buffer, Int32 offset, Int32 size, TimeSpan timeout) 
 at System.ServiceModel.Channels.ConnectionStream.Read(Byte[] buffer, Int32 offset, Int32 count) 
 at System.Net.FixedSizeReader.ReadPacket(Byte[] buffer, Int32 offset, Int32 count) 
 at System.Net.Security.NegotiateStream.StartFrameHeader(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest) 
 at System.Net.Security.NegotiateStream.StartReading(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest) 
 at System.Net.Security.NegotiateStream.ProcessRead(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest) 
 --- End of inner exception stack trace --- 
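
As an aside, if you want to pull these entries out of the ULS logs yourself rather than scrolling through them, Get-SPLogEvent can do the filtering for you. A rough sketch (the time window and the message filter are just what suited us, adjust to taste):

 # Load the SharePoint snap-in if running from a plain PowerShell console
 Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

 # Pull the last hour of ULS entries and keep only the distributed cache failures
 Get-SPLogEvent -StartTime (Get-Date).AddHours(-1) |
     Where-Object { $_.Message -like "*SPDistributedCachePointerWrapper*" } |
     Select-Object Timestamp, Area, Category, Message | Format-List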
 

This was happening on pretty much every request for content in SharePoint, so it represented a pretty big problem. Looking at the details in the exception, it appeared that the communication was getting from the WFE servers to the cache servers correctly, but the cache servers were actively refusing the connection - leaving SharePoint without a cache and causing the performance issue.
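Before pointing the finger anywhere in particular, it's worth confirming the cluster itself reports healthy and that the cache port is reachable from the WFEs. Something along these lines works as a sanity check (CACHE01 and the default cache port 22233 are placeholders for your own environment; Test-NetConnection needs PowerShell 4+, older hosts can use telnet instead):

 # On a cache host: check the state of the AppFabric cache cluster
 Import-Module DistributedCacheAdministration
 Use-CacheCluster
 Get-CacheHost    # every host should report a service status of UP

 # On a WFE: confirm the cache port is reachable
 Test-NetConnection -ComputerName CACHE01 -Port 22233

If both of those come back clean, a permissions problem becomes the more likely suspect - which is exactly where the next step took us.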

After doing some reading up on the expectations of the AppFabric service, validating it was configured correctly (and even recreating the cache from scratch), I was still seeing the problem. Spending some time talking with a couple of my colleagues put me onto the right path though - if the cache servers were actively refusing the connection there must be a reason for it, and permissions was the first thing that came to mind. So step one was to validate the permissions for the cache using the PowerShell command Get-CacheAllowedClientAccounts. This returned two group names, "WSS_WPG" and "WSS_ADMIN_WPG". A quick look at these groups showed that the SharePoint farm account was in there, and so was the account running my distributed cache service. The glaring omission though - the account that runs the application pool for my SharePoint sites.
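For anyone following along, the check looks something like this, run from an elevated prompt on one of the cache hosts:

 # List the accounts the cache cluster will accept connections from
 Import-Module DistributedCacheAdministration
 Use-CacheCluster
 Get-CacheAllowedClientAccounts    # returned WSS_WPG and WSS_ADMIN_WPG here

 # Then inspect who is actually in those local groups
 net localgroup WSS_WPG
 net localgroup WSS_ADMIN_WPG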

The solution: we took the account that runs the application pool for the SharePoint web applications and added it to the WSS_WPG local group on just the dedicated distributed cache servers. As soon as this was done the errors stopped in the ULS logs, and page load times went from over 6000ms to less than 200ms - a pretty big difference!
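For completeness, the actual fix was a one-liner on each of the dedicated cache servers (DOMAIN\SPAppPool is a placeholder for your own web application pool account):

 # Grant the web application pool account access to the cache cluster
 net localgroup WSS_WPG DOMAIN\SPAppPool /add

So there it is, hopefully that saves someone else a bit of time if you come across the same issue.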