Caching: What could go wrong?


There’s an old saying about regular expressions that I’ve always liked. I think it applies equally to caching. Here’s my version.

Some people, when confronted with a performance problem, think “I know, I’ll add a cache.” Now they have two problems.

It’s so tempting to address a performance problem by adding a cache. After all, classes like Dictionary are very easy to use. Caching as a solution to the performance/latency/throughput problems means there is more complexity, which will lead to more bugs. Bugs with caches can be subtle and difficult to debug, and bugs with caches can also cause live site outages.

We’ve had a number of live site incidents (LSIs) over last couple of years for which the root cause is a bug in caching. I asked one of our engineers to write a document with examples of the caching bugs that we’ve found in our code. I wanted to share it in the hopes that it will help folks think twice before solving a problem with a “simple” cache.

The case of the fixed size identity cache

When we wrote the Identity Management System (IMS) cache, our most recently released product was TFS 2008 – we were all trying to get through the multi-collection support (also known as Enterprise TFS Management (ETM)). We had some issues with IMS performance and we decided to cache the most commonly identities in memory to avoid round-trips to expensive stored procedures and we decided to use the MemoryCacheList<T> which is basically a dictionary with a clock replacement algorithm. Of course, one important thing to do when designing a cache is to figure out how big it may grow – unbounded caches are one source of out of memory errors (see case of the registry service). When we were creating the service, we picked a number that sounded good at the time for something where we expected to have a lot of hosts (accounts) in memory at once and that weren’t likely to have a huge number of group memberships: 1024.

           // set the cache value to 10000 for on-premise and 1024 for hosted.
           Int32 defaultValue = 1024;
           if (systemRequestContext.ExecutionEnvironment.IsOnPremisesDeployment)
           {
               defaultValue = 10000;
           }
           else if (systemRequestContext.ServiceHost.HostType.HasFlag(TeamFoundationHostType.Deployment))
           {
               defaultValue = 100000;
           }

The code was simple – here’s the Read case:

           lock (m_getParentsLock)
           {
               isMember = identityCache.IsMember(groupDescriptor, descriptor);
               if (forceCacheReload || isMember == null)
               {
                   IncrementCacheMissPerfCounters(performanceService);
                   Identity memberWithParents = ReadIdentityFromDatabase(
                       requestContext,
                       m_masterDomain,
                       descriptor,
                       QueryMembership.Expanded, [..]
               }
               else
               {
                   IncrementCacheHitPerfCounters(performanceService);
               }
           }

That looks quite reasonable. We’re locking the cache (because we don’t want it to change while we’re reading it). If we find the data, we return it, and if we don’t, we go to the database. Cache misses should be rare in steady state so we should quickly build up the cache and everything will be fine.

The issue is that if the host in question has more than 1024 groups memberships active at any given time (which isn’t hard to hit since you multiply active identities with active groups), the cache will start thrashing (i.e. adding/removing constantly). When that happens, it results in frequent database calls, which can be fast or slow depending on the health of the database. This means that the maximum throughput of this cache in that state is:

1/(database call duration) queries per second = 1 / (10ms ) = 100 queries/second

That’s assuming 10ms average database call – so any variation of performance of that call will cause the throughput to go down and if that throughput goes below the average use of the cache, requests will start queuing. That happened with an account that had a lot of groups, and when it did, it took that account down.

Lessons Learned

  • Fixed sized caches should have telemetry to show:
    Cache hit ratio
    Eviction Rate
    Average Size
  • Cache Misses should not block readers (as they do above) – any longer than the actual updating of the cache in memory (do not call the database while holding a lock)
  • Locking the entire cache isn’t usually desirable – if the underlying data is partitionable, the cache should be partitioned and the lock protecting it as well.

The case of the Registry Service Cache

The registry service also has its roots in the ETM work – we needed to support multiple machines serving as web front ends, and the Windows Registry, being local to a single machine, couldn’t satisfy that requirement. We decided to add a Windows Registry-like service whose data would be stored in SQL Server. We also added a notification system so that individual machines would get notified when a registry key/value changed. When we ported TFS to Azure, we realized that a lot of our SQL chatter was just reading registry settings that typically don’t change often which makes them prime candidates for caching.

There are a few difficulties associated with caching registry keys:

  • The registry is hierarchical – the API supports reading sub-trees like /Services/*
  • There can be a lot of keys under certain hives (like the user hive which stores per-user settings) that aren’t read often.

Version 1.0 of the Registry caching solved those issues by:

  • Using a SparseTree<T> which supports tree operations natively
  • Caching only certain roots

That caused two problems:

  • SparseTree<T> doesn’t perform well for some common registry access patterns because it does a lot of string operations, like splitting and reassembling strings, allocating a lot of memory in the process
  • We kept having to add more hives to the cached list as we found cache misses frequently

The Feature Availability service (aka feature flags), which makes extensive use of the Registry service, was causing a lot of high CPU incidents and we were under pressure to make a quick fix so, so a few people got together one winter night and stated the following:

  • The SparseTree implementation was hard to fix (and it has a lot of other services dependent upon it) and hard to replicate (the tree operations)
  • The Feature Availability Service does only full key reads (i.e. no recursive reads)
  • It’s much better to let the cache figure out what to cache based on needs rather than arbitrarily deciding at compile time

So we decided the following:

  1. Add a ConcurrentDictionary<String, String> on top of the existing SparseTree to cache single key reads (no locking, no string operations!)
  2. Cache “misses” so we don’t have to go to the SparseTree for non-existing keys
  3. Update the ConcurrentDictionary with the same SQL notifications used to update the SparseTree
  4. Keep the same (no eviction policy) of the current registry cache
  5. Go to the SparseTree for all recursive operations

Of course, 2 & 4 aren’t compatible. So after a few days, the ConcurrentDictionary got bloated with all sorts of hypothetical keys that could have existed but didn’t.

Lessons Learned

  • Don’t add a broken cache on top of another broken cache – fix the first one instead.
  • Caching cache misses is a very dangerous business.
  • Always make sure you have a limit on how big your cache can get
  • Make sure you understand the access patterns your cache will need to support efficiently (we thought we did…)

Cases of bad hash algorithms

These are classic problems with hash tables in general and Dictionary<T> in particular – they are very clever data structures that rely on a good hash function.

The case of the Registry Service Callback

The registry service supports a notification mechanism that lets callers provide a callback when a certain registry key (or hive) changes. It uses a Dictionary< RegistrySettingsChangedCallback, RegistryCallbackEntry> to keep track of the interested parties – that lets us un-register them quickly and efficiently – or so we thought…

But how exactly does the CLR determine the hash code of a callback?

   COUNT_T hash = HashStringA(moduleName);                 // Start the hash with the Module name           
   hash = HashCOUNT_T(hash, HashStringA(className));       // Hash in the name of the Class name
   hash = HashCOUNT_T(hash, HashStringA(methodName));      // Hash in the name of the Method name

The hashcode is basically a combination of the module, the class and the method. What could possibly go wrong?

One thing that’s missing from the above computation is the object itself – the callback is a combination of a method, identified by name/class/module and a target (the instance of the object). In our case, our callers were mostly the same method, but with different objects. The results was that our beautiful dictionary became a small number of giant linked lists since every hash value came from a small set (e.g., only about 10 unique hash values with thousands of entries each).

The case of the WIT Node Id Cache

Work Item Tracking had a cache of classification nodes (indexed by the node ID which was an Int32), and everything worked like a charm and everyone was happy. A little while later, during the project model conversion work for team project rename, it was necessary to add a Guid to fully qualify the node id. The node was changed from a Int32 to a Struct with both the Guid of the project and the node id, like so:

struct ClassificationNodeId

       {
           public Guid DataspaceId;
           public int NodeId;
       }

That seemed quite efficient and everything worked properly during testing, and even after the code was in production on nearly all of the TFS Scale Units it seemed to work well. When the scale unit with our largest account was upgraded, the CPU was pegged on all of the application tiers.

image

Note that the deployment was rolled back about 10 minutes after the first attempt. After getting the right instrumentation, we noticed that the CPU came from adding to a dictionary.

Question: What’s the default hash code implementation on a struct? Answer after the break!

The answer is actually quite complex. It depends on whether the struct has reference types embedded in it (like a String), and in this case, the hash code is the hashcode of the struct type combined with the hash code of the first non-static field. In this case, the GUID is a project GUID. In an account with a small number of projects but a large number of area paths, the result was a few unique hash values and thousands of entries that hashed to one of those few has values. Again, a Dictionary-based hash was turned into a list.

From the implementation of struct:

       /*=================================GetHashCode======================================
       **Action: Our algorithm for returning the hashcode is a little bit complex.  We look
       **        for the first non-static field and get it’s hashcode.  If the type has no
       **        non-static fields, we return the hashcode of the type.  We can’t take the
       **        hashcode of a static member because if that member is of the same type as
       **        the original type, we’ll end up in an infinite loop.
       **Returns: The hashcode for the type.
       **Arguments: None.
       **Exceptions: None.
       ===================================================================================*/
       [MethodImplAttribute(MethodImplOptions.InternalCall)]
       public extern override int GetHashCode();

The other problem is that the default Equals implementation uses reflection to do the comparison (memcmp isn’t enough if you have reference types like strings).

The case of the mostly case-sensitive Identity Descriptor

A lot of the APIs in IMS use IdentityDescriptors which are basically a Tuple<String,String>, and we want those strings to be case insensitive (because Joe Bob is the same as joe bob). This time, we actually wrote a Comparer to get the right behavior:

   public int Compare(IdentityDescriptor x, IdentityDescriptor y)
   {
       Int32 retValue = 0;
       if ((retValue = VssStringComparer.IdentityDescriptor.Compare(x.Identifier, y.Identifier)) != 0)
       {
           return retValue;
       }

       return VssStringComparer.IdentityDescriptor.Compare(x.IdentityType, y.IdentityType);
   }

   public bool Equals(IdentityDescriptor x, IdentityDescriptor y)
   {
       return Compare(x, y) == 0;
   }

   public int GetHashCode(IdentityDescriptor obj)
   {
       return obj.IdentityType.GetHashCode() + obj.Identifier.GetHashCode();
   }

What’s wrong with this code? Well, MSDN clearly states: “If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two objects do not have to return different values.”

Now we can plainly see that Compare(“Joe Bob”, “joe bob”) will return true, but what about their hash codes? That’s easy! It’s the sum of the two hash codes of the two strings. From looking at the definition for String.GetHashCode() we see:

   int hash1 = (5381 << 16) + 5381;
   int hash2 = hash1;
   unsafe
   {
       fixed (char* src = this)
       {
           // 32 bit machines.
           int* pint = (int*)src;
           int len = this.Length;
           while (len > 2)
           {
               hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
               hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1];
               pint += 2;
               len -= 4;
           }
           if (len > 0)
           {
               hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
           }
       }
   }

Clearly, that hash code is case sensitive. Now our dictionary is a giant cache miss machine, because we’ll have multiple entries for the same value in different buckets.

Lessons Learned

  • If you use a Dictionary<K,V> – make sure K has a good hash code and that your data is well distributed: bad hashing doesn’t affect correctness, so don’t think that it’s fine because it doesn’t throw.
  • When in doubt, pass a comparer to the Dictionary
  • Check your performance with the data you expect
  • Make sure you override GetHashCode and Equals properly

Caching Check List

  • Why is the data being requested so frequently and is that behavior necessary?
  • Do you really need to cache this data? Could you change your code not to require it?
  • How often does the underlying data change? Are you fine if your cache is out of date?
  • How much data will you need to store in memory?
  • What eviction policy will you have?
  • Have you fully understood the access patterns the cache will need to support?
  • What hit rate do you expect?
  • Do you have telemetry to know whether your cache is operating correctly in production?

Follow me at twitter.com/tfsbuck

Comments (2)

  1. Dave Shaw says:

    Nice to see some real world problems from a real system at scale. I've seen the struct issue before on a much smaller scale, we were caching a few 1,000 entries from disk when loading an App for reference later. We then needed to go from one String as the key to two, enter the struct, exit perf, loading went took about 40 times longer with structs over Strings as the key. In the end I just concatted the 2 strings as the key because the domain guaranteed uniqueness and everything went back to normal.

    The other caching issue I've seen are:

    a) poor management of concurrency, Dictionary, HashSet, etc can get in an infinite loop if you encounter certain race conditions and peg your CPU. I've seen them in the system I work on, and even logged a TFS bug against 2012.4 where Dictionaries were been taken out of Concurrent Dictionaries and used without full thread safety.

    b) caching without an invalidation option, loading values from the database that rarely ever change – IP's of other systems for example – until there is problem and you realise you have to recycle the services to pick up a new value. This is very common problem where I work.

  2. Buck Hodges says:

    Dave, yeah, the struct one is subtle. We've hit the other bugs you mentioned as well.

Skip to main content