Performance Quiz #10 -- Thread local storage -- Solution

I actually posted quiz #10 quite a while ago, but a comment with the correct solution came in so quickly that I wasn't very motivated to post a follow-up.  There are excellent links in the comments (thank you, readers!), but now I'll have to make the quizzes harder :)

The problem was to see what overhead is associated with various methods of creating flexible thread local storage.  I suggested two ways of having named storage.

I've posted a sample benchmark that expands on this and shows four different approaches (some less general than others). 

On my machine I observed the following times:

Test1: Named Slot               7,991ms
Test2: Numbered Slot            4,136ms
Test3: Thread-local dictionary  2,006ms
Test4: Thread-local direct        704ms
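
To make the four approaches concrete, here is a minimal sketch of what each kind of access might look like.  This is not the posted benchmark: the class name TlsSketch, the slot name "counter" and the helper methods are made up for illustration, and it assumes that "Numbered Slot" means an unnamed slot allocated once with Thread.AllocateDataSlot and cached in a static field.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical illustration only; not the benchmark referred to in the post.
static class TlsSketch
{
    // Approach 2 (assumed meaning of "Numbered Slot"): an unnamed slot,
    // allocated once and cached so no lookup by name is needed later.
    static readonly LocalDataStoreSlot cachedSlot = Thread.AllocateDataSlot();

    // Approach 3: one dictionary per thread, keyed by name.
    [ThreadStatic] static Dictionary<string, int> perThreadTable;

    // Approach 4: one strongly typed field per thread.
    [ThreadStatic] static int perThreadValue;

    // Approach 1: find the slot by name on every access, then read it.
    public static int ReadNamedSlot()
    {
        LocalDataStoreSlot slot = Thread.GetNamedDataSlot("counter");
        return (int)Thread.GetData(slot);
    }

    public static int ReadCachedSlot()
    {
        return (int)Thread.GetData(cachedSlot);
    }

    public static int ReadDictionary()
    {
        return perThreadTable["counter"];
    }

    public static int ReadDirect()
    {
        return perThreadValue;
    }

    // Per-thread setup: each thread must call this before using the readers,
    // because all four stores start out empty on a new thread.
    public static void InitCurrentThread(int value)
    {
        Thread.SetData(Thread.GetNamedDataSlot("counter"), value);
        Thread.SetData(cachedSlot, value);
        perThreadTable = new Dictionary<string, int>();
        perThreadTable["counter"] = value;
        perThreadValue = value;
    }
}
```

Calling TlsSketch.InitCurrentThread(42) on a thread and then any of the four readers on that same thread returns 42.  The readers differ only in how much work sits between the caller and the value, which is what the timings above are measuring.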

So, what's going on?  Well, I looked into it with our profiler and got these results, which show the extra costs pretty clearly.  Have a look at all the helper functions under Test1 and Test2.

Exclusive  Inclusive  Function Name
   0.39 %    89.92 %  Quiz10.Program.Main (string[])
   0.78 %    53.07 %     Quiz10.Program.Test1 ()
   0.95 %    25.19 %    |  System.LocalDataStoreMgr.GetNamedDataSlot (string)
   0.18 %    12.14 %    | |  JIT_MonReliableEnter (class Object *,bool *)
   5.76 %     8.06 %    | |  System.Collections.Hashtable.get_Item (object)
   3.05 %     3.11 %    | |  @JIT_MonExitWorker@4
   3.49 %    22.31 %    |  NativeArrayMarshalerBase::NativeArrayMarshalerBase (class CleanupWorkList *)
   0.43 %     5.97 %    | |  ThreadStore::LockDLSHash (void)
   0.14 %     5.41 %    | |  CantAllocThreads::MarkThread (void)
   0.04 %     2.80 %    | |  EEHashTableBase<int,class EEIntHashTableHelper,0>::FindItem (int)
   0.77 %     2.19 %    | |  FrameWithCookie<class HelperMethodFrame_1OBJ>::FrameWithCookie<class HelperMethodFrame_1OBJ> (void *,struct LazyMachState *,unsigned int,class Object * *)
   0.78 %     1.59 %    |  System.Threading.Thread.get_LocalDataStoreManager ()
   0.16 %     1.22 %    |  ThreadNative::GetDomainLocalStore (void)
   0.57 %     1.16 %    |  System.LocalDataStore.GetData (class System.LocalDataStoreSlot)
   0.66 %    26.72 %     Quiz10.Program.Test2 ()
   3.73 %    21.79 %    |  NativeArrayMarshalerBase::NativeArrayMarshalerBase (class CleanupWorkList *)
   0.46 %     5.79 %    | |  ThreadStore::LockDLSHash (void)
   0.18 %     5.13 %    | |  CantAllocThreads::MarkThread (void)
   0.05 %     3.13 %    | |  EEHashTableBase<int,class EEIntHashTableHelper,0>::FindItem (int)
   0.57 %     1.62 %    | |  FrameWithCookie<class HelperMethodFrame_1OBJ>::FrameWithCookie<class HelperMethodFrame_1OBJ> (void *,struct LazyMachState *,unsigned int,class Object * *)
   0.11 %     1.19 %    |  ThreadNative::GetDomainLocalStore (void)
   0.44 %     1.08 %    |  System.Threading.Thread.get_LocalDataStoreManager ()
   0.53 %     1.05 %    |  System.LocalDataStore.GetData (class System.LocalDataStoreSlot)
   0.25 %     8.43 %     Quiz10.Program.Test3 ()
   0.55 %     7.07 %    |  System.Collections.Generic.Dictionary`2.get_Item (!0)
   2.38 %     6.52 %    |    System.Collections.Generic.Dictionary`2.FindEntry (!0)
   0.20 %     1.30 %     Quiz10.Program.Test4 ()

The table above shows all functions starting from Main with an inclusive cost >= 1% and a depth of no more than 3 -- so things are missing but it's good for discussion. Under both Test1 and Test2 there's a good deal of locking and marshalling (roughly 22% inclusive in each), and Test1 pays about another 25% on top of that in GetNamedDataSlot for the lookup by name... looks like there is a big oops here. The good news is that the contract is sound, so hopefully this could be addressed. But really I'm not sure why I would even bother.  The other approach, using [ThreadStatic], is much cleaner and much faster.  I don't know why anyone would ever want to use the slots.
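
For reference, here is a minimal sketch of the [ThreadStatic] pattern with per-thread lazy initialization.  The names PerThreadCounter, Cell and CountOnNewThread are hypothetical and chosen only for this example.  The one thing to watch is that an initializer on a [ThreadStatic] field runs only on the thread that triggers the type's static initialization, so every other thread sees the default value and has to set the field up itself.

```csharp
using System;
using System.Threading;

// Hypothetical illustration of [ThreadStatic] with lazy per-thread setup.
static class PerThreadCounter
{
    // Deliberately no initializer: it would only run for one thread.
    [ThreadStatic] static int[] cell;

    // Creates this thread's private cell the first time the thread asks for it.
    static int[] Cell
    {
        get
        {
            if (cell == null) cell = new int[1];
            return cell;
        }
    }

    // Bumps the calling thread's own counter and returns the new value.
    public static int Increment()
    {
        return ++Cell[0];
    }

    // Calls Increment() `times` times on a fresh thread and returns the
    // last value that thread observed.
    public static int CountOnNewThread(int times)
    {
        int last = 0;
        Thread worker = new Thread(() =>
        {
            for (int i = 0; i < times; i++) last = Increment();
        });
        worker.Start();
        worker.Join();   // wait so that `last` is final before it is read
        return last;
    }
}
```

Each thread counts from zero independently: increments made on the worker thread do not show up in the caller's counter, and no lock or slot lookup is involved on the access path.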

For my part, rather than fix this I think I will ask that the relevant functions be deprecated -- the [ThreadStatic] approach seems better in every way.  The slot methods hereby have my personal deprecation, for what that's worth.