Reducing Memory Usage to Increase Concurrency

There are multiple ways for a computer to solve permutation problems. As an exercise in comparing the performance of different solutions, I wrote several applications that solve the same problem: given a sequence and a set of legal permutations, determine whether the sequence can be transformed into another sequence. Each of my little puzzle-solving applications takes a sequence of the numbers 0 to 8 and determines whether, following the valid permutation rules, the sequence can be put in order. The number of possible permutations for a sequence of nine numbers is 9! (362,880). Depending on what the valid permutations are, it can be impossible to transform one sequence into another. My applications solve this by calculating all possible legal permutations of the given sequence. What makes this problem eligible for a concurrent solution is that at any given step a sequence may be transformable into several other sequences, and each of those sequences can be analyzed independently of the others (i.e. concurrently).
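The shape of each solver is roughly the sketch below. This is a minimal illustration rather than the original code: GetLegalSuccessors and its adjacent-swap rule are hypothetical stand-ins for a solver's real permutation rules, and the _seen dictionary is an assumed de-duplication step so each permutation is evaluated at most once.

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

class PuzzleSolverSketch
{
    // Remember every sequence already queued so that each of the
    // 9! = 362,880 possible permutations is evaluated at most once.
    private readonly ConcurrentDictionary<string, bool> _seen =
        new ConcurrentDictionary<string, bool>();

    protected void EvalWrapper(object obj)
    {
        int[] sequence = (int[])obj;
        // Each legal successor is independent of its siblings, so each
        // one becomes its own ThreadPool work item.
        foreach (int[] next in GetLegalSuccessors(sequence))
        {
            if (_seen.TryAdd(string.Join(",", next), true))
            {
                ThreadPool.QueueUserWorkItem(EvalWrapper, next);
            }
        }
    }

    // Hypothetical rule: allow swapping any two adjacent elements.
    private static IEnumerable<int[]> GetLegalSuccessors(int[] s)
    {
        for (int i = 0; i < s.Length - 1; i++)
        {
            int[] copy = (int[])s.Clone();
            int tmp = copy[i];
            copy[i] = copy[i + 1];
            copy[i + 1] = tmp;
            yield return copy;
        }
    }
}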

For some of the “solutions”, the bottlenecks were practically intentional. I would run the application under the Concurrency Visualizer, look at places where a thread was blocked, and sure enough, it was blocked on a lock in my application’s code. That was the purpose of the exercise, though: to practice analyzing performance sessions to find an application’s bottlenecks. The Concurrency Visualizer did, however, help me find one contention issue I wasn’t aware of.

One of my solutions was practically lock free. The application did very little work inside its critical sections and should have utilized the CPU efficiently. For the most part it did. But the CPU Utilization view of the Concurrency Visualizer showed that the application’s usage was spiking instead of staying nearly constant, which is what I would expect of a mostly lock-free application. In the Visualizer’s Threads view there was a recognizable pattern: all of the threads but one were stuck in contention, all waiting on the single running thread. I wasn’t too surprised, but I checked the Current Stack tab of the blocked threads to verify that the critical section was one in the application’s code. It never was.

All of the blocked threads’ stacks looked something like this:

clr.dll!CLREvent::WaitEx
clr.dll!WKS::gc_heap::wait_for_gc_done
clr.dll!WKS::gc_heap::try_allocate_more_space
clr.dll!WKS::GCHeap::Alloc
clr.dll!AllocateArrayEx
clr.dll!JIT_NewArr1
mscorlib.dll!System.Collections.Generic.List`1[System.__Canon]..ctor
numberpuzzle.dll!NumberPuzzle.NumberPuzzleBase.EvaluateState

Okay, so the Garbage Collector has a lock. There’s no surprise there; the GC has to manage memory. The frequency with which the GC runs is driven by the number of objects allocated, so perhaps an object was being allocated that I wasn’t aware of, increasing the GC’s workload. To find out, I ran the Visual Studio performance wizard again, but this time instead of selecting the option to “Visualize the behavior of a multithreaded application”, I chose the option to “Track managed memory allocation”. After a few runs I discovered something: the second most allocated object was System.Threading.WaitCallback. This was a discovery because the application code never creates a WaitCallback object, yet the profiler reported that the application allocated 88,373,454 bytes on average, with 12.8% of those bytes being WaitCallback objects. In other words, I couldn’t account for roughly one eighth of the allocated bytes.

All of those allocations were coming from one line:

ThreadPool.QueueUserWorkItem(EvalWrapper, state);

What is happening at this point is that the application has discovered one more sequence permutation to evaluate and is adding that work to the ThreadPool. The documentation for QueueUserWorkItem shows that it takes a WaitCallback delegate in addition to a state object. But EvalWrapper is a plain method with the following signature:

protected void EvalWrapper(Object obj)

That’s a method, not a WaitCallback. So what’s going on here? The compiler sees that I want to pass the EvalWrapper method, and since EvalWrapper’s signature matches the signature of the WaitCallback delegate, the compiler instantiates a WaitCallback on the application’s behalf and passes it along to QueueUserWorkItem. And it allocates a new WaitCallback every time this line executes.
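In other words, the original call compiles as if the delegate construction had been written out explicitly, creating a new delegate object on every call:

ThreadPool.QueueUserWorkItem(new WaitCallback(EvalWrapper), state);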

With this new information in hand, I modified the application to create a WaitCallback member for the class (constructed from EvalWrapper) and changed the call to QueueUserWorkItem to use that member. After running the application through the memory profiler again, it now allocates 78,214,686 bytes on average. Since noticeably less memory is allocated, there is less work for the Garbage Collector to do, which leaves more CPU time for the application and increases concurrency.
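A sketch of that change is below. NumberPuzzleBase and EvalWrapper come from the application; the _evalCallback field and the QueueEvaluation helper are illustrative names, not the application’s actual members.

using System.Threading;

class NumberPuzzleBase
{
    // One WaitCallback per solver instance instead of one per queued item.
    private readonly WaitCallback _evalCallback;

    public NumberPuzzleBase()
    {
        _evalCallback = new WaitCallback(EvalWrapper);
    }

    protected void EvalWrapper(object obj)
    {
        // Evaluate the sequence permutation carried in obj.
    }

    private void QueueEvaluation(object state)
    {
        // Reuse the cached delegate; nothing new is allocated on this path.
        ThreadPool.QueueUserWorkItem(_evalCallback, state);
    }
}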

Jared Van Leeuwen – Parallel Computing Platform