Rico’s Performance Quiz Vs. The F1 Profiler

The Quiz:

 

Recently Ricom posted a performance quiz on his blog asking about the performance scenario difference between the two following code fragments. The first fragment uses generics while the second uses an arraylist.

 

            Now, I’m not a performance engineer like Rico, but I figured that this would be a fun scenario to check out using our profiler. So I fired up a quick console application (shown below) that would run one or the other of the code sections, I just commented out the call that I wasn’t using for each run. I wasn’t going hardcore about getting the session perfect, but I did take a stab at shutting down the services that hog the most resources, I’m looking at you MSNIndex!

 

 

            I ran each scenario twice in sampling mode (comparing the results to make sure that they were similar) and I also ran each one once in sampling mode with object allocation and lifetime profiling turned on. If you are trying this yourself, note that the object allocation run using the arraylist take a loooong time to run and to analyze, I probably should have cut down on the loop size for that run. I’ll explain why this is the case when I get to the results.

 

The Results:

            The first thing that I looked at was the overall number of samples that were taken in the non-allocation runs. Since I set the profile to take a sample every 10000000 clock cycles, the number of samples can give us an pretty good estimate as to how long we are running. The breakdown of samples (I also added some debug version numbers for comparison) is as follows:

 

Using generics(debug): 828 total samples

Using generics(release): 556 total samples

Using arraylist(debug): 1998 total samples

Using arraylist(release): 1907 total samples

 

            Looking at these numbers you can see that at least in terms of wall clock speed, we are getting a pretty big performance gain by using generics. We would expect this as we do not have to box the integers coming into and out of our list when we use generics. However there is a code size tradeoff when using generics.

 

            The next thing that I wanted to show is where each implementation was spending most of its execution time. Pictured below, I have listed all the functions (actually I truncated the listed to avoid having to show all the little functions) for which I have samples, and have sorted them in order of exclusive time (exclusive is time spent in just that function, inclusive time includes all the functions children). The top screenshot is from the generics run and the bottom screenshot is from the arraylist run.

 

 

Now the arraylist results: 

 

 

            When just looking that the percentages, it looks like the two implementations spend about the same amount of time doing the same things. Both spend the most time calling Enumerator.GetNext, as well as spending a lot of time in the main function (GenericTest and ArrayListTest). You’ll notice that the arraylist implementation has CLRStubOrUnknownAddress as the second highest function. This is the function that we show on a native to managed transition when we cannot resolve what function we are in. Since I’m interested in finding out why we have so many samples in CLRStubs; I right-clicked on this function to show functions that were called by “CLRStubOrUnknownAddress.” This jumped me over to caller / callee view (shown below) with CLRStubOrUnknownAddress selected as the root function. 

 

 

            Looking at this list we see that with the arraylist method we got 466 samples in CLRStubs, mainly from ArrayListEnumeratorSimple.get_Current function (154 samples from this function) and from the ArrayListTest function (285 samples from this function). While I don’t know the exact cause, I assume that the arraylist is using some quick transitions down to native code to run faster, and in such a tight loop, we end up seeing lots of the CLRStub calls (as these mark transitions to native).

 

            So we know that the generics implementation is running quite a bit faster and we know that both implementations are spending about the same percent of time in similar functions. Also, we have seen that the arraylist appears to be making some transitions to native code for a performance boost. So let us take a look at the object allocation reports to see what these two implementations are allocating. Shown below are the allocation summary reports with the generics implementation on top and the arraylist implementation on the bottom.

 

 

Now the arraylist results:

 

 

            Yikes! See those 10,000,000 enumerators allocated by the array list implementation? As I said, I’m not a performance engineer, but I do know that that is not a good thing to be doing. Looking back in the code, we see the foreach statement inside the 10,000,000 times loop in the arraylist code. Unlike the generics case, the arraylist has to allocate a new enumerator for the arraylist every time the foreach loop gets executed. So I guess that the moral of this performance quiz has less to do with generics and arraylists and more to do with being careful when using foreach loops.