Performance Quiz #6 -- Conclusion, Studying the Space

I thought I would end this series by looking at the space costs of both solutions because it gives a real window into what I do every day and perhaps it's helpful to understand the multidimensional aspects of performance analysis.  It's also interesting because space analysis tends to be more predictive (though not always) of the cold startup times which can be very interesting.   This is what I dream about at night -- scary thought that.

OK first lets compare the unoptimized managed to the optimized managed version looking just at managed allocations.

 Unoptimized Managed Versions

Total Allocations 135854 Objects 6662768 Bytes

Top Allocated Types

   Count    Bytes Type
  107841  5818642 System.String
   26726   534520 NS.Test.DictionaryEntry
      26   271292 System.Object []
    1018    20360 System.Text.StringBuilder

Optimized Managed Version

Total Allocations 107361 Objects 4045306 Bytes

Top Allocated Types

   Count    Bytes Type
   80351  3139080 System.String
   26726   534520 NS.Test.DictionaryEntry
      26   271292 System.Object []

OK there you go in the optimized version we can see that we allocated only 80351 strings which is basically the minimum of 3 strings per line of the file.  In the unoptimized version we were allocating a string for each line before we parsed it, so the total number of strings is 4 times the number of lines.  This approach of considering the amount of work the code does per unit of work in the input is a great technique and it's something you'll see in a lot of my analyses.  That was a good bit of savings.

Another thing you'll notice is that there were 1018 Stringbuilders in the original example which are gone in the new code.  And you may ask yourself:  Stringbuilders?  We didn't allocate any steenkin Stringbuilders.  Why did we get any Stringbuilders?  My guess (and I haven't confirmed this but I think it's a good guess) is that when the line reader finds a partial line at the end of its current buffer it stores that line in a Stringbuilder and then appends the next piece from the next buffer to it.  I suspect it uses a Stringbuilder because it doesn't know in advance how many partial reads it will have to do to make a line.  In our case we emptied the buffer 1018 times -- that buffer must be about 1k big (1024*1018 would be quite close to our file size).  Not to worry though, that was clearly not a source of performance problems but it's neat to see it pop up here to say hello :)

Both of these runs did 3 garbage collections during the course of the run so the GC did some real work here and we still got zippy performance.  Three cheers for good locality :)

Now lets look at the space situation as compared to the unmanaged code.  And here is where the CLR is really going to take its lumps.  I've been painting a very rosy picture so far, and I think for good reason, but it's time for me to get critical of my own stuff.  If you want to complain about managed systems this is what to complain about.  Better to hear it from me than from a stranger.

These reports were generated by post-processing the output of vadump into nice columns.  The expanded module list includes only the top 10 modules that added new code to the managed solution

 Kilobytes by        Unmanaged         Managed           Difference<br>Category         Total  Private    Total  Private    Total  Private

Page Table          24       24      100      100       76       76
Other System        24       24       28       28        4        4
Code/Static        456       44     3232      320     2776      276
Heap                92       92      284      284      192      192
Stack                8        8       28       28       20       20
Teb                  4        4       12       12        8        8
Mapped Data        140        0      304        0      164        0
Other Data        1852     1848     4172     4168     2320     2320

Total Modules      456       44     3232      320     2776      276
Total Dynamic     2096     1952     4800     4492     2704     2540
Total System        48       48      128      128       80       80
Grand Total       2600     2044     8160     4940     5560     2896

mscorwks.dll         -        -      928       48      928       48
mscorlib.ni.dll      -        -      464       40      464       40
msvcr80.dll          -        -      232       32      232       32
shell32.dll          -        -      200       36      200       36
msvcrt.dll           -        -      140       16      140       16
ole32.dll            -        -      132       16      132       16
shlwapi.dll          -        -      116        8      116        8
mscoree.dll          -        -       80       16       80       16
user32.dll           -        -       76        8       76        8
advapi32.dll         -        -       64        4       64        4

So, the summary is that the unmanaged code solved this problem in about 2.6M total wheras the managed code used a total of about 8.1M.   That's the real cost of managed code -- it's measured in space.  The total marginal cost is about 5.5M (highlighted on the grand total line).   And it's coming from two places.

Source #1: "Other Data"

"Other Data" is where the GC heap is.  The GC heap gets its performance by doing things in a lazy fashion.  No need to reclaim every byte as it is freed, instead let some junk accumulate and when there's a goodly amount of junk, but not too much, come along and clean it up.  It's economical and it results in a great marginal cost per byte (even including the cleanup time) as this example is showing.   However it does mean that at any given moment there will be more memory commited than is strictly necessary to solve the problem.  You see that price here.  

The other part of the price is more hidden but let me reveal it:  we did allocate the correct number of strings but ... dramatic pause ... that file is only about 1.1M remember?  Even if it was all single byte characters (which it isn't) converting it to unicode should have taken about 2.2M --- however we allocated 3.1M of strings in the optimized version...  Whoa... 3.1M of strings?  Why did we do that?  Well, managed strings include the usual object header stuff so that you can lock them for instance -- yes you can write lock(strFoo) and it works -- *grumble* -- and you pay a price for that.  Then there's the methodtable pointer because our strings are first-class objects unlike the C style strings Raymond used.  And our strings are length delimited not null.  All of that means overhead per string and the strings in this example are sufficiently small on average that it's adding up to about 50%.  Blecch.

I could target a lot of that overhead by using an index based approach if it mattered.  Reading into byte buffers and keeping offsets.  You could do the same in the unmanaged of course.  I didn't choose to do that at this point because I liked the object oriented nature of what I had and besides which I'm sick of this example.  But if it mattered you could fix it.  For a bigger dictionary it might be worth it.

Source #2: Modules

I listed the modules which had the greatest incremental contribution to the space and not-surprisingly they are mscorwks.dll and mscorlib.ni.dll (that's the prejitted image of mscorlib).  So it's basically the CLR that's the biggest size.  Why so much CLR code for such a simple program?  There were clues in the previous performance profiles:  We do a lot at startup, and in fact some of those other modules are there for the same reasons.   Some things that we do:

  • prepare for any security that might be required, at minimum validate that this is a full trust scenario
  • read security policy files to confirm the above (or consult caches)
  • check for the presence of assorted registry keys to configure the CLR
  • initialize performance counters to publish interesting statistics about this program
  • initialize OLE
  • find some common user paths such as current correct <Documents and Settings/The User> location, that requires shell32.dll

Those things are a great idea for a real application but not so clever for something as small as this reader.  I mean did we *really* need to create performance counters for this app?  I think that's probably overkill but we do it as a matter of course.

Mitigating Factors

The biggest mitigating factor comes from looking at the Private column. Examining the above you'll see that we only added 2896k of private memory.  That means the bulk of those extra modules we loaded were in fact sharable between programs.   We like shared bytes because we don't need additional physical memory to hold them -- we can reuse the same physical memory in multiple processes.  Looking closer, we in fact already know that 2320k of the private growth was the larger strings plus normal GC overshoot.   That leaves 576k of added private bytes we can target for eradication. 

Anyone want to guess what my #1 job is performance wise?  Yup you got it in one:  Find ways to reduce the number of private bytes in managed applications.  We've come a long way on this score; in v1.1 managed code modules used to be around 40% private bytes. In Whidbey we're about 10% private, or 90% shareable (see the numbers for mscorlib.ni above for an example).

Conclusion

Even with all these overheads we did a pretty darn good job right out of the gate.  Remember we turned in a fabulous bottom line with zero effort even including the downsides mentioned above.

What can you learn from this?  I think the main lesson is that you get the best "deal" from the CLR when you give it a meaty amount of work to do.  For very small programs, if you don't mind a few megabytes of overhead you're fine but if you need to be tiny you'll be disappointed.  For large existing programs, you'll be very sad indeed if you add managed code to your program to do just one very small thing -- it's not a good deal. 

If you're introducing managed code, give us something meaty and you'll be much more pleased.  Or else use the managed code at a time when space is not at a premium.  The worst thing you could possibly do is add managed code to an already long startup sequence and have it do just a wee little something that's inconvenient to do in unmanaged code.  Benchmarkers keep this in mind, and I've I said this before but it bears repeating: If you want to make this framework  look bad, or any framework for that matter, then benchmark it doing almost nothing.  You can get all the badness you like that way, but it's vacuous badness.

If you instead keep these parameters in mind and use the system thoughfully you can have great success -- often with comparative ease!  Personally I think this Dictionary reader, the managed version, was a tremendous success.

Thanks for reading and I hope you've enjoyed this series.