Performance Quiz #6 — Conclusion, Studying the Space


I thought I would end this series by looking at the space costs of both solutions because it gives a real window into what I do every day and perhaps it’s helpful to understand the multidimensional aspects of performance analysis.  It’s also interesting because space analysis tends to be more predictive (though not always) of the cold startup times which can be very interesting.   This is what I dream about at night — scary thought that.


OK first lets compare the unoptimized managed to the optimized managed version looking just at managed allocations.

Unoptimized Managed Versions

Total Allocations 135854 Objects 6662768 Bytes

Top Allocated Types

Count Bytes Type
107841 5818642 System.String
26726 534520 NS.Test.DictionaryEntry
26 271292 System.Object []
1018 20360 System.Text.StringBuilder

Optimized Managed Version

Total Allocations 107361 Objects 4045306 Bytes

Top Allocated Types

Count Bytes Type
80351 3139080 System.String
26726 534520 NS.Test.DictionaryEntry
26 271292 System.Object []


OK there you go in the optimized version we can see that we allocated only 80351 strings which is basically the minimum of 3 strings per line of the file.  In the unoptimized version we were allocating a string for each line before we parsed it, so the total number of strings is 4 times the number of lines.  This approach of considering the amount of work the code does per unit of work in the input is a great technique and it’s something you’ll see in a lot of my analyses.  That was a good bit of savings.


Another thing you’ll notice is that there were 1018 Stringbuilders in the original example which are gone in the new code.  And you may ask yourself:  Stringbuilders?  We didn’t allocate any steenkin Stringbuilders.  Why did we get any Stringbuilders?  My guess (and I haven’t confirmed this but I think it’s a good guess) is that when the line reader finds a partial line at the end of its current buffer it stores that line in a Stringbuilder and then appends the next piece from the next buffer to it.  I suspect it uses a Stringbuilder because it doesn’t know in advance how many partial reads it will have to do to make a line.  In our case we emptied the buffer 1018 times — that buffer must be about 1k big (1024*1018 would be quite close to our file size).  Not to worry though, that was clearly not a source of performance problems but it’s neat to see it pop up here to say hello 🙂


Both of these runs did 3 garbage collections during the course of the run so the GC did some real work here and we still got zippy performance.  Three cheers for good locality 🙂

Now lets look at the space situation as compared to the unmanaged code.  And here is where the CLR is really going to take its lumps.  I’ve been painting a very rosy picture so far, and I think for good reason, but it’s time for me to get critical of my own stuff.  If you want to complain about managed systems this is what to complain about.  Better to hear it from me than from a stranger.


These reports were generated by post-processing the output of vadump into nice columns.  The expanded module list includes only the top 10 modules that added new code to the managed solution


Kilobytes by        Unmanaged         Managed           Difference
Category
Total Private Total Private Total Private

Page Table 24 24 100 100 76 76
Other System 24 24 28 28 4 4
Code/Static 456 44 3232 320 2776 276
Heap 92 92 284 284 192 192
Stack 8 8 28 28 20 20
Teb 4 4 12 12 8 8
Mapped Data 140 0 304 0 164 0
Other Data 1852 1848 4172 4168 2320 2320

Total Modules 456 44 3232 320 2776 276
Total Dynamic 2096 1952 4800 4492 2704 2540
Total System 48 48 128 128 80 80
Grand Total 2600 2044 8160 4940 5560 2896

mscorwks.dll – – 928 48 928 48
mscorlib.ni.dll – – 464 40 464 40
msvcr80.dll – – 232 32 232 32
shell32.dll – – 200 36 200 36
msvcrt.dll – – 140 16 140 16
ole32.dll – – 132 16 132 16
shlwapi.dll – – 116 8 116 8
mscoree.dll – – 80 16 80 16
user32.dll – – 76 8 76 8
advapi32.dll – – 64 4 64 4


So, the summary is that the unmanaged code solved this problem in about 2.6M total wheras the managed code used a total of about 8.1M.   That’s the real cost of managed code — it’s measured in space.  The total marginal cost is about 5.5M (highlighted on the grand total line).   And it’s coming from two places.


Source #1: “Other Data”


“Other Data” is where the GC heap is.  The GC heap gets its performance by doing things in a lazy fashion.  No need to reclaim every byte as it is freed, instead let some junk accumulate and when there’s a goodly amount of junk, but not too much, come along and clean it up.  It’s economical and it results in a great marginal cost per byte (even including the cleanup time) as this example is showing.   However it does mean that at any given moment there will be more memory commited than is strictly necessary to solve the problem.  You see that price here.  


The other part of the price is more hidden but let me reveal it:  we did allocate the correct number of strings but … dramatic pause … that file is only about 1.1M remember?  Even if it was all single byte characters (which it isn’t) converting it to unicode should have taken about 2.2M — however we allocated 3.1M of strings in the optimized version…  Whoa… 3.1M of strings?  Why did we do that?  Well, managed strings include the usual object header stuff so that you can lock them for instance — yes you can write lock(strFoo) and it works — *grumble* — and you pay a price for that.  Then there’s the methodtable pointer because our strings are first-class objects unlike the C style strings Raymond used.  And our strings are length delimited not null.  All of that means overhead per string and the strings in this example are sufficiently small on average that it’s adding up to about 50%.  Blecch.


I could target a lot of that overhead by using an index based approach if it mattered.  Reading into byte buffers and keeping offsets.  You could do the same in the unmanaged of course.  I didn’t choose to do that at this point because I liked the object oriented nature of what I had and besides which I’m sick of this example.  But if it mattered you could fix it.  For a bigger dictionary it might be worth it.


Source #2: Modules


I listed the modules which had the greatest incremental contribution to the space and not-surprisingly they are mscorwks.dll and mscorlib.ni.dll (that’s the prejitted image of mscorlib).  So it’s basically the CLR that’s the biggest size.  Why so much CLR code for such a simple program?  There were clues in the previous performance profiles:  We do a lot at startup, and in fact some of those other modules are there for the same reasons.   Some things that we do:



  • prepare for any security that might be required, at minimum validate that this is a full trust scenario
  • read security policy files to confirm the above (or consult caches)
  • check for the presence of assorted registry keys to configure the CLR
  • initialize performance counters to publish interesting statistics about this program
  • initialize OLE
  • find some common user paths such as current correct <Documents and Settings/The User> location, that requires shell32.dll

Those things are a great idea for a real application but not so clever for something as small as this reader.  I mean did we *really* need to create performance counters for this app?  I think that’s probably overkill but we do it as a matter of course.


Mitigating Factors


The biggest mitigating factor comes from looking at the Private column. Examining the above you’ll see that we only added 2896k of private memory.  That means the bulk of those extra modules we loaded were in fact sharable between programs.   We like shared bytes because we don’t need additional physical memory to hold them — we can reuse the same physical memory in multiple processes.  Looking closer, we in fact already know that 2320k of the private growth was the larger strings plus normal GC overshoot.   That leaves 576k of added private bytes we can target for eradication. 


Anyone want to guess what my #1 job is performance wise?  Yup you got it in one:  Find ways to reduce the number of private bytes in managed applications.  We’ve come a long way on this score; in v1.1 managed code modules used to be around 40% private bytes. In Whidbey we’re about 10% private, or 90% shareable (see the numbers for mscorlib.ni above for an example).


Conclusion


Even with all these overheads we did a pretty darn good job right out of the gate.  Remember we turned in a fabulous bottom line with zero effort even including the downsides mentioned above.


What can you learn from this?  I think the main lesson is that you get the best “deal” from the CLR when you give it a meaty amount of work to do.  For very small programs, if you don’t mind a few megabytes of overhead you’re fine but if you need to be tiny you’ll be disappointed.  For large existing programs, you’ll be very sad indeed if you add managed code to your program to do just one very small thing — it’s not a good deal. 


If you’re introducing managed code, give us something meaty and you’ll be much more pleased.  Or else use the managed code at a time when space is not at a premium.  The worst thing you could possibly do is add managed code to an already long startup sequence and have it do just a wee little something that’s inconvenient to do in unmanaged code.  Benchmarkers keep this in mind, and I’ve I said this before but it bears repeating: If you want to make this framework  look bad, or any framework for that matter, then benchmark it doing almost nothing.  You can get all the badness you like that way, but it’s vacuous badness.


If you instead keep these parameters in mind and use the system thoughfully you can have great success — often with comparative ease!  Personally I think this Dictionary reader, the managed version, was a tremendous success.


Thanks for reading and I hope you’ve enjoyed this series.

Comments (5)

  1. Thomas says:

    [Note: I accidently posted this same message in the previous blog about this, before I noticed this followup. Sorry for that.]

    Rico,

    I’m suprised that Raymond jumps down to the C level so quickly. He went straight to raw string pointers instead of just having a vector pointers to entries. He didnt consider using a boost memory pool to store said entries and then having the vector as an index sorta.

    All in all while I agree that the CLR is excellent at these memory allocation intensive scenarios, it seems to me that Raymond didn’t really harness the power of the STL and boost all that well.

  2. ricom says:

    No problem, I deleted the extra copy for you.

    When Raymond returns from vacation he might have some useful insights into why he did things the way that he did them. I have some guesses but it’s total speculation. Let’s see what he says 🙂

  3. Will Dean says:

    I expect that you’ve read this:

    http://www.sysinternals.com/blog/#111363926377819242

    In some ways, he could be said to be illustrating your point that doing nothing with the framework makes it look bad.

    However, as he points out, the badness seems to continue on up as the applications get more complex. Is this likely to be largely due to the significant per-object overhead?

    But maybe it doesn’t really matter…

  4. ricom says:

    As far as object overhead goes, compared to C++ it’s only 4 bytes more per object for normal objects. It may be less if you consider overhead imposed by your allocator.

    This example is particularly bad because the C style strings aren’t objects and they have the length stuff in managed code because they are variable length, and they’re small strings.

    But really, it’s never been a good idea to allocate lots of small objects.

    I could nix that overhead if I wanted to — keep the char buffers for instance and do basically what you’d do in C. I chose to eat the tax in this case.

    Whether or not you get a lot of overhead depends on if you build crazy object graphs or not 🙂

  5. Ever wonder how I produce nice textual allocation summaries like this one?

    This report shows allocations…