Performance Quiz #6 — Looking at the sixth cut

Well, it’s time for me to surrender.  Sort of :)

Raymond pulls out all the stops in his sixth version by painting a big bullseye on his biggest remaining source of slowness which is operator new.  He turns in an excellent result here.  On my benchmark machine I see the number drop from 124ms to 62ms — a full 2x faster from start to finish.  And observing the footnote on my previous message, the runtime for his application is now comparable to the CLR’s startup overhead…  I can’t beat this time.

Let’s look at the results table now to see how we ended up:

Version Execution Time
Unmanaged v1 1.328
Unmanaged v2 0.828
Unmanaged v3 0.343
Unmanaged v4 0.187
Unmanaged v5 With Bug 0.296
Unmanaged v5 Corrected 0.124
Unoptimized Managed port of v1    0.124
Optimized Managed port of v1 0.093
Unmanaged v6 0.062

Six versions and quite a bit of work later, we’ve been soundly trumped.  But before I discuss that, let me put up the internal profile of Raymond’s version 6

I’ve applied my usual filters to the call tree (nothing lower than 5% inclusive) and I also pruned out a couple of functions below HeapAlloc because they have long names and are boring :)

Function Name (Sanitized)     Exclusive
  _mainCRTStartup 0 97.826
    _main 0 97.826
      Dictionary::Dictionary(void) 5.435 96.739
        MultiByteToWideChar 19.565 25
          GetCPHashNode 5.435 5.435
        operator new(unsigned int) 1.087 16.304
          .. 0 14.13
            .. 0 14.13
              AllocateHeap 4.348 13.043
        _free 0 8.696
          FreeHeap 2.174 8.696
        DictionaryEntry::Parse(…) 1.087 33.696
          StringPool::AllocString(…) 2.174 27.174
            _lstrcpynW 19.565 25
              __SEH_prolog 5.435 5.435

You can see that the memory allocation time is way down as a percentage, and of course that’s a smaller percentage of a smaller total time.  I think he gets a lot of raw speed from his improved locality thanks to that new allocator as well.  Interestingly SEH overhead is up to a signifcant level in this run (now over 5% for the first time).  Still nothing to be worried about.

So am I ashamed by my crushing defeat?  Hardly.  The managed code got a very good result for hardly any effort. To defeat the managed Raymond had to:

  • Write his own file/io stuff
  • Write his own string class
  • Write his own allocator
  • Write his own international mapping

Of course he used available lower level libraries to do this, but that’s still a lot of work.  Can you call what’s left an STL program?  I don’t think so, I think he kept the std::vector class which ultimately was never a problem and he kept the find function.  Pretty much everything else is gone.

So, yup, you can definately beat the CLR.  Raymond can make his program go even faster I think.

Interestingly, the time to parse the file as reported by both programs internal timers is about the same — 30ms for each.  The difference is in the overhead.

Tomorrow I’m going to talk about the space used by these programs and that will wrap it up.  Though I think Raymond is going to go on and do some actual UI and so forth with this series.  That should be fun to watch.

Comments (14)

  1. mark says:

    Thanks. This has been an excellent series, Rico. Showing how powerful the CLR is, as well as showing some concrete examples of how to measure and improve within its walls. I think this series needs to become a cohesive article for MSDN.

  2. Great series of posts. It was wonderful to see the managed code perform so well right out of the gate. And, your subsequent optimization was simple and effective. Thanks for taking the time and sharing this with us.

  3. C++ guy says:

    Great series of posts. The CLR is very impressive.

  4. Raymond Chen (aka "fixed more Windows bugs than you’ve had hot dinners") and Rico Mariani (aka "Mr .NET…

  5. Nemanja Trifunovic says:

    >Six versions and quite a bit of work later, >we’ve been soundly trumped

    Excuse me, but this is not fair. Raymond has never tried to beat any "managed" version (I bet he doesn’t even know about your effort – you started it after he went to vacation) – all he wanted to prove is that to achieve optimization you need to measure and not to gues where the bottleneck is.

  6. ricom says:

    I assure you Raymond doesn’t need any defense :)

    Remember I work with Raymond, we talked about this series of postings at some length quite some time ago and then again when this got to the top of his queue.

    It’s all supposed to be an educational illustration. The competitive language is just a facade to make it less of a dry read.

    The outcome was never really in question, the journey is the illustration.

    Actually I think the fact that Raymond just did what he thought was best (well in advance of my comments) at each stage makes it more interesting, not less.

  7. So I was reading through one of my favorite MSDN blogs (

    And he…

  8. Keith J. Farmer says:

    So — excluding startup time — what’s the performance comparison like?

    It seems to me that (total – startup) ~ 0.03s, about half of the total time for v6. If startup for v6 is ~0.00s, then the actual working bits still seem twice as fast in the CLR.

  9. ricom says:

    No, they each do the job in about 30ms. The CLR has 30ms more startup overhead basically.

    Check out the space analysis for the reasons why.

  10. Ross says:

    That is kind of interesting.

    Do you have figures on the difference in memory usage?

  11. Ross says:

    Oops, just discovered the next blog entry :)

  12. I’m just not an expert.