Performance Quiz #6 -- Looking at the sixth cut

Well, it's time for me to surrender.  Sort of :)

Raymond pulls out all the stops in his sixth version by painting a big bullseye on his biggest remaining source of slowness: operator new.  He turns in an excellent result here.  On my benchmark machine I see the number drop from 124ms to 62ms -- a full 2x faster from start to finish.  And, as noted in the footnote on my previous message, the runtime of his application is now comparable to the CLR's startup overhead...  I can't beat this time.

Let's look at the results table now to see how we ended up:

Version                           Execution Time (seconds)
Unmanaged v1                      1.328
Unmanaged v2                      0.828
Unmanaged v3                      0.343
Unmanaged v4                      0.187
Unmanaged v5 With Bug             0.296
Unmanaged v5 Corrected            0.124
Unoptimized Managed port of v1    0.124
Optimized Managed port of v1      0.093
Unmanaged v6                      0.062

Six versions and quite a bit of work later, we've been soundly trumped.  But before I discuss that, let me put up the internal profile of Raymond's version 6.

I've applied my usual filters to the call tree (nothing lower than 5% inclusive) and I also pruned out a couple of functions below HeapAlloc because they have long names and are boring :)

Function Name (Sanitized)            Exclusive %   Inclusive %
  _mainCRTStartup                        0           97.826
    _main                                0           97.826
      Dictionary::Dictionary(void)       5.435       96.739
        MultiByteToWideChar             19.565       25
          GetCPHashNode                  5.435        5.435
        operator new(unsigned int)       1.087       16.304
          ..                             0           14.13
            ..                           0           14.13
              AllocateHeap               4.348       13.043
        _free                            0            8.696
          FreeHeap                       2.174        8.696
        DictionaryEntry::Parse(...)      1.087       33.696
          StringPool::AllocString(...)   2.174       27.174
            _lstrcpynW                  19.565       25
              __SEH_prolog               5.435        5.435

You can see that the memory allocation time is way down as a percentage, and of course that's a smaller percentage of a smaller total time.  I think he also gets a lot of raw speed from the improved locality that the new allocator provides.  Interestingly, SEH overhead reaches a significant level in this run (over 5% for the first time), but it's still nothing to be worried about.
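To give a flavor of the technique, here is a minimal sketch of an arena-style string pool of the kind being described.  This is not Raymond's actual code; the class and method names merely echo the sanitized profile above, and the chunk size and error handling are my own assumptions.  The point is that one big HeapAlloc serves many AllocString calls, and the strings land next to each other in memory.

```cpp
// Illustrative arena-style string pool (not the actual code); the names
// echo the sanitized profile, everything else is an assumption.
#include <windows.h>
#include <new>

class StringPool
{
public:
    StringPool() : m_current(NULL), m_remaining(0) {}

    // Copies the span [begin, end) into the pool and null-terminates it.
    // Individual strings are never freed; the whole pool goes at once.
    WCHAR* AllocString(const WCHAR* begin, const WCHAR* end)
    {
        size_t chars = (end - begin) + 1;          // + 1 for the terminator
        if (chars > m_remaining)
            GrowChunk(chars);

        WCHAR* result = m_current;
        lstrcpynW(result, begin, static_cast<int>(chars));
        m_current   += chars;
        m_remaining -= chars;
        return result;
    }

private:
    void GrowChunk(size_t minimumChars)
    {
        // One big HeapAlloc serves many AllocString calls, which is what
        // drives the allocation percentage down and keeps the strings
        // adjacent in memory (the locality win).
        size_t chunkChars = minimumChars > 60000 ? minimumChars : 60000;
        m_current = static_cast<WCHAR*>(
            HeapAlloc(GetProcessHeap(), 0, chunkChars * sizeof(WCHAR)));
        if (!m_current)
            throw std::bad_alloc();
        m_remaining = chunkChars;
        // A real implementation would also track its chunks for cleanup.
    }

    WCHAR* m_current;
    size_t m_remaining;
};
```

A pool like that trades generality for speed: you can never free an individual string, but a load-once dictionary never needs to.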

So am I ashamed by my crushing defeat?  Hardly.  The managed code got a very good result for hardly any effort.  To defeat the managed version, Raymond had to:

  • Write his own file I/O
  • Write his own string class
  • Write his own allocator
  • Write his own international mapping (see the sketch after this list)
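For context on that last item, this is the usual two-call MultiByteToWideChar pattern, which still shows up at 25% inclusive in the profile above: one call to size the output, a second to convert.  The helper function and code-page parameter here are purely illustrative, not taken from the dictionary program.

```cpp
// Illustrative use of MultiByteToWideChar: the first call sizes the output,
// the second converts.  The helper and code page are assumptions.
#include <windows.h>
#include <vector>

std::vector<WCHAR> ToWide(const char* bytes, int byteCount, UINT codePage)
{
    int chars = MultiByteToWideChar(codePage, 0, bytes, byteCount, NULL, 0);
    std::vector<WCHAR> wide(chars > 0 ? chars : 0);
    if (chars > 0)
        MultiByteToWideChar(codePage, 0, bytes, byteCount, &wide[0], chars);
    return wide;
}
```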

Of course he used available lower-level libraries to do this, but that's still a lot of work.  Can you call what's left an STL program?  I don't think so.  He kept the std::vector class, which was ultimately never a problem, and he kept the find function; pretty much everything else is gone.
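To make that concrete, what survives is roughly this shape: a std::vector of parsed entries plus a find over it.  This is a sketch of the general idea only; the field and method names are made up, not the actual class.

```cpp
// Roughly the shape of the STL that survives: a std::vector of parsed
// entries and a find over it.  Field and method names are made up.
#include <algorithm>
#include <cwchar>
#include <vector>

struct DictionaryEntry
{
    const wchar_t* trad;      // traditional-character form (pooled string)
    const wchar_t* pinyin;    // pronunciation (pooled string)
    const wchar_t* english;   // definition (pooled string)
};

class Dictionary
{
public:
    // Simple linear find; the container itself was never the bottleneck.
    const DictionaryEntry* Find(const wchar_t* word) const
    {
        std::vector<DictionaryEntry>::const_iterator it =
            std::find_if(m_entries.begin(), m_entries.end(), MatchesTrad(word));
        return it != m_entries.end() ? &*it : NULL;
    }

private:
    // Predicate for find_if (pre-C++11 style, to match the era).
    struct MatchesTrad
    {
        explicit MatchesTrad(const wchar_t* w) : word(w) {}
        bool operator()(const DictionaryEntry& e) const
        {
            return wcscmp(e.trad, word) == 0;
        }
        const wchar_t* word;
    };

    std::vector<DictionaryEntry> m_entries;
};
```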

So, yup, you can definitely beat the CLR.  And I think Raymond can make his program go even faster still.

Interestingly, the time to parse the file as reported by both programs' internal timers is about the same -- 30ms for each.  The difference is in the overhead.
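In other words, each program brackets just the parse with its own internal timer, something along these lines.  This is a minimal sketch with a stand-in for the real parse; the choice of timer API here is an assumption, not a claim about either program.

```cpp
// Minimal sketch of bracketing just the parse with a high-resolution timer.
// ParseDictionary is a stand-in; the real programs time their own parse step.
#include <windows.h>
#include <stdio.h>

void ParseDictionary()
{
    Sleep(30);   // stand-in for reading and parsing the dictionary file
}

int main()
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&start);
    ParseDictionary();               // only this is inside the timer
    QueryPerformanceCounter(&stop);

    // Process startup and shutdown (CRT or CLR), and anything else outside
    // this bracket, shows up in the end-to-end time but not in this number.
    printf("parse: %.0f ms\n",
           (stop.QuadPart - start.QuadPart) * 1000.0 / freq.QuadPart);
    return 0;
}
```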

Tomorrow I'm going to talk about the space used by these programs and that will wrap it up.  Though I think Raymond is going to go on and do some actual UI and so forth with this series.  That should be fun to watch.