Performance Quiz #6 -- Looking at the sixth cut

Well, it's time for me to surrender.  Sort of :)

Raymond pulls out all the stops in his sixth version by painting a big bullseye on his biggest remaining source of slowness: operator new.  He turns in an excellent result here.  On my benchmark machine I see the number drop from 124ms to 62ms -- a full 2x faster from start to finish.  And, as noted in the footnote on my previous message, the runtime of his application is now comparable to the CLR's startup overhead...  I can't beat this time.

Let's look at the results table now to see how we ended up:

Version                           Execution Time (seconds)
Unmanaged v1                      1.328
Unmanaged v2                      0.828
Unmanaged v3                      0.343
Unmanaged v4                      0.187
Unmanaged v5 With Bug             0.296
Unmanaged v5 Corrected            0.124
Unoptimized Managed port of v1    0.124
Optimized Managed port of v1      0.093
Unmanaged v6                      0.062

Six versions and quite a bit of work later, we've been soundly trumped.  But before I discuss that, let me put up the internal profile of Raymond's version 6.

I've applied my usual filters to the call tree (nothing lower than 5% inclusive) and I also pruned out a couple of functions below HeapAlloc because they have long names and are boring :)

Function Name (Sanitized)            Exclusive %   Inclusive %
  _mainCRTStartup                        0           97.826
    _main                                0           97.826
      Dictionary::Dictionary(void)       5.435       96.739
        MultiByteToWideChar             19.565       25
          GetCPHashNode                  5.435        5.435
        operator new(unsigned int)       1.087       16.304
          ..                             0           14.13
            ..                           0           14.13
              AllocateHeap               4.348       13.043
        _free                            0            8.696
          FreeHeap                       2.174        8.696
        DictionaryEntry::Parse(...)      1.087       33.696
          StringPool::AllocString(...)   2.174       27.174
            _lstrcpynW                  19.565       25
              __SEH_prolog               5.435        5.435

You can see that the memory allocation time is way down as a percentage, and of course that's a smaller percentage of a smaller total time.  I think he also gets a lot of raw speed from the improved locality that the new allocator provides.  Interestingly, SEH overhead reaches a significant level in this run (over 5% for the first time), but it's still nothing to be worried about.
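To give a flavor of the technique, here is a minimal sketch of an arena-style string pool of the kind being described.  This is not Raymond's actual code; the class and method names merely echo the sanitized profile above, and the chunk size and error handling are my own assumptions.  The point is that one big HeapAlloc serves many AllocString calls, and the strings land next to each other in memory.

```cpp
// Illustrative arena-style string pool (not the actual code); the names
// echo the sanitized profile, everything else is an assumption.
#include <windows.h>
#include <new>

class StringPool
{
public:
    StringPool() : m_current(NULL), m_remaining(0) {}

    // Copies the span [begin, end) into the pool and null-terminates it.
    // Individual strings are never freed; the whole pool goes at once.
    WCHAR* AllocString(const WCHAR* begin, const WCHAR* end)
    {
        size_t chars = (end - begin) + 1;          // + 1 for the terminator
        if (chars > m_remaining)
            GrowChunk(chars);

        WCHAR* result = m_current;
        lstrcpynW(result, begin, static_cast<int>(chars));
        m_current   += chars;
        m_remaining -= chars;
        return result;
    }

private:
    void GrowChunk(size_t minimumChars)
    {
        // One big HeapAlloc serves many AllocString calls, which is what
        // drives the allocation percentage down and keeps the strings
        // adjacent in memory (the locality win).
        size_t chunkChars = minimumChars > 60000 ? minimumChars : 60000;
        m_current = static_cast<WCHAR*>(
            HeapAlloc(GetProcessHeap(), 0, chunkChars * sizeof(WCHAR)));
        if (!m_current)
            throw std::bad_alloc();
        m_remaining = chunkChars;
        // A real implementation would also track its chunks for cleanup.
    }

    WCHAR* m_current;
    size_t m_remaining;
};
```

A pool like that trades generality for speed: you can never free an individual string, but a load-once dictionary never needs to.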

So am I ashamed by my crushing defeat?  Hardly.  The managed code got a very good result for hardly any effort.  To defeat the managed version, Raymond had to:

  • Write his own file I/O
  • Write his own string class
  • Write his own allocator
  • Write his own international mapping (see the sketch after this list)
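For context on that last item, this is the usual two-call MultiByteToWideChar pattern, which still shows up at 25% inclusive in the profile above: one call to size the output, a second to convert.  The helper function and code-page parameter here are purely illustrative, not taken from the dictionary program.

```cpp
// Illustrative use of MultiByteToWideChar: the first call sizes the output,
// the second converts.  The helper and code page are assumptions.
#include <windows.h>
#include <vector>

std::vector<WCHAR> ToWide(const char* bytes, int byteCount, UINT codePage)
{
    int chars = MultiByteToWideChar(codePage, 0, bytes, byteCount, NULL, 0);
    std::vector<WCHAR> wide(chars > 0 ? chars : 0);
    if (chars > 0)
        MultiByteToWideChar(codePage, 0, bytes, byteCount, &wide[0], chars);
    return wide;
}
```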

Of course he used available lower-level libraries to do this, but that's still a lot of work.  Can you call what's left an STL program?  I don't think so.  He kept the std::vector class, which was ultimately never a problem, and he kept the find function; pretty much everything else is gone.
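To make that concrete, what survives is roughly this shape: a std::vector of parsed entries plus a find over it.  This is a sketch of the general idea only; the field and method names are made up, not the actual class.

```cpp
// Roughly the shape of the STL that survives: a std::vector of parsed
// entries and a find over it.  Field and method names are made up.
#include <algorithm>
#include <cwchar>
#include <vector>

struct DictionaryEntry
{
    const wchar_t* trad;      // traditional-character form (pooled string)
    const wchar_t* pinyin;    // pronunciation (pooled string)
    const wchar_t* english;   // definition (pooled string)
};

class Dictionary
{
public:
    // Simple linear find; the container itself was never the bottleneck.
    const DictionaryEntry* Find(const wchar_t* word) const
    {
        std::vector<DictionaryEntry>::const_iterator it =
            std::find_if(m_entries.begin(), m_entries.end(), MatchesTrad(word));
        return it != m_entries.end() ? &*it : NULL;
    }

private:
    // Predicate for find_if (pre-C++11 style, to match the era).
    struct MatchesTrad
    {
        explicit MatchesTrad(const wchar_t* w) : word(w) {}
        bool operator()(const DictionaryEntry& e) const
        {
            return wcscmp(e.trad, word) == 0;
        }
        const wchar_t* word;
    };

    std::vector<DictionaryEntry> m_entries;
};
```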

So, yup, you can definitely beat the CLR.  And I think Raymond can make his program go even faster still.

Interestingly, the time to parse the file as reported by both programs' internal timers is about the same -- 30ms for each.  The difference is in the overhead.
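In other words, each program brackets just the parse with its own internal timer, something along these lines.  This is a minimal sketch with a stand-in for the real parse; the choice of timer API here is an assumption, not a claim about either program.

```cpp
// Minimal sketch of bracketing just the parse with a high-resolution timer.
// ParseDictionary is a stand-in; the real programs time their own parse step.
#include <windows.h>
#include <stdio.h>

void ParseDictionary()
{
    Sleep(30);   // stand-in for reading and parsing the dictionary file
}

int main()
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&start);
    ParseDictionary();               // only this is inside the timer
    QueryPerformanceCounter(&stop);

    // Process startup and shutdown (CRT or CLR), and anything else outside
    // this bracket, shows up in the end-to-end time but not in this number.
    printf("parse: %.0f ms\n",
           (stop.QuadPart - start.QuadPart) * 1000.0 / freq.QuadPart);
    return 0;
}
```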

Tomorrow I'm going to talk about the space used by these programs and that will wrap it up.  Though I think Raymond is going to go on and do some actual UI and so forth with this series.  That should be fun to watch.