Performance Quiz #6 -- Looking at the sixth cut
Well, it's time for me to surrender. Sort of :)
Raymond pulls out all the stops in his sixth version by painting a big bullseye on his biggest remaining source of slowness, which is operator new. He turns in an excellent result here. On my benchmark machine I see the number drop from 124ms to 62ms -- a full 2x faster from start to finish. And observing the footnote on my previous message, the runtime for his application is now comparable to the CLR's startup overhead... I can't beat this time.
Let's look at the results table now to see how we ended up:
Version | Execution Time (seconds) |
Unmanaged v1 | 1.328 |
Unmanaged v2 | 0.828 |
Unmanaged v3 | 0.343 |
Unmanaged v4 | 0.187 |
Unmanaged v5 With Bug | 0.296 |
Unmanaged v5 Corrected | 0.124 |
Unoptimized Managed port of v1 | 0.124 |
Optimized Managed port of v1 | 0.093 |
Unmanaged v6 | 0.062 |
Six versions and quite a bit of work later, we've been soundly trumped. But before I discuss that, let me put up the internal profile of Raymond's version 6.
I've applied my usual filters to the call tree (nothing lower than 5% inclusive) and I also pruned out a couple of functions below HeapAlloc because they have long names and are boring :)
Function Name (Sanitized) | ExclusivePercent | InclusivePercent |
_mainCRTStartup | 0 | 97.826 |
_main | 0 | 97.826 |
Dictionary::Dictionary(void) | 5.435 | 96.739 |
MultiByteToWideChar | 19.565 | 25 |
GetCPHashNode | 5.435 | 5.435 |
operator new(unsigned int) | 1.087 | 16.304 |
.. | 0 | 14.13 |
.. | 0 | 14.13 |
AllocateHeap | 4.348 | 13.043 |
_free | 0 | 8.696 |
FreeHeap | 2.174 | 8.696 |
DictionaryEntry::Parse(...) | 1.087 | 33.696 |
StringPool::AllocString(...) | 2.174 | 27.174 |
_lstrcpynW | 19.565 | 25 |
__SEH_prolog | 5.435 | 5.435 |
You can see that the memory allocation time is way down as a percentage, and of course that's a smaller percentage of a smaller total time. I think he gets a lot of raw speed from his improved locality thanks to that new allocator as well. Interestingly, SEH overhead is up to a significant level in this run (now over 5% for the first time). Still nothing to be worried about.
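To make the locality point concrete, here is a minimal sketch of a pooled string allocator. It is not Raymond's code: the class name SimpleStringPool, the chunk size, and the ChunkCount helper are all made up for illustration, and only the AllocString name is borrowed from the profile above. The idea is to grab memory in big chunks and hand out strings back to back, so there is one heap call per chunk instead of one per string, and strings that are parsed together end up next to each other in memory.

```cpp
#include <cstddef>
#include <cwchar>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical sketch of a string pool (an assumption for illustration,
// not the allocator used in Raymond's version 6).
class SimpleStringPool {
public:
    // chunkChars is the number of wchar_t slots per chunk (arbitrary default).
    explicit SimpleStringPool(std::size_t chunkChars = 32 * 1024)
        : chunkChars_(chunkChars) {}

    // Copies [begin, end) into the pool, appends a terminator, and returns
    // a pointer that stays valid until the pool itself is destroyed.
    const wchar_t* AllocString(const wchar_t* begin, const wchar_t* end) {
        const std::size_t needed = static_cast<std::size_t>(end - begin) + 1;
        if (needed > capacity_ - used_) {
            // Current chunk is full (or there is none yet): start a new one.
            // An oversized string gets a chunk exactly big enough for it.
            capacity_ = needed > chunkChars_ ? needed : chunkChars_;
            std::unique_ptr<wchar_t[]> chunk(new wchar_t[capacity_]);
            chunks_.push_back(std::move(chunk));
            used_ = 0;
        }
        // Bump allocation: the new string goes right after the previous one.
        wchar_t* dest = chunks_.back().get() + used_;
        std::wmemcpy(dest, begin, needed - 1);
        dest[needed - 1] = L'\0';
        used_ += needed;
        return dest;
    }

    // Number of chunks requested from the heap so far.
    std::size_t ChunkCount() const { return chunks_.size(); }

private:
    std::size_t chunkChars_;
    std::size_t capacity_ = 0;  // size of the current chunk, in wchar_t
    std::size_t used_ = 0;      // slots already handed out in that chunk
    std::vector<std::unique_ptr<wchar_t[]>> chunks_;
};
```

In this sketch individual strings are never freed; everything goes away when the pool does. That trade is what makes the allocation path so cheap, and it suits a dictionary that is loaded once and then only read.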
So am I ashamed by my crushing defeat? Hardly. The managed code got a very good result for hardly any effort. To defeat the managed version, Raymond had to:
- Write his own file I/O stuff
- Write his own string class
- Write his own allocator
- Write his own international mapping
Of course he used available lower level libraries to do this, but that's still a lot of work. Can you call what's left an STL program? I don't think so. I think he kept the std::vector class, which ultimately was never a problem, and he kept the find function. Pretty much everything else is gone.
So, yup, you can definitely beat the CLR. I think Raymond can make his program go even faster.
Interestingly, the time to parse the file as reported by both programs' internal timers is about the same -- 30ms for each. The difference is in the overhead.
Tomorrow I'm going to talk about the space used by these programs and that will wrap it up. Though I think Raymond is going to go on and do some actual UI and so forth with this series. That should be fun to watch.