Performance Quiz #6 -- The fifth cut goes foul!

Well, today Raymond hits a snag in version 5 of the program.  I started by profiling the code as provided (in fact, I had the .exe from Raymond himself, so I didn't even have to build it).  The new version is actually slower (start to finish) than the previous version.  Why?  Well, there's a little bug in there.

 struct DictionaryEntry
 {
     bool Parse(const WCHAR* begin, const WCHAR* end);
     void Destruct() {
         delete[] m_pszTrad;
         delete[] m_pszSimp;    // this throws an exception
         delete[] m_pszPinyin;
         delete[] m_pszEnglish;
     }
     LPWSTR m_pszTrad;
     LPWSTR m_pszSimp;          // this field is never initialized
     LPWSTR m_pszPinyin;
     LPWSTR m_pszEnglish;
 };
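The fix itself is trivial.  Here's a minimal sketch of one possible correction (Init is a hypothetical helper in the spirit of the existing Destruct, not necessarily the change Raymond would make): null out the fields before Parse runs, since delete[] on a null pointer is a well-defined no-op.

     // Hypothetical counterpart to Destruct(): zero every field so the
     // delete[] calls in Destruct are safe even for a field that Parse
     // never assigns (delete[] NULL does nothing).
     void Init() {
         m_pszTrad = m_pszSimp = m_pszPinyin = m_pszEnglish = NULL;
     }

Called once before Parse, something like this makes the teardown path safe whether or not the simplified-Chinese field ever received a string.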

Remember, the timings I'm doing include the entire run of the program -- start to finish -- and not just the dictionary creation time.  That's a little different from what Raymond is measuring.  That choice actually gives me quite a handicap because my startup time includes all of the code necessary to get the CLR going, which is not insignificant (see my previous posting).  This particular bug causes an exception to be raised during the teardown of the dictionary, which in turn slows things down by far more than the new code saves.

Why bring this up at all?  I mean, it's just a minor mistake that's easily corrected, right?  That's true, but the point is that the code necessary to reach this performance level is a good bit trickier.  When considering development costs you need to think very carefully about what sort of code you will have to write to get the performance that you need.  Even great developers make mistakes, so if you imagine this code as part of some bigger project, all similarly written, you can easily see developers making unfortunate mistakes when working in that code base.  That makes people less likely to want to do performance work (or any work), and so the product suffers.  Many people have already commented that this code has lost most of its object-oriented nature; I think this little bug is a symptom of that fact.  I don't want to conclude too much, but I think it's fair comment that the code is getting harder to maintain.

The upshot of all this is that I can't give Raymond full credit for an improvement here; it's sort of a foul tip.  But luckily the problem is easily corrected, so I fixed it in my copy of the source and rebuilt.  Unfortunately that means I can't be sure that I built it in exactly the same way as Raymond did, so it's possible that there are other differences in my corrected version 5 number.  But, with the correction, version 5 is now 1.5 times faster than version 4.  It's neck-and-neck with the managed code, at 124ms overall time.

Version                             Execution Time (seconds)
Unmanaged v1                        1.328
Unmanaged v2                        0.828
Unmanaged v3                        0.343
Unmanaged v4                        0.187
Unmanaged v5 with bug               0.296
Unmanaged v5 corrected              0.124
Unoptimized managed port of v1      0.124

And here again are the function costs broken down in tree view, only showing functions with an inclusive cost of at least 5% to keep the size of the report manageable.

Name (Sanitized)                                       Exclusive %  Inclusive %
  _mainCRTStartup                                          0.000       94.606
    _main                                                  0.000       94.191
      Dictionary::Dictionary(void)                         2.905       67.635
        MultiByteToWideChar                                7.884        9.959
        DictionaryEntry::Parse(...)                        2.075       42.324
          AllocString(wchar_t const *,wchar_t const *)     0.415       27.386
            _lstrcpynW                                     7.054        7.469
            operator new(unsigned int)                     0.000       19.502
              _malloc                                      0.415       19.502
                RtlAllocateHeap                            6.639       19.087
          operator new(unsigned int)                       0.000        9.544
            _malloc                                        0.000        9.544
              RtlAllocateHeap                              2.905        9.544
        operator new(unsigned int)                         0.415        6.224
          _malloc                                          0.000        5.809
            RtlAllocateHeap                                1.660        5.809
      Dictionary::~Dictionary(void)                        0.830       26.556
        _free                                              1.245       25.726
          RtlFreeHeap                                      6.224       22.822
            EnterCriticalSection                           6.224        6.224

So we can see some good things have happened.  The string copying cost is way down, at only about 7.5%.  The vector handling is totally gone from this run (actually, it just fell below my 5% cutoff -- it's down to 2.9% at this point).  The new winner is operator new, which comes in at a total of about 35% of the execution time spread over three different places.  Also increasingly important is the Dictionary teardown time: it's at 26.5% now (up from 18.84%).

I've often said that it's nearly impossible to use "malloc" just as it comes.  Applications that need high performance invariably end up wrapping it, creating some kind of custom allocator that is better suited to their needs.  Looks like that's what we'll have to do next.
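To make the idea concrete, here's a minimal sketch of the kind of wrapper I mean, under the assumption (true for this dictionary) that all the strings are allocated up front and freed together at shutdown.  Everything here -- the class name, the block size, the Alloc method -- is hypothetical, not code from Raymond's program:

 #include <cstddef>
 #include <vector>

 // A bump allocator: carve small string allocations out of big blocks,
 // then release everything with a handful of delete[] calls.  This avoids
 // both the per-string cost we see under operator new and the per-string
 // free/EnterCriticalSection cost in the Dictionary teardown.
 class StringArena
 {
 public:
     StringArena() : m_pNext(0), m_cchLeft(0) {}

     ~StringArena()
     {
         for (size_t i = 0; i < m_blocks.size(); ++i)
             delete[] m_blocks[i];  // one delete[] per block, not per string
     }

     wchar_t* Alloc(size_t cch)
     {
         if (cch > m_cchLeft)       // current block exhausted; grab another
         {
             m_cchLeft = (cch > kBlockChars) ? cch : kBlockChars;
             m_pNext = new wchar_t[m_cchLeft];
             m_blocks.push_back(m_pNext);
         }
         wchar_t* p = m_pNext;      // bump the pointer; no locks, no headers
         m_pNext += cch;
         m_cchLeft -= cch;
         return p;
     }

 private:
     static const size_t kBlockChars = 32768;
     std::vector<wchar_t*> m_blocks;
     wchar_t* m_pNext;
     size_t m_cchLeft;
 };

With something along these lines, AllocString would call Alloc(end - begin + 1) and copy into the result, and DictionaryEntry::Destruct could disappear entirely -- the arena's destructor does all the freeing in one swoop.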

Hmmm... shall I post the optimized version of the managed code now or wait until the unmanaged code is soundly beating me?  Decisions decisions...