Performance Quiz #6 -- The fifth cut goes foul!

Well, today Raymond hits a snag in version 5 of the program.  I started by profiling the code as provided (in fact, I had the .exe from Raymond himself, so I didn't even have to build it).  The new version is actually slower (start to finish) than the previous version.  Why?  Well, there's a little bug in there.

 struct DictionaryEntry
 {
     bool Parse(const WCHAR* begin, const WCHAR* end);
     void Destruct() {
         delete[] m_pszTrad;
         delete[] m_pszSimp;    // this throws an exception
         delete[] m_pszPinyin;
         delete[] m_pszEnglish;
     }
     LPWSTR m_pszTrad;
     LPWSTR m_pszSimp;          // this field is never initialized
     LPWSTR m_pszPinyin;
     LPWSTR m_pszEnglish;
 };
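The fix itself is trivial.  Here's a minimal sketch of one possible correction (Init is a hypothetical helper in the spirit of the existing Destruct, not necessarily the change Raymond would make): null out the fields before Parse runs, since delete[] on a null pointer is a well-defined no-op.

     // Hypothetical counterpart to Destruct(): zero every field so the
     // delete[] calls in Destruct are safe even for a field that Parse
     // never assigns (delete[] NULL does nothing).
     void Init() {
         m_pszTrad = m_pszSimp = m_pszPinyin = m_pszEnglish = NULL;
     }

Called once before Parse, something like this makes the teardown path safe whether or not the simplified-Chinese field ever received a string.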

Remember, the timings I'm doing include the entire run of the program -- start to finish -- and not just the dictionary creation time.  That's a little different from what Raymond is measuring.  That choice actually gives me quite a handicap because my startup time includes all of the code necessary to get the CLR going, which is not insignificant (see my previous posting).  This particular bug causes an exception to be raised during the teardown of the dictionary, which in turn slows things down by far more than the new code saves.

Why bring this up at all?  I mean, it's just a minor mistake that's easily corrected, right?  That's true, but the point is that the code necessary to reach this performance level is a good bit trickier.  When considering development costs you need to think very carefully about what sort of code you will have to write to get the performance that you need.  Even great developers make mistakes, so if you imagine this code as part of some bigger project, all similarly written, you can easily see developers making unfortunate mistakes when working in that code base.  That makes people less likely to want to do performance work (or any work), and so the product suffers.  Many people have already commented that this code has lost most of its object-oriented nature; I think this little bug is a symptom of that fact.  I don't want to conclude too much, but I think it's fair comment that the code is getting harder to maintain.

The upshot of all this is that I can't give Raymond full credit for an improvement here; it's sort of a foul tip.  But luckily the problem is easily corrected, so I fixed it in my copy of the source and rebuilt.  Unfortunately that means I can't be sure that I built it in exactly the same way as Raymond did, so it's possible that there are other differences in my corrected version 5 number.  But, with the correction, version 5 is now 1.5 times faster than version 4.  It's neck-and-neck with the managed code, at 124ms overall time.

Version                             Execution Time (seconds)
Unmanaged v1                        1.328
Unmanaged v2                        0.828
Unmanaged v3                        0.343
Unmanaged v4                        0.187
Unmanaged v5 with bug               0.296
Unmanaged v5 corrected              0.124
Unoptimized managed port of v1      0.124

And here again are the function costs broken down in tree view, only showing functions with an inclusive cost of at least 5% to keep the size of the report manageable.

Name (Sanitized)                                       Exclusive %  Inclusive %
  _mainCRTStartup                                          0.000       94.606
    _main                                                  0.000       94.191
      Dictionary::Dictionary(void)                         2.905       67.635
        MultiByteToWideChar                                7.884        9.959
        DictionaryEntry::Parse(...)                        2.075       42.324
          AllocString(wchar_t const *,wchar_t const *)     0.415       27.386
            _lstrcpynW                                     7.054        7.469
            operator new(unsigned int)                     0.000       19.502
              _malloc                                      0.415       19.502
                RtlAllocateHeap                            6.639       19.087
          operator new(unsigned int)                       0.000        9.544
            _malloc                                        0.000        9.544
              RtlAllocateHeap                              2.905        9.544
        operator new(unsigned int)                         0.415        6.224
          _malloc                                          0.000        5.809
            RtlAllocateHeap                                1.660        5.809
      Dictionary::~Dictionary(void)                        0.830       26.556
        _free                                              1.245       25.726
          RtlFreeHeap                                      6.224       22.822
            EnterCriticalSection                           6.224        6.224

So we can see some good things have happened.  The string copying cost is way down, at only about 7.5%.  The vector handling is totally gone from this run (actually, it just fell below my 5% cutoff -- it's down to 2.9% at this point).  The new winner is operator new, which comes in at a total of about 35% of the execution time spread over three different places.  Also increasingly important is the Dictionary teardown time: it's at 26.5% now (up from 18.84%).

I've often said that it's nearly impossible to use "malloc" just as it comes.  Applications that need high performance invariably end up wrapping it, creating some kind of custom allocator that is better suited to their needs.  Looks like that's what we'll have to do next.
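To make the idea concrete, here's a minimal sketch of the kind of wrapper I mean, under the assumption (true for this dictionary) that all the strings are allocated up front and freed together at shutdown.  Everything here -- the class name, the block size, the Alloc method -- is hypothetical, not code from Raymond's program:

 #include <cstddef>
 #include <vector>

 // A bump allocator: carve small string allocations out of big blocks,
 // then release everything with a handful of delete[] calls.  This avoids
 // both the per-string cost we see under operator new and the per-string
 // free/EnterCriticalSection cost in the Dictionary teardown.
 class StringArena
 {
 public:
     StringArena() : m_pNext(0), m_cchLeft(0) {}

     ~StringArena()
     {
         for (size_t i = 0; i < m_blocks.size(); ++i)
             delete[] m_blocks[i];  // one delete[] per block, not per string
     }

     wchar_t* Alloc(size_t cch)
     {
         if (cch > m_cchLeft)       // current block exhausted; grab another
         {
             m_cchLeft = (cch > kBlockChars) ? cch : kBlockChars;
             m_pNext = new wchar_t[m_cchLeft];
             m_blocks.push_back(m_pNext);
         }
         wchar_t* p = m_pNext;      // bump the pointer; no locks, no headers
         m_pNext += cch;
         m_cchLeft -= cch;
         return p;
     }

 private:
     static const size_t kBlockChars = 32768;
     std::vector<wchar_t*> m_blocks;
     wchar_t* m_pNext;
     size_t m_cchLeft;
 };

With something along these lines, AllocString would call Alloc(end - begin + 1) and copy into the result, and DictionaryEntry::Destruct could disappear entirely -- the arena's destructor does all the freeing in one swoop.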

Hmmm... shall I post the optimized version of the managed code now or wait until the unmanaged code is soundly beating me?  Decisions decisions...