Performance Quiz #6 -- Looking at the third cut

The fun continues as today we look at Raymond's third improvement.

Raymond starts using some pretty hefty Win32 magic with a mapped file view to eliminate once-and-for-all any costs associated with the getline function.  And good news, his new version is 2.41 times faster than the previous version.  Myself I don't think I would have gone for the mapped file.  I think I would have just used regular I/O in bigger chunks and done my own line splitting at this point.  I didn't ask Raymond but I bet he did it this way at least partly because he's comfortable with MapViewOfFile.

Version Execution Time(seconds)
Unmanaged v1 1.328
Unmanaged v2 0.828
Unmanaged v3 0.343

Unoptimized Managed Port      

0.124

Let's drill into the costs a little bit more, this time I'll show them in tree view.  This is all the function with an inclusive cost of at least 5%.  That trims the tree nicely and yet still shows the meat.

Sanitized Function Name  ExclusivePercent  InclusivePercent
  _mainCRTStartup 0.00 96.10
    _main 0.00 96.10
      Dictionary::Dictionary(void) 0.12 87.94
        DictionaryEntry::Parse(...) 0.24 10.96
          std::basic_string< ... >::assign(...) 0.12 7.07
            std::basic_string<... >::_Grow(unsigned int,bool) 0.12 6.70
              std::basic_string<... >::_Copy(unsigned int) 0.24 6.46
                operator new(unsigned int) 0.00 6.09
                  .. 0.00 6.09
                    .. 0.00 6.09
                      AllocateHeap 1.83 5.97
        std::codecvt<...>::do_in(...) 3.41 59.44
          __Mbrtowc 3.41 56.03
            InterlockedDecrement 13.03 13.03
            InterlockedIncrement 9.74 9.74
            Mbrtowc_lk 3.41 29.84
              MultiByteToWideChar 11.69 26.43
                GetCPHashNode 14.37 14.37
        std::vector<DictionaryEntry,...>::insert(..etc1..) 0.12 7.43
          std::vector<DictionaryEntry,...>::insert(..etc2..) 0.12 7.19
      std::vector<DictionaryEntry,...>::~vector<…. >(void) 0.00 7.92
        std::vector<DictionaryEntry,... >::_Destroy(...) 0.00 7.92
          DictionaryEntry::~DictionaryEntry(void) 0.24 7.92
            std::basic_string<...>(void) 0.24 7.19
              std::basic_string< … >::_Tidy(bool) 0.37 6.94
                _free 0.00 6.58
                  FreeHeap 1.34 6.58

Looking at that tree we can make some interesting observations.  First it's pretty clear that do_in is what we should target. There's a lot of work going on there, probably more than is necessary.  Goodness knows why InterlockedIncrement and InterlockedDecrement are being called by _Mbrtowc -- and that's adding up to something.  Anyway you slice it going directly to MultiByteToWideChar which is what's doing the actual work is seeming like a really good idea.   I bet that's where Raymond goes next.

Also worth noting is that we're measuing the teardown cost of the dictionary and that is adding 7.92% to the overall compute time almost all of which is freeing the memory.  There's a good lesson there -- many application try very hard to free their memory before they exit but I find this to generally be a total waste of time.  The OS will give that memory back just fine -- the last thing you should do is force it all to be loaded back into the processe's working set and then the processors cache.  It might have been swapped out!   In this case it's not much anyway though.

We're working pretty hard and still almost 3x slower than the managed code.  Stay tuned as  Quiz #6 continues to develop :)