Performance Quiz #6 -- Looking at the first cut

Yesterday Raymond posted his initial version and I posted Performance Quiz #6. Today he posts the first optimization.  Quoting him:

Upon profiling our dictionary-loader, I discovered that 80% of the CPU time was spent in getline. Clearly this is where the focus needs to be. Everything else is just noise.

Digging a little deeper, it turns out that 29% of the CPU time was spent by getline doing character set conversion in codecvt::do_in. Some debugging revealed that codecvt::do_in was being called millions of times, each time converting just one or two characters. In fact, for each character in the file, codecvt::do_in was called once and sometimes twice!

And supporting this, my own measurements indicate this:

Function Name   ExclusivePercent   InclusivePercent
_mainCRTStartup 0 98.569
_main 0 98.569
Dictionary::Dictionary(void) 0.11 96.918
std::getline<unsigned short,struct std::char_traits... etc... 0.055 91.635
std::getline<unsigned short,struct std::char_traits... etc... but different etc. :) 0.66 89.818
std::basic_filebuf<unsigned short,struct std::char_traits<unsigned short> >::uflow(void) 3.412 65.327
std::basic_filebuf<unsigned short,struct std::char_traits<unsigned short> >::underflow(void) 0.33 50.138
std::basic_streambuf<unsigned short,struct std::char_traits<unsigned short> >::snextc(void) 0.33 32.801
std::codecvt<unsigned short,char,int>::do_in(....) 2.862 28.674
__Mbrtowc 1.816 26.417

Exclusive time indicates the amount of time in the function itself.  Inclusive indicates the amount in the function plus the children of that function.

These measurements were taken with the Visual Studio Enterprise Performance Tool  in sampling mode (one sample every 1,000,000 cycles).

Interestingly if you look at where the Exclusive time is (i.e the specific functions taking up the time in the tree) you see something surprising.

Function Name (sanitized)  ExclusivePercent  InclusivePercent
EnterCriticalSection 14.474 14.474
LeaveCriticalSection 12.053 12.053
InterlockedIncrement 6.879 6.879
InterlockedDecrement 5.834 5.834
GetCPHashNode 5.394 5.394
MultiByteToWideChar 4.293 9.961

Look at that!  The top 4 functions, 39.24% of the time, and I can add those because they are all leaf functions (Exclusive = Inclusive), have to do with synchronization operations of one kind or another.  Even the interlocked operations have some sort of synchronization intent because of course if they weren't trying to synchronize across threads they wouldn't need to be interlocked.  And the best part is this app is single threaded!  Yumm :)

MultiByteToWideChar which is arguably part of What We Actually Have To Do (TM) is a mere 9.961% inclusive.  Not good.

Naturally the bulk of this is in the line reading as shown by the inclusive time.

But my question remains unanswered: How do you think the managed system will fare?