Performance Quiz #6 -- Looking at the first cut
Yesterday Raymond posted his initial version and I posted Performance Quiz #6. Today he posts the first optimization. Quoting him:
Upon profiling our dictionary-loader, I discovered that 80% of the CPU time was spent in
getline
. Clearly this is where the focus needs to be. Everything else is just noise.Digging a little deeper, it turns out that 29% of the CPU time was spent by
getline
doing character set conversion incodecvt::do_in
. Some debugging revealed thatcodecvt::do_in
was being called millions of times, each time converting just one or two characters. In fact, for each character in the file,codecvt::do_in
was called once and sometimes twice!
And supporting this, my own measurements indicate this:
Function Name | ExclusivePercent | InclusivePercent |
_mainCRTStartup | 0 | 98.569 |
_main | 0 | 98.569 |
Dictionary::Dictionary(void) | 0.11 | 96.918 |
std::getline<unsigned short,struct std::char_traits... etc... | 0.055 | 91.635 |
std::getline<unsigned short,struct std::char_traits... etc... but different etc. :) | 0.66 | 89.818 |
std::basic_filebuf<unsigned short,struct std::char_traits<unsigned short> >::uflow(void) | 3.412 | 65.327 |
std::basic_filebuf<unsigned short,struct std::char_traits<unsigned short> >::underflow(void) | 0.33 | 50.138 |
std::basic_streambuf<unsigned short,struct std::char_traits<unsigned short> >::snextc(void) | 0.33 | 32.801 |
std::codecvt<unsigned short,char,int>::do_in(....) | 2.862 | 28.674 |
__Mbrtowc | 1.816 | 26.417 |
Exclusive time indicates the amount of time in the function itself. Inclusive indicates the amount in the function plus the children of that function.
These measurements were taken with the Visual Studio Enterprise Performance Tool in sampling mode (one sample every 1,000,000 cycles).
Interestingly if you look at where the Exclusive time is (i.e the specific functions taking up the time in the tree) you see something surprising.
Function Name (sanitized) | ExclusivePercent | InclusivePercent |
EnterCriticalSection | 14.474 | 14.474 |
LeaveCriticalSection | 12.053 | 12.053 |
InterlockedIncrement | 6.879 | 6.879 |
InterlockedDecrement | 5.834 | 5.834 |
GetCPHashNode | 5.394 | 5.394 |
MultiByteToWideChar | 4.293 | 9.961 |
Look at that! The top 4 functions, 39.24% of the time, and I can add those because they are all leaf functions (Exclusive = Inclusive), have to do with synchronization operations of one kind or another. Even the interlocked operations have some sort of synchronization intent because of course if they weren't trying to synchronize across threads they wouldn't need to be interlocked. And the best part is this app is single threaded! Yumm :)
MultiByteToWideChar which is arguably part of What We Actually Have To Do (TM) is a mere 9.961% inclusive. Not good.
Naturally the bulk of this is in the line reading as shown by the inclusive time.
But my question remains unanswered: How do you think the managed system will fare?