Performance Quiz #6 -- Looking at the first cut

05/11/2005

Yesterday Raymond posted his initial version and I posted Performance Quiz #6. Today he posts the first optimization. Quoting him:

Upon profiling our dictionary-loader, I discovered that 80% of the CPU time was spent in getline. Clearly this is where the focus needs to be. Everything else is just noise.

Digging a little deeper, it turns out that 29% of the CPU time was spent by getline doing character set conversion in codecvt::do_in. Some debugging revealed that codecvt::do_in was being called millions of times, each time converting just one or two characters. In fact, for each character in the file, codecvt::do_in was called once and sometimes twice!

My own measurements support this:

Function Name	ExclusivePercent	InclusivePercent
_mainCRTStartup	0	98.569
_main	0	98.569
Dictionary::Dictionary(void)	0.11	96.918
std::getline<unsigned short,struct std::char_traits... etc...	0.055	91.635
std::getline<unsigned short,struct std::char_traits... etc... but different etc. :)	0.66	89.818
std::basic_filebuf<unsigned short,struct std::char_traits<unsigned short> >::uflow(void)	3.412	65.327
std::basic_filebuf<unsigned short,struct std::char_traits<unsigned short> >::underflow(void)	0.33	50.138
std::basic_streambuf<unsigned short,struct std::char_traits<unsigned short> >::snextc(void)	0.33	32.801
std::codecvt<unsigned short,char,int>::do_in(....)	2.862	28.674
__Mbrtowc	1.816	26.417

Exclusive time is the amount of time spent in the function itself. Inclusive time is the amount spent in the function plus all the children of that function.
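To make the relationship between the two columns concrete, here is a minimal sketch of how an inclusive figure could be derived from exclusive figures on a call tree. The `CallNode` type and the `inclusive` function are hypothetical names made up for illustration; this is not how the profiler stores or computes its data, only the arithmetic behind the definition.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical call-tree node for illustration only; not the profiler's
// actual data model.
struct CallNode {
    std::string name;
    double exclusive;                // percent of samples in the function's own code
    std::vector<CallNode> children;  // functions called from this one
};

// Inclusive time = the function's own (exclusive) time plus the
// inclusive time of every function it calls.
double inclusive(const CallNode& node) {
    assert(node.exclusive >= 0.0);   // a sample share cannot be negative
    double total = node.exclusive;
    for (const CallNode& child : node.children) {
        total += inclusive(child);
    }
    return total;
}
```

A leaf function has no children, so its inclusive time equals its exclusive time. A function such as `_main`, which does almost nothing itself, shows near-zero exclusive time and a very large inclusive time.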

These measurements were taken with the Visual Studio Enterprise Performance Tool in sampling mode (one sample every 1,000,000 cycles).

Interestingly, if you look at where the Exclusive time is (i.e., the specific functions taking up the time in the tree), you see something surprising.

Function Name (sanitized)	ExclusivePercent	InclusivePercent
EnterCriticalSection	14.474	14.474
LeaveCriticalSection	12.053	12.053
InterlockedIncrement	6.879	6.879
InterlockedDecrement	5.834	5.834
GetCPHashNode	5.394	5.394
MultiByteToWideChar	4.293	9.961

Look at that! The top 4 functions account for 39.24% of the time, and I can add those numbers because they are all leaf functions (Exclusive = Inclusive). All four have to do with synchronization operations of one kind or another. Even the interlocked operations have some sort of synchronization intent, because of course if they weren't trying to synchronize across threads they wouldn't need to be interlocked. And the best part is this app is single-threaded! Yumm :)
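The addition above is easy to check mechanically. The sketch below copies the top four rows from the table and sums their exclusive percentages, first confirming that each row is a leaf so the samples cannot overlap. The `ProfileRow` type and the helper names are hypothetical, introduced only for this check.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// One row of the profiler table; a hypothetical type for illustration.
struct ProfileRow {
    std::string name;
    double exclusive;  // ExclusivePercent
    double inclusive;  // InclusivePercent
};

// A row is a leaf when all of its time is its own (Exclusive = Inclusive).
bool isLeaf(const ProfileRow& row) {
    return std::fabs(row.inclusive - row.exclusive) < 1e-9;
}

// Sum the exclusive percent of the leaf rows only. Leaf samples are
// disjoint, so adding them does not count any sample twice.
double leafTotal(const std::vector<ProfileRow>& rows) {
    double total = 0.0;
    for (const ProfileRow& row : rows) {
        if (isLeaf(row)) {
            total += row.exclusive;
        }
    }
    return total;
}

// The top four rows, copied from the table above.
std::vector<ProfileRow> topSyncRows() {
    return {
        {"EnterCriticalSection", 14.474, 14.474},
        {"LeaveCriticalSection", 12.053, 12.053},
        {"InterlockedIncrement", 6.879, 6.879},
        {"InterlockedDecrement", 5.834, 5.834},
    };
}
```

Calling `leafTotal(topSyncRows())` reproduces the 39.24 figure. `MultiByteToWideChar` (4.293 exclusive, 9.961 inclusive) fails the leaf test, which is why its numbers cannot simply be added to the rest.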

MultiByteToWideChar, which is arguably part of What We Actually Have To Do (TM), is a mere 9.961% inclusive. Not good.

Naturally, the bulk of this is in the line reading, as the inclusive times show.

But my question remains unanswered: How do you think the managed system will fare?
