Performance Quiz #6 — Looking at the first cut


Yesterday Raymond posted his initial version and I posted Performance Quiz #6. Today he posts the first optimization.  Quoting him:



Upon profiling our dictionary-loader, I discovered that 80% of the CPU time was spent in getline. Clearly this is where the focus needs to be. Everything else is just noise.


Digging a little deeper, it turns out that 29% of the CPU time was spent by getline doing character set conversion in codecvt::do_in. Some debugging revealed that codecvt::do_in was being called millions of times, each time converting just one or two characters. In fact, for each character in the file, codecvt::do_in was called once and sometimes twice!


And supporting this, my own measurements indicate this:














































Function Name   Exclusive
Percent
  Inclusive
Percent
_mainCRTStartup 0 98.569
_main 0 98.569
Dictionary::Dictionary(void) 0.11 96.918
std::getline<unsigned short,struct std::char_traits… etc… 0.055 91.635
std::getline<unsigned short,struct std::char_traits… etc… but different etc. 🙂 0.66 89.818
std::basic_filebuf<unsigned short,struct std::char_traits<unsigned short> >::uflow(void) 3.412 65.327
std::basic_filebuf<unsigned short,struct std::char_traits<unsigned short> >::underflow(void) 0.33 50.138
std::basic_streambuf<unsigned short,struct std::char_traits<unsigned short> >::snextc(void) 0.33 32.801
std::codecvt<unsigned short,char,int>::do_in(….) 2.862 28.674
__Mbrtowc 1.816 26.417


Exclusive time indicates the amount of time in the function itself.  Inclusive indicates the amount in the function plus the children of that function.


These measurements were taken with the Visual Studio Enterprise Performance Tool  in sampling mode (one sample every 1,000,000 cycles).


Interestingly if you look at where the Exclusive time is (i.e the specific functions taking up the time in the tree) you see something surprising.






























Function Name (sanitized)  Exclusive
Percent
 Inclusive
Percent
EnterCriticalSection 14.474 14.474
LeaveCriticalSection 12.053 12.053
InterlockedIncrement 6.879 6.879
InterlockedDecrement 5.834 5.834
GetCPHashNode 5.394 5.394
MultiByteToWideChar 4.293 9.961


Look at that!  The top 4 functions, 39.24% of the time, and I can add those because they are all leaf functions (Exclusive = Inclusive), have to do with synchronization operations of one kind or another.  Even the interlocked operations have some sort of synchronization intent because of course if they weren’t trying to synchronize across threads they wouldn’t need to be interlocked.  And the best part is this app is single threaded!  Yumm 🙂


MultiByteToWideChar which is arguably part of What We Actually Have To Do (TM) is a mere 9.961% inclusive.  Not good.


Naturally the bulk of this is in the line reading as shown by the inclusive time.


But my question remains unanswered: How do you think the managed system will fare?

Comments (12)

  1. G. Homedog says:

    I dont know, how did it fare?

  2. AdiOltean says:

    EnterCriticalSection 14.474 14.474

    LeaveCriticalSection 12.053 12.053

    Hmmm… don’t we have an optimization in EnterCriticalSection such that if there is no lock contention, then we won’t perform a kernel mode transition? I wonder what makes these routines so slow in that case.

  3. Gia says:

    Come on man, stop teasing people. Without actually running it of course nobody can "think" how it would fare, and what unexpected bottleneck will be hit. If you want to hear what "inner voice" tells me, then it says that if there are no bottlenecks in both managed and unmanaged versions, then unmanaged version will be faster because of quick object allocation.

  4. Gia says:

    Oops, I meant to say MANAGED, not unmanaged 🙂

  5. I’m wondering what piece of code thinks it needs synchronization… Unless that’s the heap critical section that we’re seeing here…

    I wonder what would happen if he switched the process heap to be nosynchronize…

  6. The more that I think about it, the more that I think this is the heap critical section.

    I find it astonishing that the heap doesn’t bubble up as a huge time sink given the amount of reallocation that’s going on here, I’m wondering if the heap calls are showing as ECS/LCS hits (since this is a sampling profiler it’s possible that the ECS/LCS/II/ID calls are all from a single function)…

  7. stefang says:

    Several interesting things here.

    1) Comparing the C# and C++ implementations

    When running raymonds original code on my machine, the total time is 840 ms

    When I run your C# version, the total time is 106 ms

    So, the managed version is much faster.

    2) Analyzing the reason for the slowness of the C++ version

    Profiling with Rational Quantify gives the following inclusive percentages for the lines in the inner loop:

    Incl %

    72,65 while (getline(src, s)) {

    0,02 if (s.length() > 0 && s[0] != L’#’) {

    0,02 DictionaryEntry de;

    5,41 if (de.Parse(s)) {

    21,68 v.push_back(de);

    }

    So, getline() uses 73% of the time, and push_back() uses 22%

    Further analysis of the reason for the slowness of getline() shows that the main problem is that it is processing the characters one by one.

    It is even calling MultiByteToWideChar once for each character.

    push_back spends 79% of its time in HeapAlloc and HeapFree.

    3) Why is the managed version so much faster ?

    Three big reasons:

    A) The StreamReader is very well written. It performs all its MultiByteToWideChar processing in big blocks.

    B) The managed memory allocator is very very fast.

    C) The C++ code causes a lot of reallocation when appending elements to the vector without reserving space

    4) Why do you see a lot of calls to thread synchronization functions in your measurements ?

    I see no such calls at all.

    I have compiled my code with Visual Studio 2003, using the single threaded runtime library and full optimization

    (/O2 /ML)

    I suspect that you are using a multi-threaded library, and perhaps you are using whidbey ?

    5) What can be done to the C++-version to make it faster ?

    * Replace the calls to getline(wifstream) with manual use uf fgets and MultiByteToWideChar

    * Use vector::reserve to avoid reallocations when using push_back

    Like this:

    Dictionary::Dictionary()

    {

    FILE *src = fopen("C:\dload\cedict\cedict.b5", "rt");

    char buf[1000];

    wchar_t wbuf[1000];

    v.reserve(30000);

    while (fgets(buf, 1000, src) != NULL) {

    int len = (int)strlen(buf);

    // Remove any terminating LF

    if (len > 0 && buf[len-1] == ‘n’)

    len–;

    int wlen = MultiByteToWideChar(950, MB_PRECOMPOSED, buf, len, wbuf, 1000);

    wstring s(wbuf, wlen);

    if (s.length() > 0 && s[0] != L’#’) {

    DictionaryEntry de;

    if (de.Parse(s)) {

    v.push_back(de);

    }

    }

    }

    fclose(src);

    }

    The code above runs in 170 ms. The biggest problem now is all the allocations that are being made in Parse()

    It is interesting to note that even with a fairly complicated and optimized C++ implementation, it is still almost 60% slower than

    the unoptimized C#-version.

    Having a really fast allocator really makes wonders for performance.

    Anyway a startuptime of 170 ms seems good enough.

  8. Robert Kozak says:

    "But my question remains unanswered: How do you think the managed system will fare?"

    Rico, If there is one thing I learned from reading your blog is that it is a waste of time to guess. You have to test, test, test! <g>

    — Robert

  9. Stefang jumped into the fray with his analysis in the comments from my last posting.&amp;nbsp; Thank you…

  10. So I was reading through one of my favorite MSDN blogs (http://blogs.msdn.com/oldnewthing/)

    And he…

  11. Stefang jumped into the fray with his analysis in the comments from my last posting . Thank you Stefang.

  12. Raymond Chen is running a series of articles about how to build and optimize the startup time of a Chinese/English