Performance Quiz #6 — Looking at the second cut


Stefang jumped into the fray with his analysis in the comments from my last posting.  Thank you Stefang.  And it seems, as usual, things aren’t always what they appear 🙂


Let’s continue the story without me giving away the ending just yet though.  How is Raymond is doing? My last posting studied his first effort, that’s the one that I ported for Quiz #6, and today we’ll look at his second effort and see how he did.  Remember we were targetting the time in the getline function.


To get a rough idea if he’s doing well I measured the raw execution time on my own machine, from process start to process exit.


Unmanaged Version 1:    Time of execution 1.328
Unmanaged Version 2:    Time of execution 0.828

So the new version is 1.6 times faster.  Nice!


And, in some detail, this what I see for his version #2.














































Function Name   Exclusive Percent   Inclusive Percent
_main 0 97.695
_mainCRTStartup 0 97.695
Dictionary::Dictionary(void) 0.231 94.744
std::getline<char,struct std::char_traits… etc. 0 61.918
std::getline<char,struct std::char_traits… another etc 0.922 59.382
std::basic_filebuf<char,…>::underflow(void) 0.692 37.575
std::basic_filebuf<char,…> >::uflow(void) 2.259 34.947
_fgetc 3.043 32.826
std::_Fgetc(char &,struct _iobuf *) 0.461 32.688
__lock_file 1.614 21.254


And sure enough the time in getline has dropped to about 62%.   Of course that’s 62% of a smaller number too so it’s even better than it looks.


Interestingly, we’re still paying what looks a hefty toll for using a multi-threaded version of the runtime… that’s what we get by default.  But it would be interesting to see what you guys get if you statically link with a different runtime.














































Function Name
(sanitized)
 Exclusive
Percent
 Inclusive
Percent
RtlLeaveCriticalSection 18.811 18.811
RtlEnterCriticalSection 18.119 18.119
GetCPHashNode 4.61 4.61
InterlockedIncrement 4.38 4.38
InterlockedDecrement 4.149 4.149
MultiByteToWideChar 3.965 8.852
__lock 3.827 19.64
_fgetc 3.043 32.826
__SEH_prolog 2.582 2.582
std::basic_filebuf<char,struct std::char_traits<char> >::uflow(void) 2.259 34.947


And I’m going to tip my hand now without doing the analysis, here’s how the numbers stack up so far:














Version Execution Time
(seconds)
Unmanaged v1 1.328
Unmanaged v2

0.828

Unoptimized Managed Port       0.124


Color me happy 🙂  I get numbers comparable to what Stefang reported.  But the story isn’t over yet!  Raymond’s still got plenty of ammunition.  Let’s see what happens when he brings out the heavy artillery.  And we’ve yet to see what the profile looks like for the managed code.

Comments (5)

  1. Erik Madsen says:

    This is a really interesting demonstration of how to engineer for performance. And to serialize the tale like a detective novel… Brilliant! Keep up the good work.

  2. stefang says:

    The timing for this program depends very much on compiler settings.

    I have tried two variants:

    1) cl /EHsc /MT cpp.cpp (Multi-threaded library, no optimization)

    2) cl /EHsc /MT /O2 cpp.cpp (Multi-threaded library, optimize for speed)

    I get the following times:

    1) 1400 ms

    2) 940 ms

    The main reason for this is that much of the STL code gets inlined and optimized away when using /O2

    I am really curious about your timings. You say that your program spends 62% in getline().

    When I run the two program compiled with the different settings above I get the following getline() percentage on my machine (complied with VS2003)

    1) 30%

    2) 13%

    What could be the possible reason why you get 62% and I never get over 30% ?

    I suspect that you are using VS2005 and that the reason for the performance difference is in differences in the STL library for VS2003 and VS2005.

    Perhaps there is a performance bug in STL for VS2005 ?

    In that case, maybe you could do something about it before VS2005 is released ?

  3. ricom says:

    It’s really hard to say what the root causes might be. I have an advantage you didn’t have — I’m measuring the binaries Raymond built because I asked him to send them to me so measurements would match his results as closely as possible.

    Of course adding /O2 can really change things — perhaps Raymond build with /O1 ? We’ll have to ask him when he gets back. I could try building everything again to see what difference it makes. Building without /MT might be interesting. But it’s a whole different exercise.

    There may also be machine architecture issues. I’m running on a P4 at about 3Ghz with HT enabled for what it’s worth.

  4. stefang says:

    The reason I suspect that there is a problem with the STL implementation is that you show a function called

    std::_Fgetc(char &,struct _iobuf *) with 32% incl.

    No such function is used when I run the program.

    in my runs _fgetc calls filbuf directly, and _fgetc only uses 4% of the runtime.

    Something is really weird here.

  5. So I was reading through one of my favorite MSDN blogs (http://blogs.msdn.com/oldnewthing/)

    And he…