Performance Quiz #6 — Looking at the third cut


The fun continues as today we look at Raymond’s third improvement.


Raymond starts using some pretty hefty Win32 magic, a mapped file view, to eliminate once and for all any costs associated with the getline function.  And the good news is that his new version is 2.41 times faster than the previous version.  Myself, I don’t think I would have gone for the mapped file; I think I would have just used regular I/O in bigger chunks and done my own line splitting at this point.  I didn’t ask Raymond, but I bet he did it this way at least partly because he’s comfortable with MapViewOfFile.
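
I haven’t reproduced Raymond’s actual code here, but the mapping technique boils down to something like this sketch (my own names, narrow-character path, and abbreviated error handling; treat it as an illustration, not his implementation):

    #include <windows.h>

    // Map an entire file read-only and return a pointer to its bytes.
    // The caller owns the file handle, the mapping handle, and the view.
    const char* MapWholeFile(const char* path, DWORD* pcb,
                             HANDLE* phFile, HANDLE* phMap)
    {
        *phFile = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (*phFile == INVALID_HANDLE_VALUE) return NULL;

        *pcb = GetFileSize(*phFile, NULL);

        *phMap = CreateFileMapping(*phFile, NULL, PAGE_READONLY, 0, 0, NULL);
        if (*phMap == NULL) { CloseHandle(*phFile); return NULL; }

        // The returned pointer addresses the file's bytes in place, so
        // lines can be found by scanning for newlines directly -- no
        // getline, no stream buffering, no per-line copies.
        return (const char*)MapViewOfFile(*phMap, FILE_MAP_READ, 0, 0, 0);
    }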


Version                     Execution Time (seconds)
Unmanaged v1                1.328
Unmanaged v2                0.828
Unmanaged v3                0.343
Unoptimized Managed Port    0.124


Let’s drill into the costs a little bit more; this time I’ll show them in tree view.  These are all the functions with an inclusive cost of at least 5%.  That trims the tree nicely and yet still shows the meat.


Sanitized Function Name                                       Exclusive %   Inclusive %
  _mainCRTStartup 0.00 96.10
    _main 0.00 96.10
      Dictionary::Dictionary(void) 0.12 87.94
        DictionaryEntry::Parse(…) 0.24 10.96
          std::basic_string< … >::assign(…) 0.12 7.07
            std::basic_string<… >::_Grow(unsigned int,bool) 0.12 6.70
              std::basic_string<… >::_Copy(unsigned int) 0.24 6.46
                operator new(unsigned int) 0.00 6.09
                  .. 0.00 6.09
                    .. 0.00 6.09
                      AllocateHeap 1.83 5.97
        std::codecvt<…>::do_in(…) 3.41 59.44
          __Mbrtowc 3.41 56.03
            InterlockedDecrement 13.03 13.03
            InterlockedIncrement 9.74 9.74
            Mbrtowc_lk 3.41 29.84
              MultiByteToWideChar 11.69 26.43
                GetCPHashNode 14.37 14.37
        std::vector<DictionaryEntry,…>::insert(..etc1..) 0.12 7.43
          std::vector<DictionaryEntry,…>::insert(..etc2..) 0.12 7.19
      std::vector<DictionaryEntry,…>::~vector<…. >(void) 0.00 7.92
        std::vector<DictionaryEntry,… >::_Destroy(…) 0.00 7.92
          DictionaryEntry::~DictionaryEntry(void) 0.24 7.92
            std::basic_string<…>::~basic_string<…>(void) 0.24 7.19
              std::basic_string< … >::_Tidy(bool) 0.37 6.94
                _free 0.00 6.58
                  FreeHeap 1.34 6.58


Looking at that tree we can make some interesting observations.  First, it’s pretty clear that do_in is what we should target: there’s a lot of work going on there, probably more than is necessary.  Goodness knows why InterlockedIncrement and InterlockedDecrement are being called by __Mbrtowc, and it’s adding up to something.  Any way you slice it, going directly to MultiByteToWideChar, which is what’s doing the actual work, seems like a really good idea.  I bet that’s where Raymond goes next.
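
For reference, the direct call looks something like the sketch below.  The code page is my assumption (950, i.e. Big5, to match the dictionary file); whether Raymond actually does it this way remains to be seen:

    #include <windows.h>

    // Convert one line of the dictionary in a single call, bypassing the
    // codecvt/__Mbrtowc machinery and its interlocked locale bookkeeping.
    // 950 is the Big5 code page -- an assumption, verify against your data.
    int ConvertLine(const char* line, int cbLine, WCHAR* wide, int cchWide)
    {
        return MultiByteToWideChar(950, 0, line, cbLine, wide, cchWide);
    }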


Also worth noting is that we’re measuring the teardown cost of the dictionary, and that is adding 7.92% to the overall compute time, almost all of which is freeing the memory.  There’s a good lesson there: many applications try very hard to free their memory before they exit, but I find this to generally be a total waste of time.  The OS will give that memory back just fine; the last thing you should do is force it all to be loaded back into the process’s working set and then the processor’s cache.  It might have been swapped out!  In this case it’s not much anyway though.
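
To make that concrete, here’s a deliberately leaky sketch (the types are stand-ins of mine, not Raymond’s declarations):

    #include <string>
    #include <vector>

    struct DictionaryEntry { std::wstring trad, pinyin, english; }; // stand-in
    struct Dictionary { std::vector<DictionaryEntry> entries; };    // stand-in

    int main()
    {
        // Never deleted, on purpose: process exit returns the whole address
        // space to the OS anyway, so running the destructors only pages the
        // memory back into the working set and cache just to free it.
        Dictionary* dict = new Dictionary();
        (void)dict; // ... use the dictionary ...
        return 0;
    }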


We’re working pretty hard and we’re still almost 3x slower than the managed code.  Stay tuned as Quiz #6 continues to develop 🙂

Comments (10)

  1. Ben says:

    I have a question about the way you are measuring perf. In the managed version, you measure execution by QueryPerformanceCounter. In the unmanaged version, you’re using a profiler. What is the difference? Does the unmanaged profiler include process load time? What happens if you use unmanaged QueryPerformanceCounter calls?

  2. ricom says:

    In both versions I’m using total process time for the reported time. The profiler is only being used to see the breakdown. They are both being measured from process launch to process exit.

    The time the program prints (in both cases) is only a fraction of the total run so I’m not using that. Although I suppose I could. But then GetTickCount() isn’t especially accurate.
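
    For reference, in-process timing with QueryPerformanceCounter would look something like this sketch (not the tool used for the numbers above):

        #include <windows.h>
        #include <stdio.h>

        int main()
        {
            LARGE_INTEGER freq, t0, t1;
            QueryPerformanceFrequency(&freq);  // counts per second
            QueryPerformanceCounter(&t0);
            // ... work being timed ...
            QueryPerformanceCounter(&t1);
            printf("%.6f seconds\n",
                   (t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart);
            return 0;
        }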

  3. CN says:

    So, to clarify, you’re not running a "profiling build" for measuring the total time? With the wrong settings, that alone can multiply run time. What’s worse – it can shift which methods are actually consuming the most cycles, which of course is a delicate problem.

  4. ricom says:

    I’d be feeling pretty stupid right now if I had done that 🙂

  5. Ben says:

    Sounds good. Thanks!

  6. Deenayd says:

    Were measurements done including or excluding .exe startup time (before executing first user instruction)?

    Were required libraries already present in memory (loaded by devenv.exe or something)?

    Were required libraries present in disk cache?

    Somehow I think a real-world scenario should include the possibility that the dictionary is the first .NET program started (or that the previous .NET program was unloaded a long time ago).

  7. ricom says:

    My measurements were done by a tool that times from the process start (including time before main, loading etc.) to process exit.

    In all the cases everything in sight was hot in the cache (i.e. I ran the thing several times first before I timed it).

    If things were cold then module and file i/o would tend to dominate all the analysis. A cold analysis is an interesting thing to do as well but that’s not what we’re doing here.

    I totally encourage you to try this experiment on your own hardware, with the tools of your choice, timing the bits you think are most important, in the way you think is most representative. Contrasting points of view would add value to the discussion. All the source is available for both flavors, and of course the dictionary itself is also available.

  8. It’s a shame Raymond isn’t reserving a decent amount of space in his vector. The vector stores DictionaryEntry by value, so when the vector is resized those objects are copied and the old ones are destroyed, which accounts for 8% of the overhead. In the C# version, when an ArrayList is resized you’re just shuffling references around, which is a lot cheaper.

    He could shave 8% off with a one-line change.
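
    For illustration, that one-line change is a reserve up front; the entry count here is a guess, and the types are stand-ins:

        #include <string>
        #include <vector>

        struct DictionaryEntry { std::wstring trad, pinyin, english; }; // stand-in

        int main()
        {
            std::vector<DictionaryEntry> v;
            // Size the buffer once so growth never copies and destroys every
            // entry; ~120000 is a guessed upper bound, not the real count.
            v.reserve(120000);
            return 0;
        }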

  9. ricom says:

    Raymond is just going after the biggest things first — which is only reasonable. At the moment he’s still got bigger problems than that 8% overhead. Let’s see what develops 🙂

  10. So I was reading through one of my favorite MSDN blogs (http://blogs.msdn.com/oldnewthing/)

    And he…