Loading the dictionary, part 5: Avoiding string copying


Looking at the profile for our program so far, we see that 35% of the CPU time is spent copying strings around. Let's see if we can improve that.

The best way to speed up copying strings is not to copy them in the first place. Using a wstring in our DictionaryEntry structure forces the vector class to copy the string data, when all we really need to copy is the pointer and size information. The actual characters can stay put. C++ has a copy constructor but not a "move" constructor.

Let's use plain string pointers rather than wstring objects. The "copy constructor" for a string pointer is just to copy the pointer—exactly what we want here.

struct DictionaryEntry
{
 bool Parse(const WCHAR* begin, const WCHAR* end);
 void Destruct() {
  delete[] m_pszTrad;
  delete[] m_pszSimp;
  delete[] m_pszPinyin;
  delete[] m_pszEnglish;
 }
 LPWSTR m_pszTrad;
 LPWSTR m_pszSimp;
 LPWSTR m_pszPinyin;
 LPWSTR m_pszEnglish;
};

The DictionaryEntry is no longer a structure of wstrings but rather is just a structure of LPWSTRs, which copy much faster. The cost for this, however, is having to free all the strings manually in the dictionary destructor (which we will see below).

Since we aren't using wstrings any more, we need to allocate the memory for the strings and copy them the old-fashioned way.

LPWSTR AllocString(const WCHAR* begin, const WCHAR* end)
{
 int cch = end - begin + 1;
 LPWSTR psz = new WCHAR[cch];
 lstrcpynW(psz, begin, cch);
 return psz;
}

bool DictionaryEntry::Parse(
       const WCHAR* begin, const WCHAR* end)
{
 const WCHAR* pch = std::find(begin, end, L' ');
 if (pch >= end) return false;
 m_pszTrad = AllocString(begin, pch);
 begin = std::find(pch, end, L'[') + 1;
 if (begin >= end) return false;
 pch = std::find(begin, end, L']');
 if (pch >= end) return false;
 m_pszPinyin = AllocString(begin, pch);
 begin = std::find(pch, end, L'/') + 1;
 if (begin >= end) return false;
 for (pch = end; *--pch != L'/'; ) { }
 if (begin >= pch) return false;
 m_pszEnglish = AllocString(begin, pch);
 return true;
}

There isn't a std::rfind function, so I coded up a backwards-search-for-slash loop inline. Exercise: Why don't I have to check that pch hasn't underflowed beyond the beginning of the string?

class Dictionary
{
public:
 Dictionary();
 ~Dictionary();
 int Length() { return v.size(); }
 const DictionaryEntry& Item(int i) { return v[i]; }
private:
 vector<DictionaryEntry> v;
};

Dictionary::Dictionary()
{
 ...
   if (cchResult){
    // wstring line(buf, cchResult);
    DictionaryEntry de;
    if (de.Parse(buf, buf + cchResult)) {
     ...
}

Dictionary::~Dictionary()
{
 for (vector<DictionaryEntry>::iterator i = v.begin();
      i != v.end(); i++) {
  i->Destruct();
 }
}

The last bits of the change are to get rid of the temporary wstring we parsed from, since we don't need it any more, and to free all the strings in the Dictionary's destructor.

This new program clocks in at 120ms (or 290ms if you include the time it takes to destroy the dictionary). Again, we're twice as fast as the previous version, but still haven't quite crossed the 100ms barrier. And the time to destroy the dictionary is starting to become a more significant factor. But I have another trick up my sleeve.

[Raymond is currently on vacation; this message was pre-recorded.]

Comments (28)
  1. Doug says:

    What is amusing is we are slowly stripping out all the object oriented ease-of-use features, and getting down to some hard, optimized code. I suspect by the time this is done, the OO code will be pretty small… :)

  2. Sriram says:

    I remember Larry Osterman making a comment about the standard libraries disappearing from this sample :)

  3. Anthony Wieser says:

    I hope I don’t spoil the fun, but as we’ve arrived where we are now, I get the slight feeling of I wouldn’t want to get there from here.

    Wouldn’t it have been better to open the file as memory mapped, then convert the whole thing to unicode with a single call on the length of the whole buffer, having allocated another buffer double the size, and then walk through the converted data in place, marking end of strings with 00 00 pairs, and setting up the pointers in the dictionary?

    Then the whole thing can be destroyed with a single delete, as there is nothing extra to clean up, no individual allocations from the heap, etc.

  4. nksingh says:

    What’s the point of destroying anything while you’re quitting the program? I suppose that’s one way of getting rid of all of the destruction time.

    I bet allocation time can be improved by grabbing a large chunk of memory at the beginning and then parcelling it out in the AllocString function.

  5. Tito says:

    Why not use smart pointers?

    boost::shared_array<wchar_t>?

    or perhaps boost::shared_ptr<std::wstring>?

    it might be a little slower, but you’d have much cleaner code.

    (www.boost.org for those who don’t know)

  6. Ben Hutchings says:

    Copying the strings wouldn’t be a problem in VC++ 6 since its std::basic_string implementation uses shared reference-counted buffers. VC++ 7.1 (maybe 7.0 as well?) doesn’t do that, which is good for thread-safety but sometimes bad for memory and processor efficiency. It does have the "small string optimisation", avoiding the need for (de)allocation when copying short strings, but that unfortunately bloats the string objects (to 32 bytes each, I think) and the DictionaryEntry objects (128 bytes?!).

  7. Stewart Tootill says:

    Hopefully C++0x is going to address the "move" constructor issue with the new r-value references. The STL will then be changed to use this when resizing vectors and you’ll be able to get performance like this without losing the nice stl syntax.

    Shame we’ll all have retired by the time we get C++0x though.

  8. Ben Hutchings says:

    Stewart: Move constructors are a great idea but they probably wouldn’t help here as most of the strings are short enough to benefit from the "small string" optimisation anyway.

  9. CornedBee says:

    There isn’t a std::rfind function

    And what’s wrong with wcsrwc or whatever the wide-character version of strrchr is called?

  10. Dave says:

    a couple of minor nits: first, m_pszSimp is uninitialized, which causes the delete[] to choke.

    second, if you fail to parse a DictionaryEntry, Destruct() is never called and any partially-parsed info will be leaked.

    otherwise, though, great series so far, and i’m looking forward to the rest.

  11. Aleko Petkov says:

    I notice a general consensus among the commenters that people are not entirely happy with the direction that this code is moving in. As the code gains speed, it loses readability. We seem to be moving closer and closer to writing plain C. I wouldn’t be surprised to see this program grow a customized memory allocator, or string class.

    As a C++ programmer I recognize this pattern. It seems writing in C++ often comes down to (re)writing things yourself. While there is nothing wrong with this, and it can be fun to do, it is certainly not productive. Think how many more features could have been added to this program in the time spent optimizing.

    As a comparison, look at Rico Mariani’s blog, which has had an equivalent program in C# for more than a week now, and even in its original incarnation it’s still faster than the C++ version, and significantly easier to brain-parse.

    I don’t mean to put down Raymond’s efforts. I myself love reading (and writing) this low-level stuff. This series – indeed the whole blog – is a hacker’s delight. If the purpose of this series is to teach program optimization techniques, it definitely succeeds.

    But if your objective is to write functional, maintainable software in a reasonable amount of time, C++ is looking increasingly less attractive.

  12. indranil banerjee says:

    But, but, but… If the vector was large enough you wouldn’t have had all that copying of DictionaryEntrys in the first place.

    The best optimisation here would have been to reserve a decent sized vector. And replace wstring with a class like flex_string using the small string optimisation

  13. ricom says:

    I dunno… the vector stuff didn’t seem to be what was killing us in the profile. Sure there’s a cost there but it wasn’t at the top of the list.

    Would we really want to bake in assumptions about how big the dictionary is? The final growth of the vector is the most costly.

    If that were the source of problems I would be switching to an alternate data structure because the vector semantics aren’t especially needed to make the dictionary work. A linked list of dictionary chunks for instance would probably be good. Maybe with an index in front of that.

    I’ll be posting my analysis of this version later today. Hope you’re all having fun with this exercise.

  14. But, but, but…why don’t you reserve enough space in your vector up front? That way you wouldn’t have had all the string copies in the first place!

    Now the code is seriously messy and will leak a vast amount of memory if there is an exception.

  15. Ken says:

    Denis – STL is used in many high performance products. STL must be used with care, but that is no different than any other framework. It won’t automatically make your code fast, no.

  16. Rick c says:

    Tito,

    Switching a smart pointer class that runs slower would kind of defeat the purpose of speed-optimizing, wouldn’t you agree?

  17. Rick C says:

    Switching to a smart pointer class that runs slower would kind of defeat the purpose of speed-optimizing, wouldn’t you agree?

  18. Ken says:

    What I don’t understand is why, if the goal was simply to avoid copying the wstrings, you jump directly from wstring to WCHAR*, when a much smaller jump from wstring to wstring* would net you the same lack-of-copying benefit without the requirement of rewriting your code. A few extra *’s, ->’s, news and deletes would have enabled you to use the existing code.

  19. Michael Pryhodko says:

    This is an example of how wrong approach to C++ programming could make C looks better :)

    crap, no wonder that C is still so popular in MS.

  20. ASeverin says:

    This is pretty funny, but all you have demonstrated so far is that you should *never* use STL in products that should run in reasonable timeframe.

    IMHO next step will be ‘get rid of AllocString() & Destruct()’ since we have delimiter characters to place ‘\0’ and can use just pointers to already converted unicode buffer

    /denis

  21. asdf says:

    shared_ptr isn’t just in boost anymore. It’s in C++TR1 due out any time now (see http://aristeia.com/EC3E/TR1_info_frames.html).

  22. Ken says:

    Rico – the data structure you describe (almost) is called std::deque.

    deque (pronounced "deck") is basically a vector of pointers to fixed size chunks. A linked list of pointers wouldn’t allow random access into the container, but a vector of pointers is easy (and fast) to resize – the actual items don’t need to be copied when the container expands, it just has to allocate a new chunk. With large containers, this also eliminates the need for enormous blocks of contiguous memory.

    Using a deque would eliminate all the vector resizing related string copying.

    If the compiler is smart enough, replacing DictionaryEntry::Parse with a constructor that did the same thing and passing in an unnamed DictionaryEntry to the push_back method could eliminate the rest of the string copying. However, that approach would necessitate relying on exceptions to handle unparsable lines, since constructors have to result in a valid object or an exception. Or, have failed parsing result in a sentinel valued DictionaryEntry and pop_back if the most recently inserted DictionaryEntry was a failed parsing.

  23. Anders Dalvander says:

    There isn’t a std::rfind function

    No, and there is no direct need for it either, use std::find with std::reverse_iterator instead.

  24. Ken says:

    Rick C – that depends if the smart pointer is doing more work than you would be otherwise doing yourself.

    shared_ptr does reference counting, which might add a little bit of unnecessary overhead on pointer copies, but I doubt it would hurt a lot. It would certainly be less than the previous cost of copying strings.

    The correct answer, of course, is to profile.

  25. Tim Smith says:

    I’m still wondering what is the point of this series. As others have pointed out, there are some bad and silly things STL does, but many of the optimizations are from what might be a limited understanding of STL. We seem to go from one extreme to another.

    I have been using STL for many years now. I really like it but I understand many of the limitations. Far too many people act like all you have to do is start using technique X and all your problems will be solved. You see this with such things as STL and exceptions. One of the reasons I love reading Herb Sutter’s books is because he doesn’t gloss over the failings of C++ and STL. Often he goes out of his way to tell you why something is the "right thing" to do while telling you the ramifications of doing such things.

  26. James says:

    I see this series as an exercise in showing that there’s much to be said for transparent code. In the final version, you can see exactly what’s happening. In the original version, you can see what the code is supposed to do, but you don’t know what’s going on when it executes, and that’s where the trouble comes from. I’d call the final version of the code simpler and more robust as a result.

  27. Well today Raymond hits a snag in version 5 of the program. I started by profiling the code as provided

Comments are closed.