Loading the dictionary, part 2: Character conversion


When you want to optimize a program, you first need to know where the time is being spent. There's no point optimizing a function that isn't actually responsible for your poor performance. For example, if a particular function is responsible for 2% of your CPU time, then even if you optimized it down to infinite speed, your program would speed up at best by only a little over 2%. In the comments to yesterday's entry, several people put forth suggestions as to how the program could be optimized, in the process quite amply demonstrating this principle. None of the people who made suggestions actually investigated the program to see where the glaring bottleneck was.

(Rico Mariani points out that you also need to take performance into account when doing high-level designs, choosing algorithms and data structures that are suitable for the level of performance you need. If profiling reveals that a fundamental design decision is responsible for a performance bottleneck, you're in big trouble. You will see this sort of performance-guided design as the program develops. And you should check out Performance Quiz #6, which starts with the very program we're developing here.)

Upon profiling our dictionary-loader, I discovered that 80% of the CPU time was spent in getline. Clearly this is where the focus needs to be. Everything else is just noise.

Digging a little deeper, it turns out that 29% of the CPU time was spent by getline doing character set conversion in codecvt::do_in. Some debugging revealed that codecvt::do_in was being called millions of times, each time converting just one or two characters. In fact, for each character in the file, codecvt::do_in was called once and sometimes twice!

Let's get rid of the piecemeal character set conversion and instead convert entire lines at a time.

Dictionary::Dictionary()
{
 std::ifstream src;
 typedef std::codecvt<wchar_t, char, mbstate_t> widecvt;
 std::locale l(".950");
 const widecvt& cvt = _USE(l, widecvt); // use_facet<widecvt>(l);
 src.open("cedict.b5");
 string s;
 while (getline(src, s)) {
  if (s.length() > 0 && s[0] != '#') {
   wchar_t* buf = new wchar_t[s.length()];
   mbstate_t state = mbstate_t();
   const char* nextsrc;
   wchar_t* nextto;
   if (cvt.in(state, s.data(), s.data() + s.length(), nextsrc,
                   buf, buf + s.length(), nextto) == widecvt::ok) {
    wstring line(buf, nextto - buf);
    DictionaryEntry de;
    if (de.Parse(line)) {
     v.push_back(de);
    }
   }
   delete[] buf;
  }
 }
}

Instead of using a wifstream, we just use a non-Unicode ifstream and convert each line to Unicode manually. Doing it a line at a time rather than a character at a time, we hope, will be more efficient.

We ask code page 950 for a converter, which we call cvt. Notice that the Microsoft C++ compiler requires you to use the strange _USE macro instead of the more traditional use_facet.

For each line that isn't a comment, we convert it to Unicode. Our lives are complicated by the fact that codecvt::in requires pointers to elements rather than iterators, which means that we can't use a wstring or a vector; we need a plain boring wchar_t[] array. (Notice that we can cheat on the "from" buffer and use the string::data() function to get at a read-only array representation of the string.) If the conversion succeeds, we convert the array into a proper string and continue as before.
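As an aside, the raw array doesn't have to be a naked new[]/delete[] pair: the elements of a std::vector are contiguous, so &buf[0] serves as the "to" pointer just as well. Here is a minimal sketch of that variant (the widen_line helper is invented for this example, and it uses the classic "C" locale rather than code page 950 so it runs anywhere):

```cpp
#include <cwchar>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper, not the code above: convert one narrow line to a
// wide string in a single codecvt::in call, using std::vector<wchar_t>
// as the scratch buffer so the memory is released automatically even if
// something later throws.
std::wstring widen_line(const std::string& s, const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> widecvt;
    const widecvt& cvt = std::use_facet<widecvt>(loc);

    if (s.empty()) return std::wstring();      // &buf[0] needs a nonempty vector

    std::vector<wchar_t> buf(s.length());      // one wchar_t per source byte suffices
    std::mbstate_t state = std::mbstate_t();
    const char* nextsrc;
    wchar_t* nextto;
    if (cvt.in(state, s.data(), s.data() + s.length(), nextsrc,
               &buf[0], &buf[0] + buf.size(), nextto) != widecvt::ok) {
        return std::wstring();                 // conversion failed; skip the line
    }
    return std::wstring(&buf[0], nextto - &buf[0]);
}
```

The vector buys exception safety at essentially no cost here, since the allocation pattern is identical to the new[]/delete[] version.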

With this tweak, the program now loads the dictionary in 1120ms (or 1180ms if you include the time it takes to destroy the dictionary). That's nearly twice as fast as the previous version.

You might think that we could avoid redundant allocations by caching the temporary conversion buffer between lines. I tried that, and surprisingly, it actually slowed the program down by 10ms. Such is the counter-intuitive world of optimization. That's why it's important to identify your bottlenecks via measurement instead of just guessing at them.
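For the record, the cached-buffer variant I tried looked roughly like this (a reconstruction for illustration, not the exact code that was measured): the buffer lives across iterations and grows only when a longer line shows up, so most iterations do a size check instead of a new[]/delete[] pair.

```cpp
#include <cwchar>
#include <locale>
#include <string>
#include <vector>

// Sketch of the cached-buffer experiment (names invented for this example).
// The conversion buffer is a member, reused from line to line.
class LineConverter {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> widecvt;
    const widecvt& cvt;
    std::vector<wchar_t> buf;   // persists across Convert calls
public:
    explicit LineConverter(const std::locale& loc)
        : cvt(std::use_facet<widecvt>(loc)) {}

    bool Convert(const std::string& s, std::wstring& line)
    {
        if (s.empty()) { line.clear(); return true; }
        if (buf.size() < s.length()) buf.resize(s.length()); // grow only, never shrink
        std::mbstate_t state = std::mbstate_t();
        const char* nextsrc;
        wchar_t* nextto;
        if (cvt.in(state, s.data(), s.data() + s.length(), nextsrc,
                   &buf[0], &buf[0] + buf.size(), nextto) != widecvt::ok)
            return false;
        line.assign(&buf[0], nextto - &buf[0]);
        return true;
    }
};
```

On paper this should win; in the measured run it didn't, which is exactly why you measure.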

Comments (24)
  1. Jack Mathews says:

    The only real complaint I have about this is the naked pointer usage.

    When you new in a naked pointer like that, you kill exception safety. Let’s say that widecvt decides to throw an exception (or someone else comes in to fix your code and adds something that can throw an exception in the middle). Your "new" there just created a memory leak.

    If you code in a way where you use vectors or auto_ptr’s instead of naked pointers, then you’re always exception safe without leaks. Exception safe C++ coding has increased our code quality and decreased memory leaks a very, VERY significant amount.

  2. I’m curious to know what it is that you used to profile your code? The only profiler I’ve ever used is Shark for the Mac — did you use something similar to that?

  3. Tito says:

    or check out boost’s scoped_ptr<> rather than auto_ptr<>. (see http://www.boost.org)

    Either way using an raii-style pointer would make the code more readable, since you wouldn’t need the delete[], which is quite irrelevant to the logic you are demonstrating.

  4. tomw says:

    I’m w/ Aaron – I’d love to hear a little more about how you did: "Upon profiling our dictionary-loader, I discovered that 80% of the CPU time was spent in getline. Clearly this is where the focus needs to be. Everything else is just noise.

    Digging a little deeper, it turns out that 29% of the CPU time was spent by getline doing character set conversion in codecvt::do_in. Some debugging revealed that codecvt::do_in was being called millions of times, each time converting just one or two characters. In fact, for each character in the file, codecvt::do_in was called once and sometimes twice!"

  5. Jerry Pisk says:

    I wonder how a reusable buffer can be slower than calling new and delete every iteration. Especially if you hit your longest line early on, or pad it enough so most of your iterations will simply execute a compare (to make sure the buffer is long enough) and go on with their lives. There’s no way a conditional jump is slower than allocating and freeing memory on the heap.

  6. Excellent point about exception safety. I don’t normally program with STL or exceptions – the fact that I’m using STL at all is a bit of a departure.

    If you want more details on using the profiler that comes with VSTS, check out Ian Huff’s blog http://blogs.msdn.com/ianhu/ and his video on Channel9.

    I don’t know why reusing the buffer made the program slower, but it did. Maybe there are cache effects?

  7. ... says:

    Our lives are complicated by the fact that codecvt::in requires pointers to elements rather than iterators, which means that we can’t use a wstring or a vector;

    What prevents you from using std::vector<wchar_t>?

  8. I don’t see a vector::data() function, nor do I see any guarantee that the elements of a vector are contiguous in memory or that vector::begin() can be treated as a pointer to the first element of the array. Maybe I missed it.

  9. Mat Booth says:

    According to the comp.lang.c++ FAQ, the storage for a std::vector<T> is guaranteed to be contiguous. See http://www.parashift.com/c++-faq-lite/containers.html#faq-34.3

  10. indranil banerjee says:

    The standard was amended to make it clear that vector is required to use contiguous memory

    http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-defects.html

  11. "I don’t see a vector::data() function, nor do I see any guarantee that the elements of a vector are contiguous in memory or that vector::begin() can be treated as a pointer to the first element of the array. Maybe I missed it."

    vector<T>::begin() should not be treated as a pointer. &v[0] is, though.

    In the original C++ standard (14882-1998) vector was not explicitly required to have contiguous memory, but the complexity requirements on it pretty much ensured it. In the 2003 revision of the standard, they made that requirement explicit.

    I don’t know of a single STL implementation that has a non-contiguous vector, so it’s pretty safe.

    FYI, do you know if widecvt outputs UCS-2 or UTF-16? If it’s UTF-16, then it’s entirely possible that you could have surrogate pairs (two wchar_ts that represent a single Unicode codepoint).

  12. Universalis, most Win32 APIs expect/use UTF-16 when they receive or return WCHARs. I have no doubt that there are many apps out there that do not anticipate this.

    A problem that’s even more widespread are apps that expect every Unicode codepoint to be a single grapheme cluster, and thus don’t support combining characters. A good example is IE6 — the Address box handles them correctly, since it is a standard system-wide text edit control. The windowless controls used for form input in pages, however, do not handle combining characters properly. (And Firefox is absolutely broken across the board on this.)

    Michael Kaplan touched on this a while ago in his blog.

  13. To clarify, though, an app that is properly written other than expecting UCS-2 will simply have a broken UI, not a crash.

    In UCS-2, the surrogate pair characters are simply two separate characters. So, if you pass UCS-2 code a surrogate pair, it’ll expect a string that’s longer, and render two characters. This could cause some display bugs, of course, but nothing should be fatal.

    This is a good thing, because we didn’t really support UTF-16 in the UI until WinXP and its ClearType engine; Win2K, for all intents and purposes, acted as though wchar_t/WCHAR was UCS-2.
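    (To make the surrogate-pair arithmetic in the preceding comments concrete, here is a small illustrative sketch; the helper names are invented for this example. In UTF-16, a code point above U+FFFF is stored as a high surrogate in 0xD800–0xDBFF followed by a low surrogate in 0xDC00–0xDFFF; UCS-2 code sees the two units as two separate characters.)

```cpp
// Illustrative helpers, not from the post or any Windows API.
bool is_high_surrogate(unsigned short u) { return u >= 0xD800 && u <= 0xDBFF; }
bool is_low_surrogate(unsigned short u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a surrogate pair into the Unicode code point it encodes:
// take 10 bits from each half and add the 0x10000 offset.
unsigned long combine_surrogates(unsigned short hi, unsigned short lo)
{
    return 0x10000ul + ((hi - 0xD800ul) << 10) + (lo - 0xDC00ul);
}
```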

  14. You might think that we could avoid redundant allocations by caching the temporary conversion buffer between lines. I tried that, and surprisingly, it actually slowed the program down by 10ms. Such is the counter-intuitive world of optimization.

    The fact is that the CRT is not required to actually return the memory to the OS after a call to free/delete. It can do its own caching.

    As for the exception safety in the sample, the rule of thumb is: whenever you manually call delete outside a destructor, chances are your code is not exception-safe. Using std::vector or (if you believe it is too much overhead – talk about premature optimization) boost::scoped_array is the key to making the code exception-safe.

  15. Will Dean says:

    ‘Slowed down by 10ms’ is almost certainly completely meaningless, because your GetTickCount resolution is very likely to be 10ms.

  16. Kevin Gadd says:

    Will Dean: Why on earth would a profiler use GetTickCount? Mine uses QueryPerformanceCounter, which has no such accuracy limitation.

  17. Justin Olbrantz says:

    I usually profile my functions with the CPU cycle counter on the average of many repetitions, running at realtime priority class. That gives me about +/- 15 cycles accuracy. But that’s for individual, small functions, not something anywhere near 1120 ms (on my computer, monopolizing the single CPU for even half that long kills the computer).

  18. Universalis says:

    That is an excellent point that Ryan Myers has just made. For some reason people (at least, people in a hurry) seem to think that "Unicode" means "16-bit" whereas of course it doesn’t. I was facing just this problem in a database program that needed to do boil-to-an-indexable-form stuff the whole time, and I ended up making the [deeply] internal character representation 32-bit. I know that the supplementary characters are most of them pretty obscure, but the prospect of a program collapsing or producing unintended behaviour just because – years down the line – some non-technical user thinks innocently that "a character is a character" (without appreciating that some of them don’t fit into 16 bits) is really unacceptable. It’s a 20-year-old program I’m working with and it is entirely possible that it will still be in use in 20 years’ time.

    For the purpose of the example program whose evolution we’re watching this may be needless and quixotic, but I think it’s important for any seriously long-lived program to be able to handle the full Unicode character set even if this means a certain back-and-forth between UTF-8 (best on disk), UTF-16 (which I *think* the Windows API uses, but please correct me) and UCS-4 (which we know can accommodate everything).

    If some algorithms are slowed unacceptably by the 32-bit character representation, then later, as an optimisation stage, you might produce 16-bit implementations that are rigorously hidden from the caller – the internal logic could then be "call 16-bit version and if that fails then call 32-bit version"… this optimisation being "under the surface" and not available to an inappropriately ingenious caller.

  19. Will Dean says:

    "Why on earth would a profiler use GetTickCount?": If you look at Raymond’s code, posted previously, you’ll see that he’s using GetTickCount to get the headline ‘total time’ figures that he quotes throughout the articles.

  20. Raymond, three things:

    1. Your code is not exception-safe. Variable buf will leak if an exception is thrown inside the loop.

    2. Re the temporary buffer allocation performance: A better solution may be to use a class that contains a default buffer allocated on the stack, but then allocates the buffer on the heap if the default buffer is not big enough.

    3. An unfortunate result of the C++ standard rules concerning vectors is that all elements have to be in a contiguous piece of memory. Thus it’s valid to obtain pointers to various elements and pass them to codecvt::in. I don’t know if the same applies to basic_string; maybe Herb Sutter can shed some light.

    Dejan

  21. Will Dean says:

    std::basic_string is not required to be contiguous. std::vector is.

  22. Arta says:

    People seem to be making noises about allocating and deleting the buffer on each loop iteration, without suggesting that the profiler be used to see if that’s really a bottleneck…

  23. Stefang jumped into the fray with his analysis in the comments from my last posting. Thank you…

  24. Neil says:

    Is the (lack of) performance of std::wifstream a fault of the library or a fault of the standard?

Comments are closed.