Loading the dictionary, part 4: Character conversion redux


Getting rid of getline was a big help, but 480ms is still not quite peppy enough. You need to respond to user actions within a tenth of a second for thing to seem responsive.

Profiling the latest endeavor reveals that 40% of our CPU time is spent in codecvt::in. Some debugging reveals that codecvt::in ultimately calls MultiByteToWideChar but uses it to convert only one or two characters at a time, even though we handed it a whole line.

Let's get rid of codecvt::in and convert the characters ourselves, calling MultiByteToWideChar exactly once to convert the entire line at a single go.

#define CP_BIG5 950

Dictionary::Dictionary()
{
 MappedTextFile mtf(TEXT("cedict.b5"));
 // typedef std::codecvt<wchar_t, char, mbstate_t> widecvt;
 // std::locale l(".950");
 // const widecvt& cvt = _USE(l, widecvt); // use_facet<widecvt>(l);
 const CHAR* pchBuf = mtf.Buffer();
 const CHAR* pchEnd = pchBuf + mtf.Length();
 while (pchBuf < pchEnd) {
  const CHAR* pchEOL = std::find(pchBuf, pchEnd, '\n');
  if (*pchBuf != '#') {
   size_t cchBuf = pchEOL - pchBuf;
   wchar_t* buf = new wchar_t[cchBuf];
   DWORD cchResult = MultiByteToWideChar(CP_BIG5, 0,
                          pchBuf, cchBuf, buf, cchBuf);
   if (cchResult) {
    wstring line(buf, cchResult);
    DictionaryEntry de;
    if (de.Parse(line)) {
     v.push_back(de);
    }
   }
   delete[] buf;
  }
  pchBuf = pchEOL + 1;
 }
}

Instead of using the codecvt::in method to perform character conversion, we go straight to the MultiByteToWideChar function. Notice that we assume that the Big5 string will not generate more Unicode characters than its length in bytes. This happens to be a safe assumption based on our external knowledge of the Big5 encoding. (If the encoding were something else, the assumption may no longer be valid.)

With this change, the dictionary load time has dropped to 240ms (or 300ms if you include the time it takes to destroy the dictionary). That's twice as fast the previous version, but still not quite close enough to the 100ms goal. We still have some work ahead of us.

[Raymond is currently on vacation; this message was pre-recorded.]

Comments (24)
  1. CornedBee says:

    I can’t help but think that so far, all you have demonstrated is how inefficient the MS STL implementation is.

    Perhaps it’s not possible to make it much better. But I don’t really believe that.

  2. Anonymous says:

    C++ and its lack of support of Unicode is a big, big pain… You cannot rely on nothing.

  3. Anonymous says:

    Raymond is definately making some great headway.&amp;nbsp; Having targetted the single-character at a time…

  4. Anonymous says:

    I’ve run the four versions, but get no discernable difference in performance:

    version 3: 210ms

    version 4: 210ms

    (with Borland CPP builder)

  5. Anonymous says:

    niven – sounds like you’re using GetTickCount() to do your perf timing.

    You might want to use the instruction cycle timer or the performance counter instead.

  6. Anonymous says:

    Just for reference here http://blogs.msdn.com/ricom/ the managed unoptimized version (whith plain old stream ReadLine ) clocking at 124ms which seems to me that the small allocations are the weakest part of the standard library.

  7. Anonymous says:

    Dictionary::Dictionary()

    {

    MappedTextFile mtf(TEXT("cedict.b5"));

    const CHAR* pchBuf = mtf.Buffer();

    const CHAR* pchEnd = pchBuf + mtf.Length();

    std::vector<wchar_t> buf;

    while (pchBuf < pchEnd) {

    const CHAR* pchEOL = std::find(pchBuf, pchEnd, ‘n’);

    if (*pchBuf != ‘#’) {

    size_t cchBuf = pchEOL – pchBuf;

    buf.resize(cchBuf + 1); // +1 to prevent zero length buffer, which is undefined.

    DWORD cchResult = MultiByteToWideChar(CP_BIG5, 0,

    pchBuf, cchBuf, &buf[0], cchBuf);

    if (cchResult) {

    wstring line(&buf[0], cchResult);

    DictionaryEntry de;

    if (de.Parse(line)) {

    v.push_back(de);

    }

    }

    }

    pchBuf = pchEOL + 1;

    }

    }

  8. Anonymous says:

    Some notes about code in your example:

    wstring’s constructor could throw

    DictionaryEntry’s contructor could throw

    de.Parse could throw

    v.push_back could throw

    Any of these events will give you memory leak, because nobody will call ‘delete[] buf’.

  9. Anonymous says:

    And finally (I hope):

    it is a good idea to allocate buffer from stack (with ‘_alloca()’) — this will give significant boost in speed depending on type of CRT you are linking.

    To prvent stack overuse(and overflow), each ‘while’ iteration should be contained on separate stack frame — for example containing it in a separate __fastcall __decspec(noinline) function. On modern PCs cost of function call is insignificant in comparison with lock/unlock in malloc (even if it is amortized by any mechanism).

    Basically — just use your mind, Luke… :)

  10. Anonymous says:

    Microsoft’s implementation of _alloca is ass-backwards as can be. They just reused (literally, same address, different symbol names) the code they call on functions with large stacks (which just touches the stack in 4k increments) to implement it. Instead of fixing _alloca, they added another function _resetstkoflw (that looks like it was written by a summer intern) which you have to call in a SEH block that tests for stack overflow.

    It would be interesting to see how much of a further improvement this code gets just by using Doug Lea’s malloc instead of using VC’s allocator and maybe even add STLPort to the mix instead of VC’s STL/string classes.

  11. Anonymous says:
    1. you need to call ‘_resetstkoflw’ ONLY if you wish to recover from stack overflow (which is nonsense by itself from C++’s point of view)

      2. _alloca’s implementation looks reasonable to me… i.e. this intrinsic is doing almost nothing, leaving stack-growth logic to underlying system.

      3. In this specific example stack should not ever overflow, and there should be no need to call ‘_resetstkoflw’, and using _alloca is ok

      4. It is unconvenient to force system to increase stack size, because AFAIK all Windows systems never reclaims committed memory from stack while it’s owner thread is still running.

      > It would be interesting to see how much of a

      > further improvement this code gets just by

      > using Doug Lea’s malloc instead of using VC’s

      > allocator and maybe even add STLPort to the

      > mix instead of VC’s STL/string classes.

      It will be slower. _alloca() is intrinsic and should end up with 1-3 asm commands…

  12. Anonymous says:

    My guess is that in the end we will have something like this:

    open MemoryMappedFile

    Allocate wcharBuffer // big enough for full file

    MultiByteToWideChar // one single conversion

    parse the wcharBuffer and just remember

    pointers inside the wcharBuffer, no allocation,

    no nothing.

    And I bet Raymond has it all written already, just trying to show us how it is done :-)

  13. Mihai,

    Raymond had it written sometime last year (in September or October, IIRC).

    He writes that far ahead.

  14. Anonymous says:

    I wonder if ‘n’ could be a part of multibyte character? if yes — this code is invalid.

    AFAIK, Windows supports only DBCS (i.e. up to 2 bytes per character) — who knows if it is possible for DBCS character to have ‘n’ in second byte?

  15. Anonymous says:

    Hold it, I thought I just had to use exceptions and the world became a better place without me having to do anymore work.

    *snicker*

    Once again showing that switching between return values and exceptions just means that the developer has to check for something different.

  16. Anonymous says:

    > Once again showing that switching between

    > return values and exceptions just means that

    > the developer has to check for something

    > different.

    No — that means that developer must be always **prepared**. :) That is different.

    Other notes:

    – it is a good idea to do ‘v.reserve(N)’ with some meaningfuill value N (in case if ‘v’ is std::vector)

    – if size of v could potentially grow large I’d strongly suggest to use ‘deque’ (in case if ‘v’ is std::vector)

  17. Anonymous says:

    In a production environment, wouldn’t it be more efficient to write a converter to preprocess the file, and make the whole thing unicode to begin with?

  18. I’m not 100% sure (I don’t have my DBCS manual here) but I believe that no characters < 0x1f are legal leading or trailing bytes in Big5

  19. Anonymous says:

    From <http://www.jbrowse.com/text/charsets.html#Big5&gt;

    The term ‘Big5’ has been abused quite a lot over the years. The original Big5 character repertoire is no longer used and the name ‘Big5’ usually means one of the many extensions. The main extensions are Microsoft’s CP950, Big5-ETen, and Big5-HKSCS.

    Big5 is both a character set and an encoding. As an encoding, it is a DBCS encoding with lead bytes in the range 0xa1-0xf90 and trails 0x40-0xfe.

  20. Anonymous says:

    From <http://www.jbrowse.com/text/charsets.html#Big5&gt;

    > Big5 is both a character set and an encoding.

    > As an encoding, it is a DBCS encoding with

    > lead bytes in the range 0xa1-0xf90 and trails

    > 0x40-0xfe.

    Anyway, it worth pointing out that for example in ‘CP949’ ‘n’ could be a part of DBCS character:

    [—- quote begin —-]

    CP949 charset encoding

    Also called: UHC, Unified Hangul Code

    Languages: Korean

    This is Microsoft’s favored way of representing Korean. It is a derivative of EUC-KR, extended to include all johab precomposed hangul. Like other east asian Microsoft encodings, it allows ASCII trail bytes, and lead bytes in the control code range, thus losing a form of ASCII compatibility.

    [—- quote end —-]

  21. Anonymous says:

    Ah ha… Here’s a table from MSDN listing the lead and trail byte ranges for the various Far East code pages used by Windows:

    http://web.archive.org/web/20000303195204/http://msdn.microsoft.com/library/books/devintl/S24B2_b2.HTM

    (Link at archive.org because it seems it’s no longer on microsoft.com anymore.)

  22. Anonymous says:

    > Anyway, it worth pointing out that for example in ‘CP949’ ‘n’ could be a part of DBCS character

    I don’t think your source is correct. According to http://www.microsoft.com/globaldev/reference/dbcs/949.mspx , lead bytes start at 0x81, and trail bytes (as far as I can tell) go no lower than 0x41.

    http://lists.freebsd.org/pipermail/freebsd-bugs/2003-August/002657.html (random page found via Google) agrees:

    + * CP949 contains four character sets:

    + *

    + * <0x00-0x7f> ASCII equivalent

    + * <0xa1-0xfd><0xa1-0xfe> EUC instance of KS X 1001:2002

    + * <0x81-0xa0><0x41-0xfe> Hangul syllables GAGG-JWAK

    + * <0xa1-0xc6><0x41-0xa0> Hangul syllables JWAT-HIH (ends with C652)

  23. Anonymous says:

    Raymond is definately making some great headway. Having targetted the single-character at a time conversion

Comments are closed.