Loading the dictionary, part 1: Starting point


The first thing we'll need to do in our little dictionary program is to load the dictionary into memory. The dictionary file is plain text, each line of which takes the form

Chinese [pinyin] /English 1/English 2/.../
舉例 [ju3 li4] /to give an example/

Since it was the Big5 dictionary we downloaded, the Chinese characters are in Big5 format, known to Windows as code page 950. Our program will be Unicode, so we'll have to convert it as we load the dictionary. Yes, I could've used the Unicode version of the dictionary, but it so happens that when I set out to write this program, there was no Unicode version available. Fortunately, this oversight opened up the opportunity to illustrate some other programming decisions and techniques.

The first stage in our series of exercises will be loading the dictionary into memory.

#define UNICODE
#define _UNICODE
#include <windows.h>
#include <string>
#include <fstream>
#include <iostream> // for cin/cout
#include <vector>

using std::string;
using std::wstring;
using std::vector;

struct DictionaryEntry
{
 bool Parse(const wstring& line);
 wstring trad;
 wstring simp;
 wstring pinyin;
 wstring english;
};

bool DictionaryEntry::Parse(const wstring& line)
{
    wstring::size_type start = 0;
    wstring::size_type end = line.find(L' ', start);
    if (end == wstring::npos) return false;
    trad.assign(line, start, end - start);
    start = line.find(L'[', end);
    if (start == wstring::npos) return false;
    end = line.find(L']', ++start);
    if (end == wstring::npos) return false;
    pinyin.assign(line, start, end - start);
    start = line.find(L'/', end);
    if (start == wstring::npos) return false;
    start++;
    end = line.rfind(L'/');
    if (end == wstring::npos) return false;
    if (end <= start) return false;
    english.assign(line, start, end-start);
    return true;
}

class Dictionary
{
public:
 Dictionary();
 int Length() const { return (int)v.size(); }
 const DictionaryEntry& Item(int i) const { return v[i]; }
private:
 vector<DictionaryEntry> v;
};

Dictionary::Dictionary()
{
 std::wifstream src;
 src.imbue(std::locale(".950"));
 src.open("cedict.b5");
 wstring s;
 while (getline(src, s)) {
  if (s.length() > 0 && s[0] != L'#') {
   DictionaryEntry de;
   if (de.Parse(s)) {
    v.push_back(de);
   }
  }
 }
}

int __cdecl main(int argc, const char* argv[])
{
 DWORD dw = GetTickCount();
 {
  Dictionary dict;
  std::cout << dict.Length() << std::endl;
  std::cout << GetTickCount() - dw << std::endl;
 }
 std::cout << GetTickCount() - dw << std::endl;
 return 0;
}

Our dictionary is just a list of words with their English definitions. The Chinese words are written in three forms (traditional Chinese, simplified Chinese, and Pinyin romanization). For those who are curious, there are two writing systems for the Mandarin Chinese language and two phonetic systems. Which one a particular Mandarin-speaking population follows depends on whether they fell under the influence of China's language reform of 1956. Traditional Chinese characters and the Bopomo phonetic system (also called Bopomofo) are used on Taiwan; simplified Chinese characters and the Pinyin system are used in China. Converting Pinyin to Bopomo isn't interesting, so I've removed that part from the program I'm presenting here.

(The schism in the spelling of the English language follows a similar pattern. Under the leadership of Noah Webster, the United States underwent its own spelling reform, but countries which were under the influence of the British crown retained the traditional spellings. Spelling reform continues in other languages even today, and the subject is almost always highly contentious, with traditionalists and reformists pitted against each other in a battle over a language's—and by proxy, a culture's—identity.)

The program itself is fairly straightforward. It creates a Unicode file stream wifstream and "imbues" it with code page 950 (Big5). This instructs the runtime to interpret the bytes of the file according to the specified code page. We read strings out of the file, ignore the comments, and parse the rest, appending them to our vector of dictionary entries.

Parsing the line consists of finding the spaces, brackets, and slashes, and splitting the line into the traditional Chinese, Pinyin, and English components. (We'll deal with simplified Chinese later.)

When I run this program on my machine, the dictionary loads in 2080ms (or 2140ms if you include the time to run the destructor). This is an unacceptably long startup time, so the first order of business is to make startup faster. That will be the focus of this stage.

Notice that as a sanity check, I print the total number of words in the dictionary. The number should match the number of lines in the cedict.b5 file (minus the one comment line). If not, then I know that something went wrong. This is an important sanity check: You might make a performance optimization that looks great when you run it past a stopwatch, only to discover that your "optimization" actually introduced a bug. For example, one of my attempted optimizations of this program resulted in a phenomenal tenfold speedup, but only because of a bug that caused it to think it was finished when it had in reality processed only 10% of the dictionary!

As my colleague Rico Mariani is fond of saying, "It's easy to make it fast if it doesn't have to work!"

Comments (33)
  1. Anonymous says:

    Raymond Chen is going to be developing a Chinese dictionary over the next while. This is a really cool…

  2. Anonymous says:

    Makes me wish I knew some Chinese. Could you do one in Russian? Just kidding. Not quite the same complexity, eh? This will be really interesting. I love the blog, Raymond.

  3. Anonymous says:

    okay, so this doesn’t work for me; converting the first non-ascii character (0xa8) causes a conversion error. in xmbtowc.c, _Mbrtowc() is using the macro _cpp_isleadbyte(), which uses the _pctype pointer from ctype.c. if i’m reading this right, __init_ctype() is supposed to set _pctype to a newly-created table for the codepage. the problem is that __init_ctype() gets called twice: inside the std::locale constructor, a temporary _Locinfo object gets used, and in its destructor it seems to restore the original locale, causing __init_ctype() to be called for the original "C" locale. subsequently, neither __init_ctype() nor setlocale() is called, so the locale is the plain "C" locale. am i missing something?

  4. Anonymous says:

    Raymond,

    Perhaps this could be made faster if you used v.reserve(…) to avoid realloc’s on push_back? This would require some knowledge of the eventual size of the Dictionary.

    -Mark

  5. Anonymous says:

    I had the same problems, but I found a workaround.

    Add this line to the constructor:

    std::locale old = std::locale::global(std::locale(".950"));

  6. Anonymous says:

    The first order of business is to optimize the startup time? Hmmm, isn’t that a bit premature? :)

    Jokes aside, this looks like an interesting project; I’ll be sure to keep an eye on it. Excellent blog, keep up the good work.

  7. Mike Dunn says:

    hmm, VC6 is giving me no love with this. dict.Length() returns 1184, but it sure runs blindingly fast! ;)

  8. Anonymous says:

    I’m curious, but I’m sure I’ll get an answer tomorrow: are you intending to make the input from the stream be processed faster? Or to defer it to a worker thread so that you can display the UI before the file is completely there?

    I’m personally quite a big fan of ‘few error/status messages’. What I mean by that is that I personally would take the path of having the UI show a ‘searching’ status sign when a word is searched for, and use that same status display when the in memory database is locked because it is loading. The resulting scenario would be that if I launch my app and type in a word right away, I would get a ‘searching…’ indicator until the db had fully loaded (as opposed to a ‘loading db’ indicator).

    That’s just me, I’m hyped to see the rest!

  9. Anonymous says:

    I would just prefer using a database.

  10. indranilbanerjee says:

    Premature optimisation is the root of all evil, so here goes…

    The biggest and most immediate bang for your buck is to reserve a vector of reasonable size. For big dictionaries it will make a huge difference.

    Secondly, DictionaryEntry has a member called simp which does nothing but take up space, so you could lose that.

    If there was still a problem, then I would reduce the number of DictionaryEntry objects created and copied around.

    Move the parsing code into a non member function which returned the positions of the substrings.

    <pre>

    if (Parse(line, space, lbrace, rbrace, firstslash, lastslash)) {

    v.push_back( DictionaryEntry(line.substr(0, space),….);

    }

    </pre>

  11. Anonymous says:

    Did you really test the code?

    Here are some timings:

    No changes: 471 ms

    Removing the simp member: 471 ms

    Reserving space in vector: 421 ms

    Didn’t test the string thing, but if I comment out the entire if-statement and just blast through the file with getline it still takes 361 ms. Unless you change the way the file is read, you won’t get it much faster than that.

    -Daniel

  12. indranilbanerjee says:

    Doesn’t a Dictionary involve doing lookups? If so, why a vector? Surely a map is more appropriate?

  13. indranilbanerjee: Think it through. Suppose physical dictionaries worked like that.

  14. indranilbanerjee says:

    "Think it through. Suppose physical dictionaries worked like that. "

    With a book I can add bookmarks and if I’m looking up a word beginning with Z I can start looking near the back.

    I guess you can do the same with a random access container like vector. Not quite the same, though.

  15. Substring searches are kind of hard with maps, too. And iterating through a map has poor locality — we’ll see the impact of that in autumn.

  16. Anonymous says:

    I love it when you say things like that. Cracks me up, especially when the time comes. I bet you even have the precise date already.

    I wonder if disk fragmentation would make much difference in a file this size. Maybe I should find out.

  17. Anonymous says:

    Well, I compiled and ran the program, it bailed after reading 1184 entries, apparently getline() failed on the entry after bao3 zhang4. Stepping into STL code was useless, since STL code was too difficult to figure out.

    So I rewrote Dictionary::Dictionary() to use FILE and call MultiByteToWideChar() myself, and it worked fine, and it also seemed a lot faster than the IO stream version.

  18. Anonymous says:

    Yes, B.Y.

    I agree that plain ReadFile, with API functions to translate encodings, has always been almost an order of magnitude faster than the STL stream implementations.

    But then again, the complexity of the project increases.

  19. Mike Dunn says:

    Without doing any measurements (sorry Rico) I’d wager that the getline() is a big bottleneck.

    A while ago I did something similar (line-by-line file reads) and I did it by inputting from a file into a stringstream. This was incredibly slow in some cases, and I figured out (after decrypting the STL source) that it was doing character-by-character appends to the string that the stringstream manages. This was causing a realloc on every character. UGH!

    So I fixed it up by pre-calculating the length of each line and calling something like assign() (I forget what I did exactly) that takes a start/end point. That way there was just one string alloc per line.

  20. Anonymous says:

    Interesting topic. I look forward to the next post.

  21. Anonymous says:

    There is a mob who have a CEDICT based dictionary available here:

    http://www.dcelan-software.com/chinese

    It is pretty cool.

  22. Anonymous says:

    Can’t compile this with MinGW – is wifstream part of standard C++?

  23. Anonymous says:

    When you say you need a "sanity check", Raymond, the need for unit tests for this code screams out at me. Why don’t you use CppUnit, or even just write a class yourself, to test the code?

    As you’ve described in your sanity check, it’s easy to write a simple test for this. Run your startup code and verify that the output (or even the vector itself) has the correct number of lines. Surely this will be faster, more reliable, and more likely to be used than a manual test.

    Regards,

    Matt Ryall

  24. If you want to write a CppUnit module, be my guest. But it would interrupt the flow of the article series. And writing a unit test to verify that the number of entries in the vector is correct seems like overkill for such a small program. Especially considering how long it would take to explain the unit testing framework – and of course people would then argue with my choice of unit testing framework. I just avoid the framework problem by writing raw Win32.

    Remember people, this is a blog, not a book on software engineering.

  25. Anonymous says:

    You have to load nearly every byte of the file into memory anyway, so why not just load the whole thing into a big wchar_t buffer? Then instead of wstrings in the dictionary entry, use pointers into the memory buffer. So instead of:

    wstring english;

    Use

    wchar_t* english_begin;

    wchar_t* english_end;

    This will reduce the code to nearly zero memory copies.

    Maybe using a memory-mapped file would also help? I dunno. I’ve never used one.

  26. Anonymous says:

    I am guessing he preprocesses the dict file and puts it in some sort of binary format. No parsing needed.

  27. Anonymous says:

    I’m thinking the following would create a serious speedup:

    – Read the file in chunks (read 2048 bytes)

    – Then find the EOL for the current pos (append the chars to the chunk)

    – Then parse the chunk (handle the EOL in memory)

    – Loop this until last chunk

    This will decrease the total disk reads (1 char at a time) and reallocs (append 1 char at a time).

  28. wojtek says:

    I tried the same technique as Stuard: I memory-mapped the file, then converted to Unicode in one shot. I also played with the data structure. Instead of wstring objects I tried good old wchar_t* pointers. During parsing I set these pointers into the Unicode buffer that I allocated anyway for the MultiByteToWideChar call. I managed to get from around 700ms (original code) down to ~30ms. Well, 700ms is my approximation; I also got only 1184 lines parsed by the original code, as mentioned in previous posts, but I didn’t bother to find out why. Maybe it’s VS60’s fault? I bet Raymond did test it, as he writes. :)

    I also tried Stuard’s variant 1 and I got ~70ms.

    For a file this size, the extra memory needed isn’t that significant. I needed a buffer exactly 1074434 bytes long for the Unicode conversion; that’s not a big deal.

  29. Anonymous says:

    In Dictionary::Item(), instead of returning "v[i]", you might consider "v.at(i)". The latter does range checking and throws an exception on an out-of-range index. The former may return something (not anything useful) on an out-of-range index. There is a slight performance hit, but the safety is usually worth it.

  30. Anonymous says:

    I memory map lots of stuff in my apps. The facility is there and seems to bypass many potential sources of speed loss. You don’t clutter the system page file allocating space for your app to get a copy of the data. If multiple instances of your app load up, the mapping is potentially shared.

    You also avoid invoking ReadFile, which is going to have multiple memory-to-memory copies. (I’m still not entirely clear on where Win32 keeps its disk cache, but suspect that memory-mapped files avoid exercising that overly much too.)

    (Plus, memory mapping makes for efficient CE apps where your data files are in memory anyway and loading data the traditional way is explicitly wasteful).

  31. Anonymous says:

    OK – so I tried my favourite strategy when presented with the challenge of optimising the parsing of a large text file – memory map it!

    I tried three variants:

    1. Map the file, but keep Raymond’s DictionaryEntry::Parse method (i.e. allocate a string for each line on the fly).

    2. As 1, but change DictionaryEntry::Parse to use a pair of pointers to define the line and use std::find rather than std::wstring::find (i.e. always operate on the converted file mapping).

    3. As 2, but store pointer pairs instead of strings for each item in a DictionaryEntry. You obviously need to keep the converted file mapping (see below) allocated through the Dictionary life for this to work!

    In all cases, I manually detect line endings by looking for either of ‘\n’ or ‘\r’. Also, you need to create a wide-string equivalent of the file mapping by performing MultiByteToWideChar on the whole file, which incurs a significant memory overhead compared to the std::wifstream approach.

    As expected, improved performance was exhibited:

    Original code: 575ms

    Variant 1: 227ms

    Variant 2: 199ms

    Variant 3: 65ms

    All timings are total elapsed times, averaged over 10 executions of the code. All code was compiled with VC++ 7.1 with a command line of ‘cl -EHsc -O2 <filename>’.

    Conclusions?

    1. I suspect calling MultiByteToWideChar individually for each character in a file isn’t as time-efficient as calling it once for the whole file – what’s the per-invocation overhead of that function?

    2. std::streams appear to add *a lot* of overhead, looking at the source.

    3. Minimising the number of memory allocations/deallocations doesn’t hurt – what a surprise :-)

  32. Anonymous says:

    MinGW does seem to miss the wifstream definition. (Had the same problem with Dev-C++.)

    Borland C++ Builder compiles the program without problems (it is free and only an 8MB download).

Comments are closed.