Very late remarks on the original Chinese dictionary series


I have not forgotten about the Chinese/English dictionary series, but I simply haven't had the motivation to sit down and write up descriptions and discussion of the code that I wrote along the way. So instead of adding to the program, I'm going to answer some questions that were asked back when I started the series but which I didn't respond to at the time because I was out of town.

More than one commenter suggested using v.reserve() to pre-allocate the vector memory. First of all, the cost of vector reallocation really didn't factor into the performance after the first few rounds of optimization, so adding a reservation step ended up being unnecessary. Furthermore, getting the correct value to pass to v.reserve() would mean making two passes over the dictionary, one to get the number of entries in the dictionary and set the vector reservation size, and another to fill the dictionary itself. The alternative would have been to make a guess as to the number of entries in the dictionary based on the total file size and the average length of each entry. Fortunately, it never came to that.
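
For the curious, the guessing alternative might have looked something like the sketch below. This is not code from the series: the helper names EstimateEntryCount and ReserveForFile are made up for illustration, and the average entry length is a number you would have to measure or guess yourself.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the dictionary program): guess how
// many entries a file holds from its size in bytes and an assumed
// average entry length in bytes.
std::size_t EstimateEntryCount(std::size_t fileSizeInBytes,
                               std::size_t averageEntryBytes)
{
    if (averageEntryBytes == 0) return 0; // no basis for a guess
    // Round up so a short trailing entry is still counted.
    return (fileSizeInBytes + averageEntryBytes - 1) / averageEntryBytes;
}

// Hypothetical helper: reserve space based on the guess. If the guess
// is too low, push_back still works; the vector simply reallocates.
template <typename T>
void ReserveForFile(std::vector<T>& v,
                    std::size_t fileSizeInBytes,
                    std::size_t averageEntryBytes)
{
    v.reserve(EstimateEntryCount(fileSizeInBytes, averageEntryBytes));
}
```

A wrong guess costs nothing in correctness. An underestimate just means a reallocation or two, and an overestimate wastes some memory, which is part of why a single guessed reservation is cheaper than a counting pass.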

Another commenter suggested preprocessing the file. That is also a valid technique, but I intentionally avoided it partly for expository purposes (it would have removed much of the challenge), and partly because I wanted to be able to update the dictionary by merely replacing the dict.b5 file.

Commenter CornedBee suggested using the wcsrchr function as an alternative to the missing std::rfind method. Note, however, that the DictionaryEntry::Parse method takes a string in the form of a start and end; it is not a null-terminated string. Passing this to wcsrchr would have resulted in quite undesirable behavior.
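
To illustrate the difference, here is a minimal sketch of a backward search that respects a start/end range and never looks for a null terminator. It uses std::find over std::reverse_iterator, and it is not necessarily what the series itself did. The function name FindLast and its return convention (returning end when nothing is found) are assumptions made for this example.

```cpp
#include <algorithm>
#include <iterator>

// Hypothetical helper (not from the dictionary program): find the last
// occurrence of ch in the range [start, end). Returns a pointer to the
// match, or end if ch does not appear. The range need not be
// null-terminated, because the search never reads outside [start, end).
const wchar_t* FindLast(const wchar_t* start, const wchar_t* end, wchar_t ch)
{
    std::reverse_iterator<const wchar_t*> rbegin(end);
    std::reverse_iterator<const wchar_t*> rend(start);

    auto it = std::find(rbegin, rend, ch);
    if (it == rend) return end; // not found

    // base() points one past the element the reverse iterator refers to.
    return it.base() - 1;
}
```

By contrast, wcsrchr starts by locating the null terminator, so handing it a pointer into the middle of a larger buffer makes it search well past the intended end.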

Comments (5)
  1. Andy says:

    So far this has been one of my all-time favorites of your post themes. I eagerly await every installment in this series.

  2. Frank says:

    Andy: I second that.

    The series was interesting to me in three ways: it had to deal with character encodings that I have somehow mercifully managed to avoid in my work, it dealt with performance optimizations, and it dealt with nice Win32 GUI tricks to serve the user better.

    I would love to see more!

  3. Cooney says:

    You can still preprocess the file and make the upgrade seamless: all you have to do is come up with a new extension, say .b5-chewed, and process the .b5 file into .b5-chewed every time it’s newer than the .b5-chewed file.

    Of course, this will increase startup time once in a while. You can make that a win by telling the user that you noticed the new file and are chewing on it. After all, they just stuck the new file in there, right?

  4. This assumes of course that you have write permission into the directory that contains the raw data. If the administrator updates the b5 file, the user won’t be able to save out the "chewed" file.

    (If you respond that the chewed file should go into a separate directory, then you have the problem of what to do when there are multiple b5 files in the system.)

    But the real reason was simply that I didn’t want to create another file and have to manage it.

  5. Miral says:

    "Commenter CornedBee suggested using the wcsrchr function as an alternative to the missing std::rfind method."

    But, like Anders Dalvander pointed out, you don’t need std::rfind — just use a std::find and std::reverse_iterator combo.

