Developing a Chinese/English dictionary: Introduction


The other day, one of my colleagues mentioned that his English name "Ben" means "stupid" in Chinese: 笨/bèn/ㄅㄣˋ. (His wife is Chinese; that's why he knows this in the first place.) Knowing that the Chinese language is rich in homophones, I fired up my Chinese/English dictionary program to see if we could find anything better. (Unfortunately, the best I could come up with was 賁/贲/bēn/ㄅㄣ, which means "energetic".)

Ben took his appellative fate in stride; he was much more interested in the little dictionary program I had written. So, as an experiment, instead of developing tiny samples that each illustrate a very focused topic, I'll develop a somewhat larger-scale program (though still small by modern standards) so you can see how multiple techniques come together. The task will take many stages, some of which may take only a day or two, others of which can take much longer. If a particular stage is more than two or three days long, I'll break it up with other articles, and I'll try to leave some breathing room between stages.

Along the way, we'll learn about owner-data (also known as "virtual") listviews, listview custom-draw, designing for accessibility, window subclassing, laying out child windows, code pages, hotkeys, and optimization.

If you're going to play along at home, beware that you're going to have to install Chinese fonts to see the program as it evolves, and when you're done, you'll have a Chinese/English dictionary program, which probably won't be very useful unless you're actually studying Chinese...

If you're not into Win32 programming at all, then, well, my first comment to you is, "So what are you doing here?" And my second comment is, "I guess you're going to be bored for a while." You may want to go read another blog during those boring stretches, or just turn off the computer and go outside for some fresh air and exercise.

Those who have decided to play along at home will need the following: a copy of the CEDICT Chinese-English dictionary in Big5 format (note: Big5 format) and the Chinese Encoding Converter source code (all we need is the file hcutf8.txt). We'll start digging in next time.

Comments (28)
  1. Anonymous says:

    Maybe also include a link for installing Chinese Fonts?

    I’ll be curious about the optimizations.

    How long are the "many stages", approximately?

  2. Anonymous says:

    When you say "Win32 programming" does this mean you’re not going to use MFC? If this is correct, I’m curious to know why you’re not using MFC.

  3. Anonymous says:

    Why no MFC? Based on our kind author’s previous writings, I’ve got a few guesses:

    – It’s not as educational. The Old New Thing clearly seems built to scratch the author’s desire to make other people better Windows programmers. Probably because he’s seen so many bad Windows programs. MFC is built on the raw Win32 API. If you understand the raw Win32 API you’ll be a better MFC programmer. However, if you start and end with MFC you’ll be completely unable to debug deeper problems.

    – For many applications, raw Win32 is plenty. If the benefit of MFC is minor enough, why not write raw Win32 and benefit from understanding every single line of code in your program? For a simple dictionary program, do you really get any real benefit from a full MVC framework? Do the benefits outweigh the piles of automatically generated code and the monstrous supporting libraries you add?

    – Finally, there is a simple worldview issue: Minimize external requirements, understand every line of your code. MFC is a big beast. Using it adds a lot of assumptions to your world. Modifying MFC at a deeper level is very tricky and likely to break if you upgrade. The closer you are to the OS, the fewer details between you and the OS to fight when things go wrong. As someone who tried to teach MFC’s Print Preview code to live in a dialog box, I can appreciate this.

    Of course, this is all my guesswork and (I suspect) no small amount of my own beliefs projected onto the author, so take it with a lump of salt.

  4. Anonymous says:

    Raymond,

    I think it’s great that you’re willing and able to provide so many samples illustrating various, often obscure bits of Windows information. I was just wondering, where do you find the time? :) Do you do it in your spare time, or is it part of your job? Apologies in advance if this is too forward, I just thought that you must spend a significant amount of time on your blog and was curious.

  5. Anonymous says:

    It’d be just my sort of luck if "Reu" (or ‘Roo’, ‘Ru’, etc) turned out to be Chinese for "very"…

  6. Anonymous says:

    How about 奔 (ben1), which means running or striding?

  7. Anonymous says:

    The CEDICT link has UTF-8 already, why convert?

    I presume this will be using WCHAR APIs, and hence NT-only. On the other hand, you mention codepages, so maybe…

  8. Anonymous says:

    Sounds great – can’t wait

  9. Anonymous says:

    Reuben, I don’t know the character off-hand, but I think de-shelled hard boiled eggs cooked in a meat broth are called "ru dan". Not quite sure what the process is called – pickling?

  10. When I originally wrote this program the UTF-8 version wasn’t available.

    Scott, I answered this question last year: http://blogs.msdn.com/oldnewthing/archive/2004/11/19/266664.aspx#266840

  11. Anonymous says:

    Though I don’t program in Win32, I DO program in C, using VC98 from the command line. I read this blog because of the excellent learning experience.

    My earliest exposure to C was using the Desmet compiler, which fit on a single 5-1/4" floppy disk.

    I can’t get to Alan De Smet’s web site — it’s blocked where I’m working/spending time. Is there a connection between Alan De Smet and the Desmet compiler?

  12. Anonymous says:

    Question (paraphrased): "If you’re not in to Win32 programming at all, then what are you doing here?"

    Because you have many interesting things to say other than on hard-core coding topics. I’m lobbying for a light-and-fluffy Raymond Chen blog. Please speak slowly and use small words…

  13. Anonymous says:

    How long have you been living stateside?

  14. To install Chinese fonts, go to Regional Options / Languages. This stage of the series will take about two weeks. Other stages will be shorter.

    Why not use MFC? Because my goal is to teach people what happens *below* MFC/ATL/WTL/etc.

  15. Anonymous says:

    In Cantonese Ben is often phonetically translated to "賓" which means "visitor/guest." Your friend might like this one better.

  16. Anonymous says:

    Raymond Chen is going to be developing a Chinese dictionary over the next while. This is a really cool…

  17. Anonymous says:

    This sounds like something good!

    I’ve longed for a PC or PPC enabled version of the excellent ABC Chinese/English Dictionary by John DeFrancis (http://www.amazon.com/exec/obidos/tg/detail/-/0824821548/qid=1115754338/sr=8-1/ref=pd_csp_1/104-5971204-0497514?v=glance&s=books&n=507846) for a while — and this project sounds like a step in that direction, albeit with the CEDICT dictionary.

  18. Anonymous says:

    Raymond Chen is running a series of articles about how to build and optimize the startup time of…

  19. Anonymous says:

    FYI, in Cantonese the closest match for "Ben" is 病(sickness).

    And yes, about a dozen or so English names do have homophones with undesirable meanings in Chinese, so some of us do want to avoid choosing them.

  20. Anonymous says:

    I want to go on the record and note that I will not be developing a Chinese/English Dictionary, in unmanaged…

  21. Anonymous says:

    If you translate the phonetic sound of my name into French and back into English, you get "Broken" (Casse).

  22. Anonymous says:

    bramster: Nope, the Desmet compiler has nothing to do with me or any of my immediate family. It must have been well loved (or perhaps loathed), because you’re not the first person to ask me (or my brothers) about it. If I recall, it’s a compiler from the mid-80s. At the time I would have been about 10 years old. I can only wish to have been such a prodigy. Me? I’m Just Another Programmer. I’m curious why your work blocked my site though…

  23. Anonymous says:

    1. Among the Japanese-Chinese characters that can be pronounced "ben", a frequent one is one that means both feces and convenience. For example "benjo" means toilet (feces place) and "benri" means convenience. I’ve read that Nelson couldn’t believe that a single character had both of these meanings, so his character dictionary has two separate entries for the same character. Well then there was the time I was in London and saw a sign with English words and an arrow pointing to "Public Conveniences" and it’s pretty obvious what kind of conveniences they were.

    2. > you’ll have a Chinese/English dictionary

    > program, which probably won’t be very useful

    > unless you’re actually studying Chinese…

    Or English.

    3. Monday, May 09, 2005 10:19 AM by Alan De Smet

    > if you start and end with MFC you’ll be

    > completely unable to debug deeper problems.

    Sure, that’s why you have to learn assembly language for every machine you use too. (Mr. Chen has previously blogged about this same fact.) Nonetheless do you think *everyone* needs to learn so many assembly languages and BIOS interrupts and DOS interrupts and NT native APIs and Windows APIs? Some people use C because they can develop applications faster and it works well enough. Some use VB for the same reason, some use Perl for the same reason, some use Java for the same reason, some use MFC for the same reason, and some will use VC++2005 for the same reason.
