The best way to process Unicode input is to make somebody else do it


Andrew M asks via the Suggestion Box:

I was hoping you could address how to properly code Unicode character input. It seems like a lot of applications don't support it correctly.

I'm not sure I understand the question, but the answer is pretty easy: Don't do it!

Text input is hard. It should be left to the professionals. This means you should use controls such as the standard edit control and the rich edit control. Properly converting keystrokes to characters involves not just the shift state, but the management of various input method editors, some of which are quite complicated. For example, the IME Pad lets the user draw a Chinese character with the mouse (or if you're lucky, the stylus), and then it will take the result and try to figure out which character you were trying to write and generate the appropriate Unicode character.

Other IMEs will generate provisional conversions of phonetic text into Unicode characters, and as more input is received, they can go back and revise their previous guesses based on subsequent input. You definitely don't want to get involved in this. Just leave it to the professionals.

Postscript: For those who have never used a phonetic IME, here's how a hypothetical English phonetic IME might work. Let's pretend there's an English phonetic keyboard with keys labeled with various phonemes. (Instead of IPA characters, I will use traditional American phonetics.)

You type    Result
ә           Uh
t           Ut
ĕ           A te
n           A 10
sh          A 10 sh
ә           A 10 sha
n           A tension
o           A tension o
l           Attention all

Notice how the IME keeps updating its guess as to what you're trying to type as better information becomes available. The text is underlined since it is all provisional. During the input, you can hit the left-arrow to go back to any part of the provisional text, hit the down-arrow, and see a list of alternatives, at which point you can override the guess with the correct answer. For example, if you really wanted to write "A tension all", you would arrow back to the word "Attention", hit the down-arrow, and select "A tension" from the menu. Eventually, you reach the end of a phrase or sentence, look over the provisional text, and after making any necessary corrections, you hit Enter, at which point the text is committed into the edit control and a new string of provisional text begins.
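
To make the "provisional text" idea concrete, here is a minimal sketch of the hypothetical English phonetic IME described above. Everything in it is invented for illustration: the names `ToyComposition`, `EXAMPLE_KEYS`, `EXAMPLE_GUESSES` and `GUESSES` are assumptions, and the lookup table merely replays the example from this article. A real IME relies on dictionaries and language models, and it also offers the list of alternatives, which this sketch leaves out.

```python
# Toy model of IME composition. The table below is hypothetical: it only
# replays the example from the article and is not how a real IME guesses.
EXAMPLE_KEYS = ["ә", "t", "ĕ", "n", "sh", "ә", "n", "o", "l"]
EXAMPLE_GUESSES = [
    "Uh", "Ut", "A te", "A 10", "A 10 sh",
    "A 10 sha", "A tension", "A tension o", "Attention all",
]

# Map each prefix of the key sequence to the guess shown for that prefix.
GUESSES = {
    tuple(EXAMPLE_KEYS[: i + 1]): guess
    for i, guess in enumerate(EXAMPLE_GUESSES)
}


class ToyComposition:
    """Holds uncommitted (provisional) input and the committed text."""

    def __init__(self):
        self.keys = []        # phonemes typed since the last commit
        self.committed = ""   # text already handed to the edit control

    @property
    def provisional(self):
        # Re-derive the whole guess from all keys typed so far, so that
        # later input can revise earlier output. Unknown sequences fall
        # back to the raw keys.
        return GUESSES.get(tuple(self.keys), "".join(self.keys))

    def feed(self, key):
        """Add one phoneme and return the updated provisional text."""
        self.keys.append(key)
        return self.provisional

    def commit(self):
        """Simulate Enter: commit the provisional text and start over."""
        text = self.provisional
        self.committed += text
        self.keys = []
        return text


if __name__ == "__main__":
    comp = ToyComposition()
    for key in EXAMPLE_KEYS:
        print(key, "->", comp.feed(key))
    print("committed:", comp.commit())
```

The point of the sketch is that the provisional text is recomputed from the entire uncommitted input on every keystroke, rather than being appended to one character at a time. That is why an application that tries to translate individual keystrokes into characters by itself cannot reproduce this behavior.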

Comments (23)
  1. Psa says:

    Slightly offtopic, but do you know if there’s any end-user documentation for the IMEs that come with Windows?

    I’m learning mandarin and it took me ages to work out that you need to type "v" in the IME to get a pinyin "ü" (that’s supposed to be a u with a diaeresis if it doesn’t show up).

  2. mmmh says:

    Is it always possible ?

    For example, is it possible to implement a complex control like the VS2005 syntax coloring edit control resorting only to the basic text editbox and/or the rtf-editor ?

  3. Guillaume says:

    In one console application, I use ReadConsoleInput to handle all things Unicode for me. Works fine.

    But if a stream is redirected in my application, stdin is not a console anymore and I lose ReadConsoleInput and all the Unicode goodies it provides (most of which I wasn’t even aware of, mind you).

    Any tips on how to handle this ?

  4. Triangle says:

    What happened to the Raymond Chen who said “Programming is hard because nobody said it would be easy” ?

    [No sense making something harder than it needs to be. -Raymond]
  5. Triangle says:

    [No sense making something harder than it needs to be. -Raymond]

    What about the shell and COM you mentioned yesterday, wouldn’t it be simpler to allow objects created by one thread to be used by other threads?

    [You’re saying it’s just as easy to write a free-threaded object as a multi-threaded object? My experience suggests otherwise. -Raymond]
  6. Not a nitpicker says:

    Actually, it is just as easy to write a free-threaded object as it is to write a multi-threaded object.  Is it safe to assume you meant apartment-threaded instead of multi-threaded?

    [Right, sorry. free-threaded vs. single-threaded. -Raymond]
  7. Triangle says:

    [You’re saying it’s just as easy to write a free-threaded object as a multi-threaded object? My experience suggests otherwise. -Raymond]

    I mean using the object – not implementing it.

    [So you believe it should be more important to make using shell extensions easier at the expense of making it harder to write them. It’s a balance we’ve already discussed a few years ago; no point rehashing it. -Raymond]
  8. This looks very much like the way T9 input on mobile phones seems to work.

  9. MS says:

    "This looks very much like the way T9 input on mobile phones seems to work."

    It’s the same story on some of the BlackBerry phones I’ve done a lot of coding for.  The predictive nature they use is pretty efficient if you take the time to learn it.  Using the standard edit controls there gains you all of this for free; of course, being a completely closed-in Java solution, you generally can’t write new input methods.

  10. I’ll probably be dead wrong but last time I checked, EDIT and RICHEDIT_CLASS didn’t provide support for huge files, e.g. windowing/virtualization, maybe like WC_LISTVIEW does.

    Whenever I needed to wade through weeks of logs and traces, I came darn close to trying to write it myself.

    It’s not so frequent anymore, though, now that we have multi-core and dirt-cheap RAM and I can keep working while the editor is catching breath. (Will notepad.exe ever support /3G?)

  11. brian says:

    Some people think they can do a better job than the programmers at Microsoft.  When I meet people like that I say "Well you obviously can’t, ’cause if you could you’d be working for them."

  12. Tom says:

    If it was always possible to hand off work to other people, we wouldn’t need to write much code, would we?  Of course the edit and RTF controls aren’t suitable for every possible situation.  But you should definitely try to use them *if possible*.

    I had to add proper IME support to someone else’s custom edit control not too long ago and it wasn’t that hard.  I paid attention to WM_IME_COMPOSITION and used the Imm*() API.

    Here’s the example I used as a guide – IME support in the context of a game:

    http://web.archive.org/web/20061109141509/http://www.libsdl.org/pipermail/sdl/2002-October/049962.html

  13. Jules says:

    "the answer is pretty easy: Don’t do it!"

    Good advice, but it’s clearly not always possible to follow it.  Many applications with non-trivial user interfaces will require something more advanced than either of these controls will handle (e.g. automatic text formatting, graphical variations like visible whitespace, etc.).

    As an example, one application I intend to write in the near future will require a text editor with automatic highlighting (like a syntax highlighting editor) combined with support for simple text formatting (e.g. choice of a few predefined font styles, underline and italics, first line of paragraph indents, etc.)

    This seems to me to be beyond the capabilities of the existing controls.  The project will be distributed as shareware and will likely not earn a huge amount of cash, so third party controls seem to be out.  This means I *need* to write something that will allow the user to enter text.

    So how do I do this?  Frankly, not being in the slightest bit familiar with IME, I don’t have a clue.

  14. Sven Groot says:

    There are legitimate cases for writing your own text editors. Are the Visual Studio editor and Word not examples of that? How do they deal with input, then?

    The nice thing about the IME is that it sends you window messages as it’s composing the string to let you know what it’s doing. This has been very useful to me in one instance. :)

    [Of all the times I’ve been asked this question, I have yet to find someone who was intending to write a text editor. If you want to write a text editor, then you get to learn about the IME messages. -Raymond]
  15. Eric C Brown says:

    Jules:

    I would *strongly* consider using the richedit control for your text editor control; preferably richedit 4.1, as it has full Text Services Framework support (the supported way to implement IMEs).  If you have more questions, contact me via my blog.

  16. Dewi Morgan says:

    From the question, I understood it as "how do you deal with an input stream that may or may not contain Unicode data?", not "how do you deal with key inputs that should map to Unicode output?" – but it was a woolly question.

    I strongly agree, if the question meant what Raymond interpreted it to, that you should avoid it like the plague. In Java 1.1 I tried, really hard, to do rich text (as a superset of unicode). After months of getting it wrong, with one bug popping up whenever I squished another, I retired from the fray, defeated, and used a Java 1.2 Swing component instead.

  17. e says:

    > When I meet people like that I say "Well you obviously can’t, ’cause if you could you’d be working for them."

    Pretty stupid comment. First because MS given its size has naturally a good number of bright heads as well as a much greater number of "standard-skilled programmers".

    Second, you take for granted that everyone’s dream is working for a big corporation in Redmond, as opposed to "working for a smaller company (and all that this implies)", "working for a company closer to your home/family", "creating your own startup" or any combination of these and other factors.

  18. Developers, developers, developers says:

    MS has a long tradition of easing the work of (external) developers. Why should this be an exception?

  19. Anony Moose says:

    Based on the "if you can do better than X then you would work for X" theory, all companies (including both Microsoft and Apple) are the worst company, because if the developers at company Y could do better than the ones working for company X then they would work for that company because company X is always the best company in the universe for all values of X. You know, logic is a funny thing.

  20. bob says:

    brian:

    Some people think they can do a better job than the programmers at Microsoft.  When I meet people like that I say "Well you obviously can’t, ’cause if you could you’d be working for them."

    @brian:

    Then Brian, we must all assume that no other good programmer exists except the ones who work or once worked for Microsoft, now do you honestly believe that? Seeing Microsoft products over the years I have strong doubts ;-)

  21. I saw Raymond Chen’s The best way to process Unicode input is to make somebody else do it and I wholeheartedly

Comments are closed.
