On the fuzzy definition of a "Unicode application"


Commenter mpz wondered why the IME cannot detect whether it is sending characters to a Unicode or non-Unicode application and generate the appropriate character accordingly.

But what exactly is a Unicode application? Actually, let me turn the question around: What is a non-Unicode application?

Suppose you write a program and don't #define UNICODE, so you'd think you have a non-Unicode application. But your program uses a control provided by another library, and the authors of that library defined UNICODE. The controls created by that library are therefore Unicode, aren't they? Now you type that frustrating character to a control created by that library. Should it generate a U+00A5 or a U+005C?

To know the answer to that question requires psychic powers. If the control takes the character and uses it exclusively internally, then presumably the IME should generate U+00A5. But if the control takes the character and returns it back to your program (say the control is a fancy edit control), then presumably the IME should generate U+005C. How does it know? It's not going to do some sort of analysis of the code in the helper library to decide what it's going to do with that character. Even human beings with access to the source code may have difficulty deciding whether the character will ever get converted to the CP_ACP code page in the future. Indeed, if the decision is based on the user's future actions, then you will need to invoke some sort of clairvoyance (and relinquishing of free will) to get the correct answer.
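
To make the loss concrete, here is a minimal sketch in Python rather than the Win32 API. It assumes the Windows best-fit behavior for code page 932, in which U+00A5 is written as the byte 0x5C. Python's own cp932 codec does not apply best-fit mappings, so that single entry is modeled by hand. The names BEST_FIT_932 and to_ansi_932 are hypothetical helpers invented for this illustration.

```python
# Hypothetical stand-in for the Windows best-fit table for code page 932.
# Only the one entry relevant here is modeled; this is an assumption, not
# something Python's "cp932" codec does on its own.
BEST_FIT_932 = {"\u00a5": b"\x5c"}  # YEN SIGN -> byte 0x5C


def to_ansi_932(text: str) -> bytes:
    """Convert text to code page 932 bytes, applying the best-fit stand-in."""
    out = bytearray()
    for ch in text:
        if ch in BEST_FIT_932:
            out += BEST_FIT_932[ch]
        else:
            out += ch.encode("cp932")
    return bytes(out)


typed_as_yen = "\u00a5100"      # the IME generated U+00A5
typed_as_backslash = "\\100"    # the IME generated U+005C

# Once the text has passed through the ANSI code page, the two are identical.
ansi_from_yen = to_ansi_932(typed_as_yen)
ansi_from_backslash = to_ansi_932(typed_as_backslash)
same_bytes = ansi_from_yen == ansi_from_backslash

# Converting back to Unicode cannot recover which character was typed.
round_tripped = ansi_from_yen.decode("cp932")
```

Whether the program ever takes that ANSI detour is exactly what the IME cannot see at the moment it has to pick a character.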

Note that this helper library might be in the form of a static library, in which case your application is really neither Unicode nor ANSI, but rather a mix of the two. Parts of it are Unicode and parts are ANSI. What's a poor IME to do?

Comments (33)
  1. Pete Cawley says:

    In my current project, there is the engine and the GUI. Parts of the engine are unicode, parts of it are ASCII and the GUI is totally unicode. As the two parts are statically linked together, it would be impossible for IME to get it right.

  2. KJK::Hyperion says:

    Raymond, isn’t the IMM supposed to send a WM_UNICHAR in any case (to a non-IME-aware window, that is)? In my experience that never happens, though.

  3. Steve says:

    Waitaminute! Is this Raymond Chen or Michael Kaplan??

    Am I getting my MS bloggers confused again?? :-)

  4. Mihai says:

    The "poor IME" should call IsWindowUnicode on the control and use that result to make the decision.

    If the control is Unicode and rest of the application is not, then the one who did the mix in the beginning should take care of the possible problems. IME "talks" to the control, not to the application, so it should give Unicode.

    One reason why you see this mixed approach is that one needs to get some of the Unicode benefits without migrating the whole application. So one can add Unicode support for user input/output, serialization, file system access, and have an almost functional application, without defining UNICODE/_UNICODE.

    So having IME taking the decision to send ANSI would be the wrong one.

  5. "If the control is Unicode and rest of the application is not, then the one who did the mix in the beginning should take care of the possible problems."

    Then all ANSI applications that use the system edit control have a lot of work ahead of them. Good luck signing them up for this new work.

  6. Mike says:

    I remember that around the time I worked on Access 2000+XP, the data-store had been converted to Unicode, but there were still bits of UI hanging around which were purely ANSI. I haven’t looked in 5 years, but it wouldn’t surprise me if such pieces were still waiting to be rewritten. Michael Kaplan could undoubtedly recount many war stories from that time :-).

  7. Mihai says:

    "Then all ANSI applications that use the system edit control have a lot of work ahead of them. Good luck signing them up for this new work."

    If I call GetWindowTextA or SendMessageA(…, WM_GETTEXT, …), the control gives me ANSI; if I call GetWindowTextW or SendMessageW, the control should give me Unicode.

    Well, in fact this is not at the control level; it is deeper. So I call GetWindowTextA, which on Unicode systems is a wrapper and calls GetWindowTextW. When that one returns, the wrapper converts the result to ANSI and gives it back to the application.

    A while ago I got something on this:

    http://www.mihai-nita.net/20050306b.shtml

    I think everything is at API level, with W doing "the work" and the A doing the wrapping.

    Sure, I have no access to the Windows sources, but "stuff" seems to happen "as if."

    Am I missing something?

  8. Mihai: And it would be too late. The app would get 0x5C instead of 0xAC or vice versa. That’s the whole point of the article.

  9. dan.g. says:

    raymond,

    any chance of a (brief) explanation of how XP is able to treat non-unicode apps as unicode via ‘regional and language options > advanced’? having always believed that the only way to display, say, chinese characters correctly was to compile with _UNICODE, this facility seems all the more remarkable.

  10. 8 says:

    "…we’re not going to use ANSI code pages any more (and use Unicode instead…" (Dean)

    except Unicode != Unicode. Now if we would all decide on UTF8… But there are always those who want UTF16…

  11. Dean Harding says:

    Why would we want to use UTF-8? It’s horribly inefficient for anything but US-ASCII… Sure UTF-16 has the crappiness with the surrogates, but then you could just move to UTF-32.

  12. Norman Diamond says:

    Wednesday, March 15, 2006 3:14 PM by oldnewthing

    > Mihai: And it would be too late. The app

    > would get 0x5C instead of 0xAC or vice versa.

    I don’t quite see the problem with Mihai’s suggestion.  As for determining whether the app gets 0x5C or 0xAC, surely that could depend on CP_ACP because the user will already be accustomed to whichever code page they’re using.  The user might even have been known to run other apps besides this one (on rare occasions) so they expect to get what their CP_ACP delivers.

    Back to the base note:

    > To know the answer to that question requires

    > psychic powers.

    So now you know that you’re well suited to writing the next IME ^u^

  13. Mihai says:

    "Mihai: And it would be too late. The app would get 0x5C instead of 0xAC or vice versa. That’s the whole point of the article."

    I still don’t get it.

    I type U+00A5 (the "correct" halfwidth yen). IME checks the control, is Unicode, sends U+00A5.

    The application calls GetWindowTextA. That is a wrapper which calls GetWindowTextW, gets U+00A5, and before returning converts the buffer to ANSI.

    If the ANSI code page is 932, it is mapped to a "best fit," U+005C (there is no real halfwidth Yen in 932).

    If the ANSI code page is something other than 932, I probably get a question mark (I don’t know all tables by heart :-)

    If the application calls GetWindowTextW, it gets the original U+00A5, all nice and dandy.

    I intentionally avoid saying the "application is Unicode"; it is all about calling the A or W version of the API.

  14. But U+00AC is not a path separator. Converting it to 0x5C would be wrong, wouldn’t it?

  15. Dean Harding says:

    What should happen is everybody should one day decide that from now on, we’re not going to use ANSI code pages any more (and use Unicode instead), and everything that *does* use ANSI dies a horrible, horrible death from that day on.

    A man can dream…

  16. Ben Bryant says:

    I agree with Mihai. Simply put:

    There is no way to tell if an application is Unicode, but you can tell if a window is Unicode. Since IME deals with the window, not the application, it should be able to generate the appropriate A or W messages.

    What am I missing?

  17. Because the application that created the window might use only ANSI functions to access the data (GetWindowTextA, etc) – if the application only manipulates the data as ANSI, does it matter that the window is marked as a Unicode window? That’s my whole point. The effective behavior is ANSI even if the window itself is "Unicode". I guess I didn’t explain this well; I’ll try again later.

  18. Norman Diamond says:

    Wednesday, March 15, 2006 11:50 PM by Mihai

    > (there is no real halfwidth Yen in 932).

    0x5C is the real halfwidth yen sign in 932.  (Real in the sense of being a character in the codepoint, not the sense of being real Japanese.  Real Japanese in this case is really a Chinese character ^u^  But the symbol is also really used in Japan now.)

    In an ANSI code page, characters aren’t U+xxxx, they’re code page codepoints.

    If the app is Unicode then U+00A5 looks OK.  If the app is ANSI then the selection of codepoint should depend on CP_ACP.

    Oh yeah I see the problem there for Unicode apps.  The IME presents a list of candidates but how can the user distinguish two of the candidates.  If the user wants a character that will be printed as a yen sign then they want to select U+00A5.  If the user wants a character that will work as a path separator then they want U+005C, even though U+005C is a character that they’ve never seen in a Japanese system because Shift-JIS doesn’t have any codepoint for it.  The IME could display a half-width yen sign for U+005C, but then how could the user distinguish which half-width yen sign they want.

    This problem doesn’t arise for an ANSI window because 0x5C is the yen sign.  This problem only arises for a Unicode window.

    Anyone want to give me a million path separators so I can send path separators to Microsoft to renew my MSDN subscription?  (Not yen, because path separators are more important than money.)

  19. michkap says:

    As I think Raymond has indicated pretty clearly, it is not as simple as an IsWindowUnicode call. Because whether the specific DLGPROC or WNDPROC is expecting Unicode is what tells whether Unicode is expected or not, as modified by the actual message in some cases (like all of those Shell common control messages that are ‘A’ or ‘W’ specific).

    And the Unicodality of the window can also be a factor, as well as the habit people have of calling WNDPROCs/DLGPROCs directly rather than going through CallWindowProc (in which case all bets are quite literally off).

    Now in the specific case of the Yen, it will seldom survive as U+00a5 due to the best fit mapping to U+005c, which is the main reason that they are treated as being equal on a Japanese system. They look the same anyway, in that case.

    Summary: if you think it’s easy then you don’t understand the problem….

  20. Mihai says:

    <<Because whether the specific DLGPROC or WNDPROC is expecting Unicode is what tells whether Unicode is expected or not>>

    Yes. But the control is Unicode and has its own WNDPROC.

  21. Norman Diamond says:

    If CP_ACP is different from 932 but has a codepoint for the yen sign then the user probably wants to map the yen sign to the same codepoint that U+00A5 maps to.  I think I have a Vista beta 1 installation that is configured this way at the moment, and maybe can experiment.  I’ve configured English-language Windows XP in a similar manner for a few friends and maybe should ask if one of them will lend their machine for an experiment.

    In a Unicode application we would want U+00A5 to remain as U+00A5 except for the path separator problem.  By the way I really do agree that the path separator is more important in the Windows environment, but that doesn’t lessen the ugliness of the problem.

  22. Ben Bryant says:

    Oh, I didn’t pay attention, this is really ALL about the path separator, in which case even knowing if the window is Unicode does not tell you what the program wants. In short, yes "Unicode application" is fuzzy, but even if it wasn’t, it wouldn’t help the Yen/separator issue. That’s the problem with this post: it implies that knowing whether the application was Unicode would have a bearing on what character to generate.

    There is NO simple solution to the U+00A5 U+005C issue, because if the text happens to be a pathname, you will need U+005C even in a Unicode program. As Mihai pointed out elsewhere, in Japan the user may have the option of using the fullwidth Yen U+FFE5 when he doesn’t want the path separator, but otherwise in a Japanese locale you generally have to drive U+00A5 to U+005C so as not to break pathnames. I’ve posted about this:

    http://codesnipers.com/?q=node/128

  23. Riffing on Raymond here, and his post On the fuzzy definition of a "Unicode application"….

    The points…

  24. Mihai says:

    Two of my previous comments did not make it, but let’s hope this one does:

    <a href="http://www.mihai-nita.net/w20060317a.shtml">http://www.mihai-nita.net/w20060317a.shtml</a&gt;

  25. michkap says:

    Hi Mihai —

    That assumes everyone in the WNDPROC/DLGPROC chain is also Unicode — a *huge* assumption, one that does not always bear out, unfortunately….

  26. mpz says:

    I am honored by Raymond taking the time to answer my question (the answer to which I guessed correctly, i.e. next to impossible).

    God I wish they wouldn’t have mucked with the ASCII area when making the Japanese codepages. Would it have been that hard to put the halfwidth yen into the 128-255 area with all the other halfwidth characters? I suspect it will take decades until all non-Unicode apps have died out and the IME can finally output the correct character.

  27. Mihai says:

    Hm, the link is all messed-up.

    Try this: http://www.mihai-nita.net/w20060317a.shtml

  28. Ben Bryant says:

    "assumes everyone in the WNDPROC/DLGPROC chain is also Unicode"

    No it doesn’t assume that (as far as my limited understanding can tell). It just supports the fact that you know whether the particular *window* is Unicode, not talking about everybody who hooks into the proc. The window is the point of reference by which the behavior would be consistent and as expected for everyone involved.

  29. Ben Bryant says:

    Even if non-Unicode charsets go away completely we will still have these yen and won sign problems. Japan and Korea display the Unicode character U+005C (backslash) differently and encode their halfwidth monetary symbol differently than the rest of the world. It will continue to be a problem as text is transferred between locales.

    Before Japan- and Korea-originated Unicode text is shared internationally, programmers have to be on the lookout to repair it so that U+005C is used only in pathnames and not where it is meant to be a monetary symbol.

  30. Norman Diamond says:

    Friday, March 17, 2006 5:38 PM by mpz

    > God I wish they wouldn’t have mucked with

    > the ASCII area when making the Japanese

    > codepages.

    ASCII already mucked with the ASCII area.  In case anyone considers the "A" of ASCII to be ambiguous, ASCII was originally called USASCII.  The name made it clear that the encoding already included national character assignments.

    Other nations made national character assignments as they needed.

    If you don’t like it, don’t buy ASCII compatible equipment, buy equipment that uses 7-bit German encodings and persuade all your friends to agree with you.

    > Would it have been that hard to put the

    > halfwidth yen into the 128-255 area

    There is no such thing.  In general the 0-127 area consists of single-byte characters but I’m not sure if that’s required.  In principle the 128-255 area consists of characters that aren’t single bytes.  Shift-JIS (code page 932) does put some single-byte characters in part of the 128-255 range but non-Microsoft character sets don’t.  For example in EUC if a lead byte is in the 128-255 range then the character length will be 2 or 3 bytes.  In Shift-JIS the character length will be 1 or 2 bytes, and the maximum number of representable characters is smaller.

  31. István Németh says:

    Hi all,

    I have recently (today) run into a similar problem.

    I am trying to hook up on some ASCII dlls and ocxs in a .NET environment (and there are some detours library too, so it will be a wonderful hack :) ).

    I have found that there are SendMessageA/W, CallWindowProcA/W and Get/SetWindowLongA/W.

    So the question: should I call SetWindowLongA to handle SendMessageA and pass the call along with CallWindowProcA, and similarly for W? (sounds logical to me) Or is there anything else I should check before hooking on some window?

    So how fuzzy is the definition of a "Unicode application"?

  32. mpz says:

    >> Would it have been that hard to put the

    >> halfwidth yen into the 128-255 area

    >

    > There is no such thing.  In general the 0-127 area consists of single-byte characters

    > but I’m not sure if that’s required.  In principle the 128-255 area consists of

    > characters that aren’t single bytes.

    Yes there is. Please have a look at the JIS X 0201:1976 standard. It specifies the most basic Japanese character set. 0-127 is ASCII (with backslash replaced by the yen and tilde replaced by overline) and halfwidth katakana are in the 161-223 area.

    *That* is the root of this problem. They shouldn’t have messed with the ASCII area.

    Microsoft then eventually chose the Shift-JIS algorithm to cram the JIS X 0208 set of characters into the unused space of JIS X 0201 (by putting the first byte of a doublebyte sequence into the 128-160 or 224-255 range). What EUC does with the characters is beside the point of the debate.

  33. Norman Diamond says:

    Sunday, March 26, 2006 12:42 AM by mpz

    > Yes there is. Please have a look at the JIS

    > X 0201:1976 standard.

    Ouch, I am amazed to see that this standard assigned the 160-223 range.  Previously I had only seen this standard cited for JIS-Romaji.  You are right about this part of it.

    > They shouldn’t have messed with the ASCII

    > area.

    Everyone (including ASCII) messed with the national character portion of the 0-127 range.  It’s somewhat like saying humans shouldn’t have messed with chimpanzee DNA, when they didn’t.  Both humans and chimpanzees made separate branches from a common ancestor.

    Telling Japanese to copy ASCII characters in that range is like telling Americans to copy German characters in that range, instead of curly braces and such stuff.

Comments are closed.
