Excellent blog about Windows and Unicode


Michael Kaplan has probably forgotten more about Unicode than most people know. He knows about the mysterious placement of the Won character in the Korean character set, and the same for the Japanese Yen character, what the invariant locale is, why Korean text sorts strangely if you pass the NORM_IGNORENONSPACE flag, and other strange and wonderful dark corners of keyboard layouts, character sets, and collation.

Around Microsoft, Michael is the local authority on Unicode. It's great that he's sharing his deep knowledge with the rest of us. (Note: I said "local" authority. Just because he's our main guy doesn't mean that he's your primary contact, too.)

Comments (23)
  1. Anonymous says:

    You know…. people are going to wonder about me when I’m a Linux geek and I’m caught reading all these (great) Windows blogs :-)

    Not only is Mr. Kaplan a master of the black Unicode arts, he can also write very well.

    I think a lot of the hard part about Unicode is that almost nobody knows anything about other languages, so they can’t tell if they screwed up their Unicode handling or not.

    Heck, it wasn’t until I had to paste German, cut from Excel, into a Python script running in a DOS window, that I had to learn about it. Then I had to convert that into the appropriate HTML entities, boy that was fun. NOT! That was more my fault for being Unicode-clueless, as everything supported it fine if I said the right magic word.

  2. Anonymous says:

    So is he the guy who rammed all of the Wingdings and PC ANSI character crap through the Unicode standards process? Other Unicode members still bitch about that.

  3. Anonymous says:

    Why ask me? Why not ask him directly?

  4. Anonymous says:

    Hi So Boy,

    No, I am not the man behind the Zapf Dingbats that were added to Unicode, that was not even Microsoft (most of the MS Wingdings are not even in Unicode, they are just symbols).

    The people who champion the Dingbats are font companies like Monotype and Adobe (also full members of Unicode).

    What does MS do there? We just nod and act in supportive ways of the needs of other members, since it’s only through cooperation that a huge committee can get work done. :-)

    Not sure which PC ANSI crap you mean, so I can’t comment on that. But Raymond is right, why not ask me directly?

  5. Anonymous says:

    Raymond: Well, he is only the *local* authority . . . so that means we have to get _you_ to go ask him. :-)

    The use (or lack of use) of an invariant locale caught me out once when parsing a script. For some reason, one of my Dutch users could not get my music player to work. I finally narrowed it down to setting an invalid volume – 10, when the range was 0 to 1.

    Where was it getting this number? Why, the script file that specified the volume! It was written something like this:

    Effect(PingPong200)
    {
        Stage
        {
            Pan(-1.0) Volume(1.0) Delay(0.2)
        }
        Stage
        {
            Pan(1.0) Volume(0.96) Delay(0.2)
        }
        …etc…

    Anyway, so it was reading the file in and translating 1.0 into 10! I never expected it to do that, but in the Netherlands they use a comma as a decimal "point" and so it was just ignoring the dot.

    It turned out that all I had to do was specify the invariant culture when reading the file. Fortunately this was a .NET project so no messing around figuring out how – it was just a parameter that I’d just never known I had to set. I guess it’s just one of those things you have to learn through experience. :-)
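The failure described in this comment can be reproduced with a small sketch. This is not the .NET parser the commenter used; `parse_number` is a hypothetical helper that only models the relevant behaviour, on the assumption that a culture-aware parser discards the group separator before reading the digits.

```python
def parse_number(text, decimal_sep=".", group_sep=""):
    """Parse a decimal string under the given separator conventions.

    Hypothetical helper for illustration: a simplified model of a
    culture-aware parser, not a real library API.
    """
    if group_sep:
        # Grouping marks carry no value, so they are simply dropped.
        text = text.replace(group_sep, "")
    if decimal_sep != ".":
        # Normalise the decimal mark to the form float() expects.
        text = text.replace(decimal_sep, ".")
    return float(text)


# Dutch-style conventions: ',' is the decimal mark, '.' groups thousands.
# The dot in "1.0" is treated as a group separator and thrown away.
dutch_volume = parse_number("1.0", decimal_sep=",", group_sep=".")

# Invariant-style conventions: '.' is the decimal mark, no grouping.
invariant_volume = parse_number("1.0")

print(dutch_volume, invariant_volume)
```

Under the Dutch conventions the script value "1.0" comes back as ten, while the invariant conventions give one, which is why pinning the parser to fixed conventions fixes the script reader for every user.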

  6. Anonymous says:

    GreenReaper: I guess you just played "Ugly American". I am almost sure that a good many European countries use "," (comma) as their decimal separator and "." (point) as their thousands separator. For sure, this applies (at least) to Spain, France and Finland (I’ve been there).

    After all, I guess what Gene Cash said is true: troubles arise because we have no idea about how other languages are.

  7. Anonymous says:

    Michael: I think, by PC ANSI, he means the 0x2500 code page, with box drawings and other stuff like that. Things that appear if you send ‘nonprinting’ or high-bit symbols to a PC console.

    Eric: by the looks of things, the script file is internal, and factory-set. Not using the invariant locale would cause a factory-built script to work differently in different countries, which is an absurd result if I’ve ever seen one. Note also that you can’t change the locale of most programming languages: you MUST use ‘.’ as the decimal point in C, C++, Java, Python, Perl, PHP, SQL, Visual Basic… By this logic, any scripting language, no matter what it is, should use the invariant locale for its own parsing.

    Vorn

  8. Anonymous says:

    In the UK the ‘.’ is used for decimal separation.

    And anyway we call them floating point, not floating comma.

    Ivan.

  9. Anonymous says:

    Eric Duran: No, this does not apply to France. The ‘.’ is used as a decimal separator, whereas there are no signs to indicate thousands. Although you can use a space to make it clearer.

  10. Anonymous says:

    Maxime: That’s interesting, the few times I’ve been to France, I think I may have seen both in use. Anyway, the locale for France (and most French locales in different areas, for that matter) in Windows seems to use ‘,’. Do you all change it before use in Excel and the like?

  11. Anonymous says:

    CN: Maxime’s wrong. The official use is comma for decimal separator, non-breaking space for thousands separator (source: <i>Lexique des règles typographiques en usage à l’Imprimerie Nationale</i>)

    Non-locale-aware software definitely made people more or less accustomed to the point as a decimal separator (I’m using "software" pretty loosely here; for instance, few hand-held calculators (except perhaps the very high end from HP’s glorious era) bother to do anything but the point). Actually "excessively" locale-aware software also helped people confuse the point and the comma, to the point that one accepted practice is to use the comma in handwriting, and the dot in print. The most prominent example of this "obsessively" locale-aware software is Excel, whose private remapping of the point on the numeric pad into a comma is logical, useful, confusing and unnerving, all at the same time.

    Commas as a thousands separator is something that’s going to confuse the heck out of me forever. It just looks <i>wrong</i> ;-)

  12. Anonymous says:

    The comma ‘,’ must be used in France as the decimal separator. That’s what properly localized programs (e.g. the Windows calculator, or Excel) use anyway. However, many people are also using the dot ‘.’, mainly because we use lots of programs that do not support anything else.

    Juggling different programs can be very painful at times. For example, copy-and-paste a table from one program to another one, and you won’t be able to use it unless you manually change each ‘.’ into a ‘,’ (or the opposite).

    So, the situation is:

    – good programs are properly localized

    – most programs aren’t and force you to use a dot

    – really smart programs are capable of handling both conventions at the same time. Very few do however.

  13. Anonymous says:

    I forgot to say that the French also use the dot as the thousands separator :-)

    Fortunately, no program that isn’t properly localized supports the comma for this task anyway. At least to my knowledge

  14. Anonymous says:

    Well, if this topic is segueing into locale settings for human readable numerals, then why don’t locale settings include a ten-thousands separator? In fact a list of ten-thousands separators. The word for ten thousand is different from the word for hundred million, they’re both different from the word for (American not traditional English) trillion, etc. I think the word for trillion is the largest used in daily life, since the total amount of all personal savings is somewhere between one thousand trillion and two thousand trillion, and even the national debt doesn’t yet need to use the word for ten quadrillion. But Rubik’s Revenge has a bunch of those words printed on the package, in the numeral for the number of possible configurations.

    Surely Mr. Chen’s ancestors want to ask him this question too ^_^ But he’s probably not to blame. I think we need to ask the IEEE Posix committee.

  15. Anonymous says:

    Hey GreenReaper — I do have a blog and people can always suggest topics there. No need to put Raymond in the middle. :-)

    Hey Vorn, the code page you are thinking of is actually the default OEM code page for a machine with a US English default system locale. Definitely not an ANSI code page.

    Also Vorn — for the languages you list — if you call an API that converts strings to numbers, then you want a locale aware version. Otherwise, you take what you get….

    Hey Norman — they do include such a separator ability — add it to my list and I’ll post about it some day :-) : http://blogs.msdn.com/michkap/articles/271003.aspx

  16. Anonymous says:

    12/21/2004 12:55 PM Michael Kaplan

    > they do include such a separator ability

    OK, I didn’t look deeply enough, I just happened to notice some things in common in the output of a Unix locale command and the Windows registry. I saw a thousands separator, but didn’t see a ten-thousands separator or list thereof.

    > add it to my list

    I doubt it. When I’ve tried posting Kanji to Mr. Chen’s blog they have turned into mojibake, and I think most likely the same would happen if I posted them to your suggestion list. Besides, I’d have to pay the Japanese price for another Rubik’s Revenge in order to get the list of all those bignum Kanji again.

  17. Anonymous says:

    <i>Also Vorn — for the languages you list — if you call an API that converts strings to numbers, then you want a locale aware version. Otherwise, you take what you get….</i>

    I mean in the sense that you can’t, for instance, do this:

    <b>double k = 1,0;</b>

    If you’re writing a language, be it the Next Big Thing or a tiny little script reader for a music player, you want to use the invariant locale to interpret the language, so that anything written in your language can be used, without changes, worldwide. When you’re writing code to display things, you want to be locale aware.

    Vorn
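The split Vorn describes, fixed conventions for reading the language and local conventions for showing values to the user, might look like the sketch below. Both function names are hypothetical, and the display side is deliberately simplified to swapping the decimal mark. The one fact relied on is that Python's built-in `float()` always expects '.' and ignores the user's locale.

```python
def read_script_value(token):
    """Interpret a numeric token from a script file.

    float() always expects '.' as the decimal mark, whatever the user's
    locale is, so the same script parses identically everywhere.
    """
    return float(token)


def format_for_display(value, decimal_sep=",", digits=2):
    """Render a value for the user with a caller-supplied decimal mark.

    Hypothetical helper: real display code would take the separator
    from the user's regional settings instead of a parameter.
    """
    text = f"{value:.{digits}f}"  # always produced with '.'
    return text.replace(".", decimal_sep)


volume = read_script_value("0.96")   # parsed the same way worldwide
shown = format_for_display(volume)   # shown the way the user expects
print(volume, shown)
```

Keeping the two functions separate makes it hard to feed user-facing text back into the script parser by accident.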

  18. Anonymous says:

    Meh. Can’t use HTML, can I.

    Vorn

  19. Anonymous says:

    First: Raymond, your blog is GREAT. Really.

    About French locale (sorry Raymond, I know you are the MITM here), standard use for decimal is normally , but many are (were) using . because it is easier, for example, because it is on the numeric keypad.

    French (and others) Excel 5+ (IIRC) and Access 2+ changed this by erasing the normal behaviour and forcing VK_DECIMAL to emit a comma. Perhaps it was easier for beginners, but as a veteran user I have always found it a Bad Idea ™, and I know several people who actually turned this off in the regional preferences (some even asked us to provide them with a QWERTY keyboard to solve this; of course it was unnecessary).

    Thousands are normatively separated by a quarter-em space ("Chep" is right here), but many people think it is a dot (same basic story as http://weblogs.asp.net/oldnewthing/archive/2004/12/21/328759.aspx). Also, since fonts usually do not have u2005, and a normal space looks kind of ugly, the dot is probably a good workaround.

    Something funny here is that standard for French says that the thousand separator is to be applied both at the left and at the right, so π should be 3,141 592 653 589 etc. (and if you see boxes, update your font ;-)). However I never succeeded at explaining this to the Unicode-involved people, or the Locale interface builders.
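The two-sided grouping described above can be sketched in a few lines. `group_both_sides` is a hypothetical helper, it assumes an unsigned string of digits with at most one decimal separator, and it defaults to U+2005 (the quarter-em space mentioned in this comment) as the group mark.

```python
def group_both_sides(number_text, decimal_sep=",", group_sep="\u2005", size=3):
    """Group digits in blocks of `size` on both sides of the decimal mark.

    Hypothetical helper; assumes an unsigned digit string such as
    "3,141592653589" with at most one decimal separator.
    """
    whole, _, frac = number_text.partition(decimal_sep)

    # The integer part is grouped from the right: 1234567 -> 1 234 567.
    whole_groups = []
    while whole:
        whole_groups.insert(0, whole[-size:])
        whole = whole[:-size]

    # The fractional part is grouped from the left: 141592 -> 141 592.
    frac_groups = [frac[i:i + size] for i in range(0, len(frac), size)]

    result = group_sep.join(whole_groups)
    if frac_groups:
        result += decimal_sep + group_sep.join(frac_groups)
    return result


# An ordinary space is used here so the result displays in any font.
pi_text = group_both_sides("3,141592653589", group_sep=" ")
print(pi_text)
```

Passing `size=4` would give the ten-thousands grouping asked about earlier in the thread, at least for the digits themselves.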

    I know nothing about Finnish, but I happen to have learned much these days about usage in Spain. I am not living in a Spanish-speaking part of Spain, and furthermore Spanish (Castilian) is more widely spoken in America, so my comments are not intended to cover the whole Spanish-speaking area, but usage in Spain is not fixed when it comes to the separators. As for French, there are normative rules that seem to be , for decimal and . for thousands (except in years), but the most used convention is to have ‘ for the decimal separator, like in some parts of Switzerland. Of course with the euros and the return of decimals in prices we now have a lot more of them floating around. ‘ is usually spelled "con" in Castilian Spanish, which means "with" (similarly "amb" in Catalan), particularly with monetary amounts; while , will be spelled "coma".

    Vorn: You are right about "most" programming languages sticking to . as decimal separator. An "interessant" exception here is the Excel or Access UI (not VBA), following the glorious 1-2-3 ancestor, which chose to enforce the use of the locale parameters for the UI (as presented to the public). It is not too much of a problem for Excel (since the people who deal with both systems, international macros or VBA as well as normal functions, are not numerous), but it is really a problem for Access, since SQL (which does use .) is just around any corner.

  20. Anonymous says:

    12/22/2004 10:57 AM Antoine

    > (and if you see boxes, update your font ;-))

    Parentheses and smileys noted, but what font should be updated to what? I’m viewing this with Windows XP SP2, and Internet Explorer already displays Mr. Chen’s pages with a font chosen by the pages rather than Internet Explorer’s defaults. I haven’t studied CSSes in order to know what font is in use, and don’t know where to get a newer version of whatever font it is.

  21. Anonymous says:

    The point was that I wrote π with a u2005 every three digits, and I expected it to be displayed with boxes instead of a thin space.

    You are correct, I believe there is no good solution (except really black art like cut-and-paste into some editor; or perhaps browsing through Word).

    I do not believe we will see a version of Verdana (I think it is the font used, at least for me) which will include uni2005, at least in the near future. And as far as I know IE does not allow local overriding of the style sheet.

  22. Anonymous says:

    Does Tools, Internet Options, Accessibility, "Format documents using my style sheet" not work?

  23. Anonymous says:

    Yeah! Great thing. Not very advertised (and will require a dose of CSS practice to handle this case), but now I do know it. ;-)

    Another way is to "kill" the styles using the same dialog sheet, then using a default font that happens to have the character. Less nice, but easier to do. Anyway it is just a variation on the same waltz.
