An anecdote about improper capitalization


I’ve already discussed some of the strange consequences of case-sensitive comparisons.

Joe Beda mentioned the Internet Explorer capitalization bug that transformed somebody’s name into a dead body. Allow me to elaborate. You might learn something.

This bug occurred because Internet Explorer tried to capitalize the characters in the name “Yamada” but was not mindful of the character-combining rules of the double-byte 932 character set used for Japanese. In this character set, a single glyph can be represented by either one or two bytes. The Roman character “A” is represented by the single byte 0x41. On the other hand, the character “の” is represented by the two bytes 0x82 0xCC. (You will need to have Japanese fonts installed to see the “no” character properly.)

When you parse a Japanese string in this character set, you need to maintain state. If you see a byte that is marked as a “DBCS lead byte”, then it and the byte following must be treated as a single unit. There is no relationship between the characters represented by 0xE8 0x41 (錢) and 0xE8 0x61 (鐶) even though the second bytes happen to be related when taken on their own (0x41 = “A” and 0x61 = “a”).
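
To make the "maintain state" rule concrete, here is a minimal sketch in Python. The helper names `is_lead_byte` and `split_units` are hypothetical, and the lead-byte ranges (0x81 to 0x9F and 0xE0 to 0xFC) are my assumption about code page 932 rather than something stated above.

```python
def is_lead_byte(b):
    """Return True if b starts a two-byte character (assumed code page 932 ranges)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def split_units(data):
    """Split a code page 932 byte string into one-byte and two-byte units."""
    units = []
    i = 0
    while i < len(data):
        # A lead byte claims the byte after it, whatever that byte's value is.
        width = 2 if is_lead_byte(data[i]) and i + 1 < len(data) else 1
        units.append(data[i:i + width])
        i += width
    return units


# "A" (0x41) followed by the two-byte character 0x82 0xCC from the text.
print(split_units(b"A\x82\xcc"))
```

The point is that the trail byte is never examined on its own; it is consumed as part of the unit that the lead byte opened.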

Internet Explorer forgot this rule and merely inspected and capitalized each byte independently. So when it came time to capitalize the characters making up the name “Yamada”, the second bytes in the pairs were erroneously treated as if they were Roman characters and “capitalized” accordingly. The result was that the name “Yamada” turned into the characters meaning “corpse” and “field”. You can imagine how Mr. Yamada felt about this.
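
The difference between the two approaches can be sketched in a few lines of Python. This is an illustration only, not the code Internet Explorer used; the function names are hypothetical, and the lead-byte ranges are an assumption about code page 932.

```python
def is_lead_byte(b):
    """Assumed lead-byte ranges for code page 932."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def upper_ascii(b):
    """Uppercase a single byte if it is in the range a-z."""
    return b - 0x20 if 0x61 <= b <= 0x7A else b


def upper_bytewise(data):
    """The buggy approach: treat every byte as an independent character."""
    return bytes(upper_ascii(b) for b in data)


def upper_dbcs_aware(data):
    """Uppercase single-byte letters only; copy two-byte units untouched."""
    out = bytearray()
    i = 0
    while i < len(data):
        if is_lead_byte(data[i]) and i + 1 < len(data):
            out += data[i:i + 2]
            i += 2
        else:
            out.append(upper_ascii(data[i]))
            i += 1
    return bytes(out)


sample = b"a\xe8\x61"  # "a" followed by the two-byte character 0xE8 0x61
print(upper_bytewise(sample).hex(" "))    # 41 e8 41
print(upper_dbcs_aware(sample).hex(" "))  # 41 e8 61
```

The bytewise version turns 0xE8 0x61 into 0xE8 0x41, which is a different character entirely, while the lead-byte-aware version changes only the standalone “a”.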

Converting the string to Unicode would have helped a little, since the Unicode capitalization rules would certainly not have connected two unrelated characters in that way. But there are still risks in character-by-character capitalization: In some languages, capitalization is itself context-sensitive. MSDN gives as an example that in Hungarian, “SC” and “Sc” are not the same thing when compared case-insensitively.
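
As a rough sketch of the Unicode route, the following Python decodes the bytes first, changes case on characters rather than bytes, and encodes again. The function name `upper_via_unicode` is hypothetical, and this uses Python's `cp932` codec as a stand-in for the conversion; it does nothing about the context-sensitive cases mentioned above.

```python
def upper_via_unicode(data, encoding="cp932"):
    """Decode to Unicode, uppercase the characters, and re-encode."""
    text = data.decode(encoding)
    return text.upper().encode(encoding)


sample = b"a\xe8\x61"  # "a" followed by the two-byte character 0xE8 0x61
result = upper_via_unicode(sample)
print(result.hex(" "))
```

Because the two-byte character is a single code point with no uppercase form, its bytes come back unchanged and only the Roman letter is affected.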

Comments (8)
  1. Timwi says:

    "in Hungarian, "SC" and "Sc" are not the same thing when compared case-insensitively."

    You probably mean "SZ"/"Sz", or "CS"/"Cs", or any other Hungarian digraph, but "Sc" is not one of them :-)

  2. James Curran says:

    I’m curious…. Where in IE does it capitalize names?

  3. Joe Beda says:

    Wow Raymond,
    You are really digging in the archives. As for where IE does capitalization, I believe it was something to do with storing cookies or the cache, if I remember correctly.

    Joe

  4. Raymond Chen says:

    Joe is correct. The case conversion was done as part of autogenerating the filename for cookie storage. So if your name was Yamada and you went to your %userprofile%\Cookies directory and did a "dir" you saw "dead body@msn.com", "dead body@yahoo.com", "@dead body@msnbc.com", etc.

    What makes the problem worse is that the error is in filenames. So fixing the bug means that everybody whose name contains CJK characters (and that’s an awful lot of people) will lose their cookies on upgrade. So fixing the bug introduces data loss. (And you can’t auto-upgrade the cookies since you don’t know which letters to "uncapitalize" and which to leave alone.)

  5. Mike Dunn says:

    What can really give you a headache is that DBCS languages have double-byte versions of the Roman letters, and files with those versions in different case can coexist.
    Using the notation <X> to mean "double-byte X" it’s perfectly legal to have <a>.txt and <A>.txt together in the same directory.

    I’m sure we’ve all at some point upper- or lower-cased filenames for comparison, since the file system is case-insensitive and all. However, if you blindly lowercase everything, you might end up operating on <a>.txt when you meant to operate on <A>.txt. D’oh. So you have to be sure you only change the case of single-byte letters.

  6. Matthew says:

    Raymond,

    I assume by:

    I’ve already discussed some of the strange consequences of case-sensitive comparisons.

    you actually meant:

    I’ve already discussed some of the strange consequences of case-INsensitive comparisons.

    Since that seems to be what the linked article was about.

    Seeya

  7. Jim Adams says:

    I am getting a weird error when reading the RSS feed for this page. The rendered page contains a Unicode sequence (&#12398;) for the Japanese characters. The RSS feed does not. Nor are they valid UTF-8 characters! (0xE3 0x81 0xAE). I haven’t figured out how to make my XML parser eat these so your blog has disappeared from my sites for a while. Any way to fix this? Is it a bug in BlogX?

  8. Raymond Chen says:

    You’re right, it looks like a bug in BlogX. The actual blog entry uses &#12345; entities, but it looks like somebody was a bit too "helpful" when it came to spitting out the RSS. Sorry.
