Unicode collation is hard


The principle of "garbage in, garbage out" applies to Unicode collation. If you hand it a meaningless string and ask it to compare that string to another meaningless string, you get meaningless results.

I am not a Unicode expert; I just play one on the web. A real Unicode expert is Michael Kaplan, whose explanation of how comparing invalid Unicode strings produces nonsensical results I strongly recommend to anyone who tries to generate random test strings in Unicode.
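
For those generating random test strings, here is a minimal Python sketch that at least avoids code points with no definition at all. The helper names (`is_assigned`, `random_test_string`, `is_well_defined`) and the choice of excluded categories are assumptions of this sketch. It relies on the Unicode database bundled with the Python interpreter, which need not match the tables a given version of Windows uses, so it says nothing authoritative about what CompareStringW accepts.

```python
import random
import unicodedata

# Categories treated as unusable for test data (an assumption of this sketch):
# Cn = unassigned or noncharacter, Cs = surrogate, Co = private use.
EXCLUDED_CATEGORIES = {"Cn", "Cs", "Co"}


def is_assigned(ch: str) -> bool:
    """Return True if Python's Unicode database gives ch a usable category."""
    return unicodedata.category(ch) not in EXCLUDED_CATEGORIES


def random_test_string(length: int, rng: random.Random) -> str:
    """Build a string of `length` randomly chosen assigned code points."""
    chars = []
    while len(chars) < length:
        ch = chr(rng.randrange(0x110000))
        if is_assigned(ch):  # rejection sampling: skip excluded code points
            chars.append(ch)
    return "".join(chars)


def is_well_defined(text: str) -> bool:
    """Check that every code point in text is assigned in this database."""
    return all(is_assigned(ch) for ch in text)
```

Note that this only filters out undefined code points. A random sequence of assigned characters can still be linguistically meaningless, so sensible collation results are not guaranteed even for strings that pass the check.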

Comments (8)
  1. Ben Hutchings says:

    This seems like a security risk to me. There is a potential denial of service if someone can pass Unicode strings with unknown characters in them to a program that attempts to sort them using CompareStringW. Surely it should return 0 if the strings aren’t comparable?

    Ah, I see that there is a security alert in MSDN now. Unfortunately I don’t see an explanation of how to test whether a string is valid for use with CompareStringW.
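
One way to probe the sorting concern in a test harness is to check whether a comparison function behaves consistently over a sample of strings. The following Python sketch is illustrative only: `find_inconsistency` is a hypothetical helper, `compare` stands in for whatever comparison function is being tested, and whether any real collation function actually fails these checks is an assumption, not something established here.

```python
from itertools import combinations, permutations


def sign(n: int) -> int:
    """Reduce a comparison result to -1, 0, or 1."""
    return (n > 0) - (n < 0)


def find_inconsistency(items, compare):
    """Return a tuple of items that breaks ordering rules, or None.

    A 2-tuple means compare(a, b) and compare(b, a) disagree (antisymmetry).
    A 3-tuple means a < b and b < c but not a < c (transitivity).
    """
    for a, b in combinations(items, 2):
        if sign(compare(a, b)) != -sign(compare(b, a)):
            return (a, b)
    for a, b, c in permutations(items, 3):
        if compare(a, b) < 0 and compare(b, c) < 0 and not compare(a, c) < 0:
            return (a, b, c)
    return None
```

The check is cubic in the number of items, so it suits small test samples rather than production input validation.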

  2. Ed says:

    This and the previous thread about Unicode digits lead me to the opinion that a program can’t reliably do *anything* with a Unicode string other than regurgitate it for display to a human. There are just too many oddities in written human communication for a computer to handle them in a systematic way. If anybody agrees with me, then what we need is yet another encoding that does not try to represent all possible written human communication (as Unicode attempts), but is restricted to a consistent and manageable set intended for processing by computers. Does anything fit the bill?

  3. The amazing Peter Torr, who knows that I’m Malaysian, pointed me to the comments here, so I guess I should say something ;-)

    >>>If Malayalam means a Malaysian alphabet, then one sure would expect the characters to be used together. Malay used to be written using Arabic characters, now it’s written using Italian characters, and I don’t know if there used to be other possibilities

    Malayalam, AFAIK, is an Indian language spoken in Kerala in South India. It is also spoken in Malaysia by Malaysian Indians, as Malaysia is a multi-ethnic, multicultural country. But its origin is in India, so it would be in an Indian script and not a Malaysian alphabet.

    Yes, Malay used to be written using Arabic characters. (I had to learn Arabic in primary school.) The Malay language, which is very close to the Indonesian language, is now Romanized. E.g.:

    http://www.bharian.com.my/m/BHarian/Wednesday/Mukadepan/20040414053103/Article/

  4. Norman Diamond says:

    4/13/2004 11:35 AM Ed:

    > leads me to the opinion that a program can’t
    > reliably do *anything* with a Unicode string
    > other than regurgitate it for display to a
    > human.

    Sometimes even that is optimistic. For example, display to a human the Greek characters capital sigma and capital sigma, and ask the human if they’re identical or not. Even if the human knows that Unicode defines two different code points, the human can’t guess if the two being displayed are the same code point or different code points.

    Or display two perfectly constructed Japanese strings and ask a human to say which is greater than which. (In other words, call qsort() where the comparison function calls out to a human.) The human still won’t have enough information. Different kinds of applications have different requirements for what order the strings should be in.

    Now regarding the cited Google posting by Mr. Kaplan:

    > U+30fe is a Katakana iteration mark that has
    > some special properties in regard to
    > collation that are going to give dumb results
    > when mixed with non-Kana strings.

    Not being a Unicode expert and too lazy to write a program to output that at the moment, I’m guessing what character U+30fe is. The kana iteration mark can follow either a hiragana or a katakana. The rest of the string does not have to be kana. If the function gives dumb results on account of the rest of the string being mixed then the function is broken. It should be enough for the repetition marker to represent repetition of a single kana. (Similarly, the Kanji repetition marker only has to follow a single Kanji. But I don’t know if it can be used with an arbitrary Chinese character or if it can only be used with characters that were copied from Chinese into Japanese.)

    >> A = 0D42 65F9
    >
    > A Malayalam character and a CJK ideograph:
    > two characters one would never really expect
    > to be together.

    If Malayalam means a Malaysian alphabet, then one sure would expect the characters to be used together. Malay used to be written using Arabic characters, now it’s written using Italian characters, and I don’t know if there used to be other possibilities. But Chinese names are customarily written using both Chinese and Italian characters. It wouldn’t seem surprising to see a name written in three character sets. Compare to Thailand, where you can look at the front door of a company and see its name written in Thai, Chinese, and Italian characters.
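
The guessing about which characters are under discussion can be replaced by a lookup. This short Python sketch prints the name and general category that the interpreter's bundled Unicode database records for the code points quoted above, with U+30FD added for comparison. The `describe` helper is hypothetical, and the database version depends on the Python build. In that database U+30FE is named KATAKANA VOICED ITERATION MARK (the unvoiced mark is U+30FD), and U+0D42 is MALAYALAM VOWEL SIGN UU.

```python
import unicodedata


def describe(code_points):
    """Return (U+XXXX label, character name, general category) per code point."""
    return [
        (
            f"U+{cp:04X}",
            unicodedata.name(chr(cp), "<unnamed>"),
            unicodedata.category(chr(cp)),
        )
        for cp in code_points
    ]


# Code points quoted in this thread, plus U+30FD for comparison.
for row in describe([0x30FD, 0x30FE, 0x0D42, 0x65F9]):
    print(*row)
```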

  5. Peter Lund says:

    Do they use /Italian/ characters? Far out, Norman!

    I am happy to say that at least in Denmark we don’t use Italian characters, man that would be so confusing if we did.

  6. Norman Diamond says:

    In some countries I’ve encountered people who think the Roman alphabet is English. Calling the characters Roman or Latin doesn’t get the message across. So, even though Rome was Rome at the time of developing those characters, I label them by the present-day country where Rome is located. This also provides some parallel to Greek, Chinese, Japanese[*], Thai, and Korean characters (though not to Cyrillic, Arabic, Hebrew, and others).

    [* Though Japanese isn’t exactly parallel either. For example the Chinese-style character for a share of stock, which was invented in Japan, I’m told isn’t used in China, but in Japan it’s called a Chinese character not a Japanese character. Kana are called kana (hiragana or katakana) but it’s reasonable to call them Japanese characters.]

  7. Norman Diamond says:

    Sorry for two in a row, but I wish to amend my earlier followup (4/13/2004 7:50 PM).

    > The kana iteration mark can follow either a
    > hiragana or a katakana. The rest of the
    > string does not have to be kana.

    When the kana iteration mark is being used as a kana iteration mark, it can follow either a hiragana or a katakana. It was out of character for me to forget the other case, where the kana iteration mark is being mentioned rather than used. For comparison consider these sentences:

    1. A "quoted phrase" uses quotation marks.

    2. A double quotation mark looks like ".

    Case 2 should not be allowed in a C program even between #if 0 and #endif, but it should be allowed in a document.

    The server used by Mr. Chen will screw up the following example, but I have nothing better to offer.

    3. The kana iteration mark looks like ?.

    The kana iteration mark does not follow a hiragana or katakana but the sentence is meaningful. Should it be possible to use a word processing program to produce a textbook on Japanese grammar? Should it be possible to pass a string to a Unicode string handling function and expect results which, while not particularly meaningful, obey the rules for a Unicode string handling function? I vote yes.

  8. Raymond Chen says:

    Commenting on this entry has been closed.
