Case mapping on Unicode is hard


Occasionally, I'm asked, "I have to identify strings that are identical, case-insensitively.  How do I do it?"

The answer is, "Well, it depends. Whose case-mapping rules do you want to use?"

Sometimes the reply is, "I want this to be language-independent."

Now you have a real problem.

Every locale has its own case-mapping rules. Many of them are in conflict with the rules for other locales. For example, which of the following pairs of words compare case-insensitive equal?

1. gif GIF
2. Maße MASSE
3. Maße Masse
4. même MEME

Answers:

  1. no in Turkey, yes in US
  2. no in US, yes in Germany
  3. no in US, no in Germany, yes in Switzerland! (Though you would likely never see it written as "Maße" in Switzerland.)
  4. yes in France, no in Quebec!

(And I've heard that the capitalization rules for German are context-sensitive. Maybe that changed with the most recent spelling reform.) Unicode Technical Report #21 has more examples.
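To make the conflict concrete, here is a rough Python sketch. It assumes that "language-independent" means the default, untailored Unicode case folding, which is what Python's str.casefold() applies. The helper names default_caseless_equal and turkish_upper are made up for illustration, and turkish_upper is only a crude approximation of the Turkish dotted/dotless i rule, not a full locale implementation.

```python
def default_caseless_equal(a: str, b: str) -> bool:
    """Compare two strings using Unicode default (untailored) case folding."""
    return a.casefold() == b.casefold()


def turkish_upper(s: str) -> str:
    """Hypothetical helper: uppercase with a rough Turkish tailoring.

    Dotted i maps to U+0130 (capital I with dot above) and dotless
    U+0131 maps to plain I before the default mapping handles the rest.
    """
    return s.replace("i", "\u0130").replace("\u0131", "I").upper()


# The four pairs from the list above.
pairs = [
    ("gif", "GIF"),
    ("Ma\u00dfe", "MASSE"),   # Maße / MASSE
    ("Ma\u00dfe", "Masse"),   # Maße / Masse
    ("m\u00eame", "MEME"),    # même / MEME
]

# Default folding answers every pair the same way regardless of locale.
results = [default_caseless_equal(a, b) for a, b in pairs]
print(results)

# Under the Turkish tailoring, "gif" no longer uppercases to "GIF".
print(turkish_upper("gif") == "GIF")
```

The default folding treats ß as ss and never strips accents, so it says yes to the first three pairs and no to the fourth. That matches no single locale in the list: a Turkish user would disagree with the first answer, which is what the second print shows. Picking the default rules is still picking a set of rules.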

Just because you're using Unicode doesn't mean that all your language problems are solved. Indeed, the ability to represent characters in nearly all of the world's languages means that you have more things to worry about, not fewer.

Comments (2)
  1. David says:

    Raymond, I am no expert in Unicode, but from a seminar I attended a few years ago and from http://www.unicode.org/reports/tr10/#French_Accents it would appear that your example 4 is incorrect (ie no in France). This is an oddity of the French language that I do not remember having learnt at school in France, but discovered while living in England!

  2. Raymond Chen says:

    It is my recollection that the French rule for capitalization is that when an accented character is converted to uppercase, it loses its accent mark. Therefore, a capital E compares equal to any accented lowercase e, because there is no such thing as an accented capital E. (However, an unaccented e does not compare equal to an accented e.) French Canadian, however, preserves the accent mark on capitalization. That was the rule I was trying to exhibit (and failed).

