What's wrong with this code, Part 3: The answers

In yesterday’s post, there was one huge, glaring issue:  It completely ignored internationalization (or i18n in “internet-lingo”).

The first problem occurs in the very first line of the routine:

                  if (string1.Length != string2.Length)

The problem is that two strings can be equal, even if their lengths aren’t equal!   My favorite example of this is the German sharp-s character, which is used in the German word straße.  The length of straße is 6 characters, and the length of the upper case form of the word (STRASSE) is 7 characters.  These strings are considered as equal in German, even though the lengths don’t match.

The next problem occurs 2 lines later:

                  string upperString1 = string1.ToUpper();

                  string upperString2 = string2.ToUpper();

The call to ToUpper() doesn’t specify a culture.  Without a specific culture, String.ToUpper uses the current culture, which may or may not be the culture that created the string.  This can create some truly unexpected results, like the way that Turkish treats the letter “I” as Tim Sneath pointed out in this article. 

Even if you call String.ToUpper() BEFORE making the length check, it’s still not correct.  If you call String.ToUpper() on an sharp-s character, you get a sharp-s character.  So the lengths of the string don’t change when you upper case them.

The good news is that the .Net framework specifies a culture neutral language, System.Globalization.CultureInfo.InvariantCulture.  The invariant culture is similar to the English language, but it’s not associated with any region, so the rules are consistent regardless of the current UI culture.  This allows you to avoid the security bug that Tim pointed out in his article, and allows you to have deterministic results.  This is particularly important if you’re comparing strings against constants (like if you’re checking arguments to a function).  If you call String.Compare(userString, “file:”, true) without specifying the locale, then as Tim pointed out, if the userString contains one of the Turkish ‘I’ characters, you won’t match correctly.  If you use the invariant culture you will.

As an interesting piece of trivia that I learned while writing the test app for this ‘blog entry, if you use System.String.Compare(“straße”, “STRASSE”, true, System.Globalization.CultureInfo.InvariantCulture), it returns TRUE J.

Btw, in case it wasn’t obvious, the exact same issue exists in unmanaged code (I chose managed code for the example because it was simpler).  If you use lstrcmpi to compare two strings, you’ll have exactly the same internationalization problem.  The lstrcmpi routine will compare strings in the system locale, which is not likely to be the locale you want.  To get the invariant locale, you want to use CompareStringW specifying the LOCALE_INVARIANT locale and NORM_IGNORECASE.  Please note that LOCALE_INVARIANT isn’t described on the CompareString API page, it’s documented in the table of language identifiers or in the MAKESORTLCID macro documentation.

The bottom line: Internationalization is HARD.  You can follow some basic rules to make yourself safe, but… 

Now for kudos:

Francois’ answer was the most complete of the responses in the forum, he correctly pointed out that my routine compares only codepoints (the individual characters) and not the string.   He also picked up on the culture issue.

Mike Dunn’s answer was the first to pick up on the first error, he correctly realized that you can’t do the early test to compare lengths.

Carlos pointed out the issue of characters that form ligatures, like the Unicode Combining Diacritical characters (props to charmap for giving me the big words that people use to describe things).  The sharp-s character is another example.  In English, the ligatures used for fi, fl, ff, and ffi are also another example of combining characters (Unicode doesn’t appear to have defined these ligatures btw; they appear in the private use area of some system fonts however).

Anon pointed out the Turkish i/I issue.

Sebastien Lambla pointed out the issue of invariant cultures explicitly.

Non issues:

Several readers pointed out that I don’t check parameters for null.  That was intentional (ok, not really, but I don’t think it’s wrong).  In my mind, passing null strings from the caller is an error on the part of the caller, and the only reasonable answer is for the routine to throw an exception.  Checking the parameters for null allows me to throw a different exception.  It’s also that the routine could be defined that a non null string compared with a null string was inequal, but that’s not what I chose (this is fundamentally stricmp()).

Also, some people believed that the string comparison routine should ignore whitespace, I’m not sure where that came from J.

Edit: Fixed stupid typo in the length of german strings.  I can't count :)

Edit2: tis->this.  It's not a good day :)