How come Substring(0, xxx) matches something, but StartsWith returns false?

09/23/2008

I was asked how the start of a string can equal a search string when compared via Substring, yet StartsWith returns false for that same search string. For example:

string str = "Mu\u0308nchen";
string find = "Mu";
Console.WriteLine("Substring: " + (str.Substring(0,2) == find));
Console.WriteLine("StartsWith:" + str.StartsWith(find));
Console.WriteLine("IndexOf: " + str.IndexOf(find));

returns this:

Substring: True
StartsWith:False
IndexOf: -1

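So if you compare the first 2 characters against the search string, you'll see that they match, yet StartsWith() returns false and IndexOf() can't find it. The == operator compares the strings ordinally, code unit by code unit, whereas StartsWith(string) and IndexOf(string) perform culture-sensitive linguistic comparisons. In a linguistic comparison the combining diaeresis U+0308 is considered part of the u that it modifies, so "Mu" won't be found at the start of the string. In many languages letters with diacritics like this are really different letters. Since you wouldn't expect a == z, you shouldn't expect u == ü either.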
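One way to see where the two results diverge is to make the comparison type explicit. The following is a minimal sketch, assuming the StringComparison overloads of StartsWith and IndexOf; the names ComparisonDemo, OrdinalStartsWith, and OrdinalIndexOf are hypothetical helpers and not part of the sample above. The ordinal calls compare code units, as == does, so they agree with the Substring test. The culture-sensitive results depend on the collation data of the runtime, so no output is claimed for them here.

```csharp
using System;

static class ComparisonDemo
{
    // Ordinal prefix test: compares UTF-16 code units, as == does.
    public static bool OrdinalStartsWith(string text, string prefix)
    {
        return text.StartsWith(prefix, StringComparison.Ordinal);
    }

    // Ordinal search: finds the code-unit sequence even when the match
    // ends in the middle of a combining sequence.
    public static int OrdinalIndexOf(string text, string value)
    {
        return text.IndexOf(value, StringComparison.Ordinal);
    }

    static void Main()
    {
        string str = "Mu\u0308nchen";
        string find = "Mu";

        Console.WriteLine("Ordinal StartsWith: " + OrdinalStartsWith(str, find));
        Console.WriteLine("Ordinal IndexOf:    " + OrdinalIndexOf(str, find));

        // Culture-sensitive (linguistic) versions for comparison;
        // their results depend on the platform's collation data.
        Console.WriteLine("Culture StartsWith: " + str.StartsWith(find, StringComparison.CurrentCulture));
        Console.WriteLine("Culture IndexOf:    " + str.IndexOf(find, StringComparison.CurrentCulture));
    }
}
```

Choosing ordinal comparison is only appropriate when you really mean "the same code units"; for text shown to users, the linguistic answer is usually the one you want.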

Doing the substring effectively "breaks" the character, changing its meaning. Substring can even create illegal Unicode if it chops a surrogate pair (e.g., U+D800 U+DC00) in half and keeps only one of the two code units.
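If the goal is to take a prefix without breaking a character, one option is to cut on text-element boundaries rather than at a raw index. This is a minimal sketch, assuming System.Globalization.StringInfo and its TextElementEnumerator; PrefixDemo and SafePrefix are hypothetical names introduced for illustration.

```csharp
using System;
using System.Globalization;
using System.Text;

static class PrefixDemo
{
    // Returns the first 'count' text elements of 'text' (hypothetical helper).
    // A text element keeps a base character together with its combining
    // marks, and keeps both halves of a surrogate pair together.
    public static string SafePrefix(string text, int count)
    {
        var result = new StringBuilder();
        TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(text);
        while (count > 0 && elements.MoveNext())
        {
            result.Append(elements.GetTextElement());
            count--;
        }
        return result.ToString();
    }

    static void Main()
    {
        string str = "Mu\u0308nchen";
        // Two text elements, "M" and "u" + U+0308, which is three chars.
        Console.WriteLine(SafePrefix(str, 2).Length);

        string pair = "\uD800\uDC00x";
        // Substring(0, 1) keeps only the high surrogate.
        Console.WriteLine(char.IsHighSurrogate(pair.Substring(0, 1)[0]));
        // SafePrefix keeps the whole pair, which is two chars.
        Console.WriteLine(SafePrefix(pair, 1).Length);
    }
}
```

Note that the count here is in text elements, not chars, so the returned string can be longer than the number requested.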

A similar oddity would be characters with no weight like U+FFFD. So if I have str = "A\xFFFD\xFFFD\xFFFD", then str.Substring(0,1), str.Substring(0,2), str.Substring(0,3) and str.Substring(0,4) all compare as linguistically equal to each other and to "A". And in this case str.StartsWith("A") would be true.

Another perhaps unexpected behavior involves unweighted characters (or characters ignored because of a flag) at the beginning of the string. So if str = "\xFFFD" + "A", then str.IndexOf("A") can return 1, yet str.StartsWith("A") will return true (even though IndexOf didn't return 0).
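The mismatch described above can be probed with a small check that asks whether StartsWith and IndexOf agree for a given comparison type. This is only a sketch: IgnorableDemo and PrefixAgreesWithIndex are hypothetical names, and the culture-sensitive result depends on which characters the platform's collation data treats as ignorable, so no output is claimed for it. With StringComparison.Ordinal the two calls always agree, because U+FFFD is then compared as an ordinary code unit.

```csharp
using System;

static class IgnorableDemo
{
    // True when "StartsWith(value)" and "IndexOf(value) == 0" give the
    // same answer under the given comparison type (hypothetical helper).
    public static bool PrefixAgreesWithIndex(string text, string value, StringComparison comparison)
    {
        bool startsWith = text.StartsWith(value, comparison);
        bool foundAtZero = text.IndexOf(value, comparison) == 0;
        return startsWith == foundAtZero;
    }

    static void Main()
    {
        string str = "\uFFFD" + "A";

        // Ordinal: StartsWith is false and IndexOf is 1, so the two agree.
        Console.WriteLine("Ordinal agrees: " +
            PrefixAgreesWithIndex(str, "A", StringComparison.Ordinal));

        // Culture-sensitive: may disagree where U+FFFD is treated as ignorable.
        Console.WriteLine("Culture agrees: " +
            PrefixAgreesWithIndex(str, "A", StringComparison.CurrentCulture));
    }
}
```

Code that uses IndexOf(...) == 0 as a stand-in for StartsWith (or the reverse) is where this kind of disagreement tends to surface.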

Similar behaviors can be seen with LastIndexOf() and EndsWith(), and with the native Vista API FindNLSString and its variations. In addition, with the FindNLSString() API, the substring it reports as found may not be the one you expect.
