What’s wrong with this code, part 3


This time, let’s consider the following routine used to determine if two strings are equal (case insensitively).  The code’s written in C# if it’s not obvious.

            static bool CompareStrings(String string1, String string2)
            {
                  //
                  //    Quick check to see if the strings length is different.  If the length is different, they are different.
                  //
                  if (string1.Length != string2.Length)
                  {
                        return false;
                  }

                  //
                  //    Since we’re going to be doing a case insensitive comparison, let’s upper case the strings.
                  //
                  string upperString1 = string1.ToUpper();
                  string upperString2 = string2.ToUpper();
                  //
                  //    And now walk through the strings comparing the characters to see if they match.
                  //
                  for (int i = 0 ; i < string1.Length ; i += 1)
                  {
                        if (upperString1[i] != upperString2[i])
                        {
                              return false;
                        }
                  }
                  return true;
            }

Yes, the code is less efficient than it could be, but there’s a far more fundamental issue with the code.  Your challenge is to determine what is incorrect about the code.

Answers (and of course kudos to those who found the issues) tomorrow.

 

Comments (41)

  1. Steve says:

    Well, first things first… You’re not checking to see if the two passed in strings are null.

  2. Martijn says:

    I don’t know if this is what you mean, but one or both of the objects coming in could be null, causing a NullReferenceException. a better way would be checking for null and either throwing an ArgumentException, although throwing exceptions from this type of code is not recommended from Microsoft practice. We could also choose to return false, but that would cause false negatives, which would impact our business code.

  3. Jason S says:

    Well, for starters, the code assumes that a character by character comparison is valid. I’m no globalization expert, but I don’t think this applies to certain languages (I’m thinking Asian languages like Chinese). You’d want to use the StringInfo class to do culture-aware string breaking of each string, then compare the results.

    For that matter, does ToUpper() make sense for these languages? I imagine there is no upper case equivalent to a Chinese character. ๐Ÿ™‚

    Jason

  4. Mike Dunn says:

    Changing the case of a string is always fishy when it involves a language without the concept of upper/lower case.

    It’s also suspicious that the for loop checks string1.Length instead of upperString1.Length.

    It is possible that string1.Length != upperString1.Length or string2.Length != upperString2.Length? (I don’t program in .Net languages so the inner workings of .Net strings are a mystery to me, I’m just going on my knowledge of C-style strings and all the "fun" things you run into with localization.)

  5. Steve/Martijn: You’re right, I didn’t think about null parameters. My personal thinking is that just letting the null reference exception be thrown isn’t that bad an idea, since it’s a programming error on the caller.

    Jason: For languages like Chinese that don’t contain cases, it’s my understanding that ToUpper works like "123".ToUpper() – in other words it doesn’t change the string.

    Mike (and Jason): You nailed about 3/4ths of the answer: You can’t rely on length comparisons when comparing strings case insensitively.

    There’s still another huge issue in the code, but you’re on the right track.

  6. Prasanna says:

    If upperString2.Length < upperString1.Length, then upperString2[i] inside the for loop will blow up. But something tells me that this is not what you are looking for ๐Ÿ™‚

  7. Brian says:

    I’m guessing trailing spaces don’t matter.

    If (string1.Trim.Length != string2.Trim.Length)

    ….

    incase you compare: "asdf" and "asdf ",

    but then again, I guess those strings are not equal.

  8. carlos says:

    In Unicode, you can represent an "e" with an acute accent as either a single character or as two characters, the "e" and the acute accent combining-character. It doesn’t cope with this kind of thing.

  9. Greg Gallant says:

    Those strings need to be trimmed, getting rid of any excess whitespace, control characters, etc. How reliable is toUpper when dealing with special characters? Why use the original string1 variable as the for loop control when you are clearly validating the UpperStrings?

    And lastly, if doing a comparison, I would think one should never alter the original variable storing the data which is being compared. It almost defeats the purpose.

  10. Prasanna: Right, because the code has an implicit assumption that string1.Length == string1.ToUpper().Length.

    But that’s not the remaining issue.

  11. Daniel Jin says:

    absolutely no clue, but I guess when the upper strings aren’t the same length, you might get IndexOutOfRangeException.

  12. Greg: Why do the strings need to be trimmed? I never said that the check was for tokens.

    In this routine, "asdf" != "asdf ".

    carlos: Yup, thats the essence if Mike Dunn’s comment.

    There’s still another issue though (although the suggestions above are on the right track).

  13. Francois says:

    Actually, this code is wrong on more than one count…

    For starters, checking the input values for null for the two strings is required! Passing a null value as one of the strings to the current implementation generates an exception.

    Besides, this implementation issue, there is a more fundemental flaw to this function: this function does not compare strings but only code points.

    The .NET Framework uses Unicode for string handling and one of the properties of Unicode is that some characters can be represented in different ways (e.g. é (e acute) can be represented as a single 16-bit code point in UTF-16 or as two code points (e + an acute combining character which is implemented as eu0301.

    Checking the length of the strings and using a naive implementation as above will actually return false when comparing the following two strings "abcé" and "abceu301" which is plain wrong when you are dealing with strings as these are two equally valid representations of the same string. In comparison, the implementation of String.Compare in the .NET FX handles this case very nicely and returns the appropriate result.

    Therefore the function should be renamed to CompareCodePoints or CompareOrdinal or something similar as it is really what it does.

    However, even in this case, there is still another (albeit minor in this case as the two uppercasing transforms are done with the same culture) issue. It should really use a culture invariant uppercasing (pass CultureInfo.InvariantCulture as a param to the ToUpper method.

  14. S N says:

    I think I read some where in Raymond Chen’s blog that there is wide byte UNICODE characters. That is a language alphabet (letter) takes more than one string characters.

    If the above one is true, then there may be possibility that a letter can be represented in more than one way without involving lowercase/uppercase.

  15. Terry Denham says:

    I’d say that it may be possible that upperString1.Length != upperString2.Length and upperString1.Length != string1.Length that you can get Index out of bounds exceptions in the code and you don’t have try catch logic in the code to handle this possibility.

  16. Francois, I disagree on the checking for null. There’s no meaningful error that can be returned if the strings are null, the only thing that could be done is to explicitly throw an exception as opposed to letting the System.NullReferenceException be thrown.

    The key issue is your observation that it’s comparing code points, and the minor issue mentioned above touches on what I think is the key remaining part of the solution: What culture is string1 and string2 in?

    Terry: That’s part of Mike Dunn’s comment at the beginning: The code makes a fundimental assumption that string1.Length == string1.ToUpper().Length, which, as has been pointed out is not necessarily a correct assumption.

  17. I’ve tested Francois’s comment and indeed he’s right. eยด is not the same as é in this case..

    However I’m not able to find any cases of the string1.ToUpper().Length not being equal to the string1.Length. Can someone with more knowledge of the unicode tables verify?

  18. anon says:

    Of course, this would not pass the Turkish i/I test.

  19. ATZ Man says:

    If you have the e-acuteaccent 2-character sequence in string one and the ewithacuteaccent single character sequence in string two, do they ToUpper to the same string or do they ToUpper to different strings?

    Trivially, ISTR that when some cultures uppercase the diacriticals go away. This may or may not be just a computer subculture of some real culture. This would cause a ToUpper()-ing case-insensitive compare to behave differently from a ToLower()-ing comparison.

    Some Romance languages can have two diacritical marks on one letter so the potential offset could be more than just one per character.

  20. denis says:

    I guess using CompareInfo.Compare with CompareOptions.IgnoreCase should solve the problem.

  21. Scott says:

    Wel if string1.Length < string2.Length but the characters are the same it still returns true.

    in other words if string1= "ant" and string2="ants", the routine thinks they are equal.

  22. Scott, how does that work? The initial check is for string1.Length != string2.Length, not string1.Length <= string2.Length.

  23. Jeff says:

    I notice that the arguments to the function are "String" type, but upperString1 and upperString2 are "string" type.

    Is the typecasting issue what you’re looking for, Larry?

  24. josh says:

    Well I don’t use C#, but… it is case sensitive, right? What’s with "String" and "string"?

  25. Jeff: Nope, sorry – in C#, string is an alias for System.String, just like int is an alias for System.Int32 and long is an alias for System.Int64.

    The posters above have gotten all of what I was trying to point out. IMHO, Francois did the best job of explaining the real issue (that you can’t compare strings by comparing the characters in the string).

  26. josh says:

    *shakes fist at Jeff*

  27. forge says:

    string upperString1 = string1.ToUpper();

    string upperString2 = string2.ToUpper();

    You are creating an uppercase copies of those strings. I think it’s gonna be a performance issue

  28. Forge, you’re 100% right, that’s why I indicated to ignore performance.

    If I cared about performance, I’d have used something like String.Compare(string1, string2, true);

    Of course that code would suffer many of the same flaws that the original had but it would be far less obvious where the failure was.

  29. Jay says:

    Another minor thought. wouldn’t the ToUpper() calls potentially throw OutOfMemory exceptions? Someone calling a CompareStrings() function wouldn’t expect something like that to be thrown.

  30. Jay, an assignment can throw an OutOfMemory exception in the .Net framework.

    That’s one of the reasons why exceptions are so evil.

  31. Ben Lowery says:

    Assignments can throw? The right-hand side of the assignment that’s actually doing the work can throw, but I don’t think the actual assignment could throw. Subtle difference, but an important one I think. Unless I’m dead wrong. ๐Ÿ™‚ So, in code:

    object o; // cannot throw

    o = new object(); // new object() could throw

    object p; // cannot throw

    p = o; // cannot throw

    Is that wrong?

  32. That depends on what happens in object->operator object()

    If object->operator object() returns a reference to the original object it won’t throw.

    If it returns a new object, then it might.

  33. Paul Holden says:

    Hmmmm, might it generate a false positive if when converted to uppercase two characters match, but then converted to lowercase they don’t?

    Example code I’ve seen before for doing case-insensitive string compare does something like this:

    string testString1 = string1.ToUpper().ToLower();

    string testString2 = string2.ToUpper().ToLower();

    but I’ve never known enough about various character sets to know why!

  34. Paul Holden says:

    Ah – there’s a bit more on this here:

    http://www.lafstern.org/matt/col2_new.pdf

    Quote:

    "As an aside, converting both characters to uppercase might not always give the same

    results as converting both characters to lowercase: There’s no guarantee that the operations are

    inverses."

    Kind of what I was trying to say, but somewhat more eloquently ๐Ÿ™‚

  35. SeanH says:

    char[] upperString1 = string1.ToUpper().ToCharArray();

    char[] upperString2 = string2.ToUpper().ToCharArray();

    for ( int i = 0; upperString1.Length; ++i )

    {



    }

  36. SeanH, how does your version avoid the internationalization issues mentioned above?

  37. Lee Alexander says:

    Just a comment on nullability of parameters…I believe it does help to check for nulls at the entry point of the method as it conveys the method creators expectations explicitly. Also I would rather see an exception thrown at the point of entry into a method rather than somewhere buried in the code where it may take longer to figure out what it going on. Is this a bug or did the programmer mean the method never to take nulls. That’s why in C++ I try to use references as inputs where I don’t expect nulls and pointers where I do.

    Regards

    Lee

  38. SeanH says:

    I was guessing that String.ToCharArray() might return characters that the String indexer doesn’t.

    Looking at the Rotor source however reveals that even if COMString::GetCharAt ( String.this[] ) and COMString::GetPreallocatedArray ( String.ToCharArray() ) have different return types (StringObject* and Array<I2> respectively) they might have the same values anyway.

    Oh well ๐Ÿ™‚

  39. Sorrty I didn’t have the time to read through all the comments but:

    ToUpper is not culture invariant, so your code will break painfully in the case of turkish machines for example, that don’t capitalize letters in the same way as occidential do. What you really want is :

    string.Compare(stringA, stringB, false, CultureInfo.InvariantCulture) == 0

    this does a case insensitive comparison based on the invariant culture.

    I learnt this when one of my early projects (a monad like console but based on xml and an extension to the current batch language) broke down completely on a turkish machine…

  40. ilm says:

    Ilya: The German "ß" (called "sharp s" over here) doesn’t have an uppercase equivalent and is expanded to "SS".

  41. Scott says:

    whoops, my fault. Didn’t read it close enough.