Spot the defect: Bad comparisons, part three


Did you notice how last time my length comparison on strings was unnecessarily verbose? I could have written it like this:

static int ByLength(string x, string y)
{
  if (x == null && y == null) return 0:
  if (x == null) return -1;
  if (y == null) return 1;
  return CompareInts(x.Length, y.Length);
}

static int CompareInts(int x, int y)
{
  // Positive if x is larger, negative if y is larger, zero if equal
  return x – y;
}

static Comparison<T> ThenBy<T>(this Comparison<T> firstBy, Comparison<T> thenBy)
{
  return (x,y)=>
  {
    int result = firstBy(x, y);
    return result != 0 ? result : thenBy(x, y);
  }
}

Much nicer! My string length comparison method is greatly simplified, I can compose comparisons easily with the ThenBy extension method, and I can reuse CompareInts as a helper function in other comparisons I might want to write.

What’s the defect now?

.

.

.

.

.

.

This one should be easy after last time. The lengths of strings are always positive integers and in practice they never go above a few hundred million; the CLR does not allow you to allocate truly enormous multi-gigabyte strings. But though CompareInts is safe for inputs which are string lengths, it is not safe in general. In particular, for the inputs Int32.MinValue and Int32.MaxValue, the difference is 1. Clearly the smallest possible integer is smaller than the largest possible integer, but this method gives the opposite result! CompareInts should read:

static int CompareInts(int x, int y)
{
  if (x > y) return 1;
  if (x < y) return -1;
  return 0;
}

The moral of the story here is that a comparison function that doesn’t compare something is probably wrong. Subtraction is not comparison.


Comments (22)

  1. Brian says:

    Sometimes I think that instead of returning an int, Compare functions should return an enumeration:

       enum CompareType //Built in.  With less sucky naming.

       {

           LeftLarger = 1,

           Equal = 0,

           LeftSmaller = -1

       }

       static CompareType CompareInts(int x, int y)

       {

           if (x > y) return CompareType.LeftLarger;

           if (x < y) return CompareType.LeftSmaller;

           return CompareType.Equal;

       }

    The meaning of returning an int to a comparison function strikes me as non-obvious, despite its popularity.

  2. Thomas Levesque says:

    I would have used x.CompareTo(y)… It also shows the intent much better than x – y

  3. John Kerr says:

    But subtraction _is_ comparison, if you take the overflow and carry flags into account!

  4. Bas Mommenhof says:

    Brian, the value returned by the function could be used by the sorting function as an optimization hint.

    (From memory I read some documentation about this, I think it was for Delphi, too long ago)

    It goes something like this:

    The further away from zero the returned value is, the more 'unequal' both items are.

    An optimized sorting function could use this knowledge in an attempt to skip a few items.

    Diff :  A – B = 1

    Diff :  E – F = 1

    Diff :  A – E = 5

    Therefore comparing A to F could be skipped since:

    A and B are real close,

    E and F are real close

    A and E are really far appart

    therefore A and F could never be close.

  5. Austin says:

    I had the same thought as Thomas.  Why would you not use x.CompareTo(y)?

  6. MWF says:

    @Bas:

    But then your spec for a comparison function would have to say that the value is not only of a certain sign, but also proportional to the difference.

    The comparison functions most of us use are only spec'ed as returning positive, negative, or zero, with no importance given to the magnitude of the positive or negative return values. If your sorting function assumes something about the magnitude of those values without the spec requiring it, you are wandering into very dangerous territory.

  7. Anthony P says:

    @Austin, to steal something Eric has said before, if CompareTo did not exist, how would you write it? Because all of the implementations of CompareTo, somebody had to write those. And people are still writing comparison functions, and they need to write good ones. No doubt many if not all of the examples presented in this blog series are from bug reports, either internally at Microsoft or from customers.

  8. Sergey says:

    Substraction *is* a comparison for unsigned integers. For example, this is wide spread technique in embedded development.

  9. Eamon Nerbonne says:

    Hmm, I'm not sure how meaningful all this is anymore – if you're defining custom comparers, that's generally because you've a specific usage in mind: one in which you generally will not deal with all possible values the type system permits.  In fact, it's generally unavoidable to have errors for values that are legal by the type system but semantically illegal (though indeed a wrong value is generally a worse type of error than some form of exceptional termination).

    So, as I see it, these comparison functions aren't so much invalid, they simply have a smaller domain than the type system indicates.  The alternative – verifying that inputs *are* in range or using comparisons – is significantly slower; so it's not easy to dismiss out of hand.

    And in general, it's not feasible for functions to always verify their input anyhow – so the focus on *these* border cases is nice, but hardly critical.  Having undefined behavior when preconditions are violated is common in all kinds of software – probably almost all software.

    I'd call these spec-bugs: the functions are in practice perhaps even superior to the "correct" versions – but their preconditions should be documented.

  10. Brian says:

    @Bas @Mwf: The optimization hint scenario is feasible, though compare does not promise any meaning on the int, so it is only practical under two situations:

    1.  The algorithm (e.g., sort) will still run successfully (and within any guarantees of speed it makes) even if the distance from 0 is arbitrary.

    2.  Only specialized libraries should use this type of optimization, since people will probably not be expecting this kind of optimization, and it may slow things down.

    Anyhow, this kind of optimization strikes me as the kind of thing that should *NOT* be built into the framework.  The people who need specialized comparisons can do some combination of using their own comparison libraries and just casting the enumeration to an int.

  11. Ben Voigt [Visual C++ MVP] says:

    @Sergey: No, it isn't.  The range of the result of a subtraction of values constrained only by type is always twice the range of the type, even if the type is unsigned, so wraparound is still problematic.  Of course, as Eric alluded to with the size of strings argument, if the values are constrained to a subset of the values allowed by the type, it may work.

  12. Adam Milazzo says:

    I've seen the x – y and -cmp idioms so many times — even in prominent examples of "well-written" code — that I've never thought to question them. Needless to say, I've grepped all my source code to check my comparisons now. Found one bug (-cmp to reverse), and one x – y which was okay because of restricted range, but I've added a comment to it now.

    So thanks, Erik, for these posts! 🙂

    @Bas: That "optimization" sounds pretty ridiculous. 🙂  I don't think you can safely glean information from any general comparison function. (Maybe you can make some assumptions if you're only sorting integers.) From your example, it does seem clear that A need never be compared to F, but for a different reason:

    E – F = 1 (implies E > F)

    A – E = 1 (implies A > E)

    A > E > F… so A > F.

  13. Random832 says:

    The origin is probably strcmp – strcmp did this in the original implementation in unix, and this is what the "may return any appropriate-sign value" contract was invented for. It was safe then, since what was being subtracted was chars which are smaller than ints, but it got it into people's heads that this is for some unknown reason a terribly clever thing to do (it avoids one or two branches, which was relevant on the PDP-11 but not so much today, and nevermind that strcmp requires three branches _per character_ until the point they are different)

    @Bas Mommenhof: What kind of sorting algorithm could make use of this and be faster [including the extra work for the comparison function to decide if the inputs are "close" or not] than a standard sorting algorithm? Also, this is an extra contract on top of it, and doesn't fit with current usage [with traditional strcmp implementations, "aa" "ab" results in 1, "aaaa" "aaaz" results in 25.

  14. Jasper Stein says:

    I'd chalk this problem up not to (the use of) comparisons but to the implementation of numbers in programming languages like C#. Mathematically speaking it's rather strange to have several different, bounded, types to represent numbers – and these problems highlight the problems that come with them. After all, Int32.MinValue – Int32.MaxValue == 1, WTF??? Of course in the olden days it used to be the case that representation of numbers had to be constrained by a few bytes at most, because of the inherent speed and memory limitations we had back then. But I'd argue that the present use of Int32's (and other Int## types) are dinosaurs from those early days.

    Progress has been made to eradicate many of the other early unwieldy/'dangerous' language structures (however useful they were back then – e.g. use of pointers, use of 0/1 for true/false, etc.) and it is high time we took another good look at the representation of numbers. I'm pretty sure there are programming languages out there that only have one notion of number, and do everything with it. (Well, maybe two – one discrete, one continuous.) Int32's and its friends may be retained for backwards compatibility and/or speed for programs that require them, but I say let's have a new all-encompassing Number type.

  15. Bas Mommenhof says:

    @Everyone: I am not actually advocating this.

    Im only stating what I think I remember i have read a long time ago.

    Brain asked where the int was comming from, this was a possibility.

    I am not too happy about everyone disagreeing with it, because that could be an indication that my memory is failing.

    I must be getting too old. (I still remember floppy disks.)

  16. pete.d says:

    "I still remember floppy disks."

    Oh for crying out loud. I just _used_ a floppy disk a couple of days ago. If simply remembering floppies makes you old, what does that make me? 🙂 (Granted, it was the "modern" 3.5" style…I admit it's been a few years since I've used a "floppy" where the outside was as floppy as the inside 🙂 ).

    "…but I say let's have a new all-encompassing Number type"

    Good idea! Count me in.  ðŸ™‚  (Actually, don't some of the other "less-rigorous" languages have this sort of unbounded, arbitrarily precise numerical data types? You're always going to run into problems with precision for non-integral types, but I agree that otherwise we should not have to worry about overflow any more).

  17. Shuggy says:

    To those suggesting the use of arbitrary precision floating point as the default do let me know what should be supplied when asking for pi.

    I can see arguments for using bigint by default in a language, python does this very nicely I think, but the Real vs Rational issue will always be a problem

  18. pete.d says:

    "To those suggesting the use of arbitrary precision floating point as the default"

    Who is doing that? In fact, I specifically called out floating point formats (i.e. "non-integral types") as a case where it's not possible to have the data type arbitrarily unbounded.

    Methinks you are failing the "give the benefit of the doubt" test. :p

  19. alexpolt@yandex.ru says:

    Believe or not but I made an effort and read your entire blog.

    And I want to thank you very much for your precious posts!

    Please write a book. You have a gift for showing with crystal clearity

    how to go from description to implementation.

    By now I am completely bewildered.

    Okay, if I need to write Airport control software than C# on .net is perfectly fine.

    If I need to write a Plant control software than I guess it will pass also. Will it?

    Maybe for spacecraft it's going to be good too…

    So then what C and C++ are left for? Only OS components and real time games?

    Was all my investment in learning assembly a mistake?

    ohh God!

  20. Theo Yaung says:

    I think it's definitely a good point to point out the integer overflow potential in comparison functions. However, the mathematically-minded purist in me twitches when you say "Subtraction is not comparison". In arbitrary precision universes it is, especially if you use the System.Numerics.BigInteger types from the .NET framework, then use the .Sign property to translate it back into the 32-bit domain 🙂

    In my mind, a good software engineer has a leg in two domains – the theoretical computer science domain, as well as in the pragmatic engineering domain. This is a good reminder article for the engineering aspect 🙂

  21. Ferdinand Swaters says:

    Moral of the Story: Don't use an int as the return value of a function that is supposed to have 3 distinct different outcomes.

    This spec of the comapre function may have made sense in the 70's, when the MicroProcessor was the latest new thing, but one wonders how a thing like that could ever have ended up in .NET

  22. SWeko says:

    I agree with Ferdinand.

    The return value of the comparison always just smelled wrong to me, as a remnant of the C++ days.

    Agreed, it enables you to write "return x-y" code, but again, is a method that does not compare, a comparer?

    What baffles me, is that C# got all those thing right, why was this left shaky at best?