*Prerequisites: Experience with sort functions such as qsort.*

*Overqualifications: Familiarity with sort algorithms.*

If you break
the rules for sort comparison functions,
then don't expect the result of a sort call to be meaningful.
In fact, if you *really* mess it up,
you can corrupt memory or go into an infinite loop.

My program hangs when I used this sort comparison function (pseudocode):

int compare(x, y) { return x >y ? +1 : -1; }What's even more strange is that sometimes instead of hanging, the program crashes. It all works correctly if I add one line:

int compare2(x, y) { if (x == y) return 0; // added return x >y ? +1 : -1; }What's going on? The array I am sorting contains no duplicate elements, so the two items

`x`

and`y`

should never be equal.

First of all, your first comparison function clearly violates the requirements for being a comparison function: It must return zero if the two items compare equal. As a result, the behavior is undefined, and hanging, crashing, or returning an incorrect result are all legitimate behavior.

But let's dig a little deeper to see why hanging and crashing are plausible responses to an invalid sort comparison function.

A common sorting algorithm is quicksort. I'll leave you to go off and learn on your own how the quicksort algorithm works. Come back when you're done.

Okay, welcome back. The central step in quicksort is the partition, and some variants of the quicksort algorithm rely on the partition element comparing equal to itself in order to remove a comparison from the inner loop. For example, an index may walk through the array looking for an element greater than or equal to the partition, fully expecting at least one such element to be found because in the worst case, it will find the partition itself. But if your comparison function fails to report the partition as equal to itself, this search may run off the end of the array and eventually crash with an access violation when it reaches invalid memory.

That explains the crash, but what about the hang?
Well,
notice that the comparison function is also inconsistent.
In particular,
the anti-symmetry rule is violated:
`compare(x, y)`

and `compare(y, x)`

return the same value (as opposed to the opposite value)
if `x==y`

.
The function returns `-1`

both times,
saying
"`x`

is less than y"
and
"`y`

is less than x" simultaneously.
This inconsistency can easily send a sort algorithm
running back and forth trying to find the right place for
the partition.

The moral of the story is the same: Your comparison function must meet the criteria for a proper comparison function. If you fail to do this, then the results you get will be very strange indeed.

Just use stl::sort. It’s better than quicksort (its worst case is still n log(n) rather than qs’s n*n). And its comparison function is “less than”, which returns bool.

You’re focusing on the details and missing the big picture. What if you’re using a language that STL doesn’t support? (In fact, today’s example came from one such language.) -Raymond]"Just use stl::sort"

Yeah, well there are some standard library classes that accept a functor for comparison. Welcome back to the original problem.

But, the moral of this story goes well beyond sorting algorithms. If you violate the contract of an interface then bad things happen. I see it all the time, both intentional and unintentional.

Serendipity is a wonderful thing. It just so happens that I, too, am working on sorting today. I need to find all possible shortest cyclic subgraphs of a directed graph (not that the graph cannot be acyclic if I’m looking for cycles) and print them out. I was just looking at Kahn’s topological sorting algorithm, but it only applied to DAGs. I have an idea how to modify it, though.

Clearly, using stl::sort is not an option. :)

Um… what does that have to do with sorting? Sounds like an interesting problem, though.

Ah, takes me back to my coding days. A good pred function might be able to help out.

How to sort: Permute the set randomly, and then check to see if it’s sorted. If not, try again.

how to sort: periodically check to see if the set is sorted. Bit errors will eventually result in a sorted set.

@Maurits

It is technically a sorting problem with a partial ordering. Essentially, the nodes in a directed graph can be partially ordered (sorted) such that those node with outgoing edges appear in a list or other data structure before those node into which their edged lead. Check out <http://en.wikipedia.org/wiki/Topological_sorting> for an example. It’s true that this this sorting doesn’t actually compare the value of elements, but it is sorting nonetheless.

That takes way too long, Bogosort is a lot quicker: http://en.wikipedia.org/wiki/Bogosort

I guess this is supposed to be readable example code, but how can you pass up the chance to layer the ?: operators?

return x > y ? 1 : x < y ? -1 : 0;

Hey, don’t knock proton decay – one day, it’ll be the only energy source we have.

VB to the rescue:

Function Compare(X, Y)

Select Case X

Case Is > Y: Compare = 1

Case Y: Compare = 0

Case Is < Y: Compare = -1

End Select

End Function

Simpler, faster, branchless, correct:

int compare(int x, int y)

{

return x – y;

}

Criffer: compare(0x7FFFFFFF, -0x80000000).

And good luck generalizing to objects that don’t support subtraction. -Raymond]@Alexander – you know there are coders out there capable of writing a "less than" operator that only gives a partial ordering.

I’ve been playing with Ruby lately, and it uses the spaceship operator for comparisons. I can easily imagine someone writing a comparator that triggers an infinite loop.

— T

http://en.wikipedia.org/wiki/Bogosort#Quantum_Bogosort

^^^ That "destroy universe" operation doesn’t sound very thread-safe.

When we were developing Motif at OSF, we had a function that would determine a tab order for controls on a form, based on their "reading order" (left to right, top to bottom). The sort predicate was that if the two controls did not overlap in a vertical direction, they were sorted vertically, otherwise in the order of their X-coordinate. This was intended to return the best result when the controls were not nicely arranged on a grid but just jumbled about.

Motif (and the rest of the X stack) being OS-independent, this code gave reasonable results on HP-UX and Ultrix, but crashed very occasionally on Solaris.

It took us a couple of weeks to get to the bottom of it. It is possible to arrange three controls so that, according to this algorithm, A < B < C < A. Solaris’s sort implementation was sufficiently different from the others that it crashed with this combination of results (probably a SVR4/BSD difference). The others managed at least to do something.

The person who asked today’s question is the kind of person who substitutes salt for sugar when cooking.

Maurits:

A clever modern-day VB.NET implementation would be:

Function Compare(X As whatever, Y As whatever) As Integer

Return(If(X > Y, 1, Int(X < Y)))

End Function

This leverages the "new" VB.NET 2008 ternary intrinsic function .. works similar to how IIF() did in previous VB versions, but is now inlined and performs short-circuiting.

@ Criffer:

try x=MAX_INT, y=-1 and see it overflow.

I would love to have had this person walk up to my office and ask me that question.

“The array I am sorting contains no duplicate elements, so the two items x and y should never be equal.”

“Ever think that maybe it compares an element with itself?”

Cue face-palm.

The “operate an element with itself” is an oft-overlooked part of many operators. Particularly operator=. Then people have the gall to ask, “Why would someone set an instance to itself?” or “Why would someone compare an instance to itself?”

Because functions don’t always know that two pointers point to the same instance, maybe?

>Cue face-palmMore likely, you’d get the response, “Why would it do that? What idiot sorts by comparing an element with itself?” -Raymond

]

@Criffer, Phaeron:

I was going to write about the possibility of comparing using subtraction (restricted to ints, as Raymond’s response pointed out), but thinking through it, I realised that overflow could mess things up.

As many of you probably know, the processor (x86, at least) compares integers by performing a subtraction, setting flags according to the result and throwing the result away. However, the conditional branch instructions account for overflow.

The standard C++ sort is only guaranteed the average of O(n*ln n), not the worst case.

Interesting Finds: May 9, 2009

“Why would it do that? What idiot sorts by comparing an element with itself?”

Cue me executing a Picard double face-palm.

After which I’d ask, “And how do you propose a function knows that an element IS itself without a comparison?”

Duh.]`if (i == j) result = 0; else result = comparator(a[i], a[j]);`

. What idiot compares an item to itself? -RaymondThis reminds me of one of my comp. sci assignments: We had to implement several sorting algorithms, and for bonus points, display the operation of each algorithm graphically.

Viewing the various sorting algorithms running was quite fun (and made the way they work really apparent). Quicksort was not much fun to watch, but heapsort was cool.

Oh, and since I was into R.E.M. at the time, I called my program "A Carnival of Sorts". :)

Well, I think that the x<y (without x==y) is pretty good to speed up things. If I need to sort the 3,5,2,4,1,2 array, the result should be 1,2,2,3,4,5. Who cares that the "first" 2 is behind the "second" 2 or the vice versa. The faster it breaks, the less comparations are made. If the algorithm goes beyond the array’s upper bound, it will sort a stable program from a buggy program.

Alin:

You’ve missed the whole point of the post, if you don’t implement the = part, then your ‘sort’ may never finish.

If 2<2 and 2<2 then which one comes first? How do you know they are sorted?

Well, if it is the case that most items being compared will not be equal, you could speed it up and still maintain correct behaviour – just break it up into several tests:

int compare2(x, y)

{

if (x > y) return 1;

if (x < y) return -1;

if (x == y) return 0;

}

More statements/lines in the source language does not necessarily translate to more machine instructions in the object code, especially when optimizations come into the picture.

@Falcon:

That last if should be just a return. If it’s neither smaller nor larger, then it has to be equal. And if there were a third possibility, the function would return an undefined result.

@Falcon

Your function will break for simple floatingpoint numbers. Consider a=NaN and B=Nan

=> none of a<b, a>b and a==b is true

I think even "someone else"’s function might break down here when supplied with an infinity and a NaN, but I don’t remember the comparison rules for all special cases.

If you want to see animated sorts, have a look at this : http://www.sorting-algorithms.com/

BTW, on comparisons using subtraction, will it work if the numbers are unsigned instead of signed?

I trust my readers are smart enough to figure out the answer to this question themselves. -Raymond]Insightful. It’s one of those nitty gritty things that are obvious and yet not so obvious at first sight. Had someone asked this question before but I couldn’t really give a complete explanation like this one.

@someone else:

Yes, you’re quite right (for the integer case, at least). Can’t believe I missed that!

@W:

I admit I’m no FP expert, but I can see your point. Just goes to show how important it is to know exactly what you’re dealing with.

“Well, notice that the comparison function is also inconsistent. In particular, the anti-symmetry rule is violated: compare(x, y) and compare(y, x) return the same value”

Maybe I’m missing it, but I don’t see that. compare(2,1) => 1 while compare(1,2) => -1.

Is lack of coffee making me miss something?

After you finish your coffee, try reading the rest of the sentencce. -Raymond]Quote: More likely, you’d get the response, “Why would it do that? What idiot sorts by comparing an element with itself?” -Raymond

I find that thoughtlessness (or temporary stupidity) is far more common then actual persistent idiocy. Usually a small nudge in the right direction resolves that quickly. Maybe I have been luckier than most in my choice of coworkers?

In real life, you never compare an item against itself. Bubblesort never compares an item against itself. Many quicksort implementations never compare an item against itself. The idea of comparing an item against itself while sorting is actually quite strange. -Raymond]A funny part is that there are more than one standards for comparisons:

* standard C one: int compare(input_type x, input_type y) : Returns 0, or a negative value, or a positive value.

* standard C++ one: bool less(input_type x, input_type y) : Returns true if x is strictly less than y.

It gets awkward when people start mixing the two. Notably, MFC uses the C convention, but I often stumble on code that assumes it uses the C++ one…

@Roman,

A Standard C++ Library implementation included with Microsoft Visual C has such worst case guarantee. I don’t think other compiler vendors will settle for anything worse then.

I was curious why .NET chose QuickSort for their sorting, because its worst case is n^2 while HeapSort is always (n)Log(n). I heard that other factors come into play, like memory management etc. Any Ideas?

Oops: QuickSort does allocate memory, but it’s usually implicit. e.g. Variables in recursive calls.

When I said it doesn’t require any memory I meant it doesn’t need to make temporary copies of the data itself. It does still need some memory for bookkeeping.

@Todd Wilder:

If you look up QuickSort on Wikipedia it explains the likely reason.

QuickSort and HeapSort are both in-place (i.e. they don’t need to allocate any temporary storage for the sorting).

QuickSort’s worst case is worse than HeapSort’s, however QuickSort is still *usually* faster than HeapSort.

i.e. On average QuickSort is so good that its "bad days" are worth living with.

Of course, if you know that your data is going to be arranged such that QuickSort will very often be slow then you’d want to use a different algorithm. If you don’t know, though, then QuickSort is likely to be a good bet.

(While at uni I had the privilege of meeting and attending some lectures by QuickSort’s inventor, C.A.R. Hoare.

Many of my other tutors & lecturers had done impressive things, and Hoare himself has done a huge amount of other, more important/complex work, but there was something really cool about sharing tea and biscuits (among with other students in my year) with the person who invented QuickSort. Still the defacto sorting algorithm in just about every library despite being invented in the 60s.)

There’s always IntroSort (introspective sort). Average runtimes of quicksort, but worst case is still O(nlogn). The trick is it starts via quicksort, but it monitors itself and if it sees it actually has a quicksort taking O(n^2), it stops and goes heapsort.

Quicksort actually has well-known corner cases that cause it to go O(n^2), and most of them are unlikely to happen if you pay some attention. (I believe one of them is if the list is sorted backwards, in which case, swapping manually would be far faster).

The only bad thing is that quicksort isn’t stable – two items with the same value may get swapped. It matters if the comparison item is a subset of the entire item (e.g., sort by name an address book – if you have two people with identical names, quicksorting may swap the records around.)

When I want a stable sort, I use a merge-sort: A second buffer with the same size as the first, a copy only if necessary (whether log2(n) is even or odd), and a guaranteed O(nlogn) in all cases, best and worst.

A quicksort algorithm can be implemented not to swap identical records, leaving them in the original order.

Medinoc:

There is also an in-place variant of MergeSort() but I am not sure if it offers O(n log n) as an upper-bound.

It also seems to be quite a bit more complicated based on what little I have read about that variant.

The standard MergeSort() has the obvious advantage that it is trivial to parallelize, and the symmetry between best and worst case is a very attractive quality when writing for robustness (carefully crafted data to DoS a server using its sort algorithms weakness has been used at least once, targeting the Free Internet Chess Server)

I have to mention my favorite Syncsort ad again: From Computerworld magazine, in the heyday of mainframes, the full-page ad featured a computer operations manager claiming that using the Syncsort product allowed them to run their production jobs in 240% less time!