# The wrong way of benchmarking the most efficient integer comparison function

On StackOverflow, there's a question about the most efficient way to compare two integers and produce a result suitable for a comparison function, where a negative value means that the first value is smaller than the second, a positive value means that the first value is greater than the second, and zero means that they are equal.

There was much microbenchmarking of various options, ranging from the straightforward

```
int compare1(int a, int b)
{
    if (a < b) return -1;
    if (a > b) return 1;
    return 0;
}
```

to the clever

```
int compare2(int a, int b)
{
    return (a > b) - (a < b);
}
```

to the hybrid

```
int compare3(int a, int b)
{
    return (a < b) ? -1 : (a > b);
}
```

to inline assembly

```
int compare4(int a, int b)
{
    __asm__ __volatile__ (
        "sub %1, %0 \n\t"
        "jno 1f \n\t"
        "cmc \n\t"
        "rcr %0 \n\t"
        "1: "
        : "+r"(a)
        : "r"(b)
        : "cc");
    return a;
}
```

The benchmark pitted the comparison functions against each other by comparing random pairs of numbers and adding up the results to prevent the code from being optimized out.

But here's the thing: Adding up the results is completely unrealistic.

There are no meaningful semantics that could be applied to a sum of numbers for which only the sign is significant. No program that uses a comparison function will add the results. The only thing you can do with the result is compare it against zero and take one of three actions based on the sign.

Adding up all the results means that you're not using the function in a realistic way, which means that your benchmark isn't realistic.

Let's try to fix that. Here's my alternative test:

```
// Looks for "key" in sorted range [first, last) using the
// specified comparison function. Returns an iterator to the
// found item, or "last" if the item is not found.

template<typename It, typename T, typename Comp>
It binarySearch(It first, It last, const T& key, Comp compare)
{
    // Invariant: if key exists, it is in the range [first, first+length).
    // This binary search avoids the integer overflow problem
    // by operating on lengths rather than ranges.
    auto length = last - first;
    while (length > 0) {
        auto step = length / 2;
        It it = first + step;
        auto result = compare(*it, key);
        if (result < 0) {
            first = it + 1;
            length -= step + 1;
        } else if (result == 0) {
            return it;
        } else {
            length = step;
        }
    }
    return last;
}

int main(int argc, char **argv)
{
    // Initialize the array with sorted even numbers.
    int a[8192];
    for (int i = 0; i < 8192; i++) a[i] = i * 2;

    for (int iterations = 0; iterations < 1000; iterations++) {
        int correct = 0;
        for (int j = -1; j < 16383; j++) {
            auto it = binarySearch(a, a+8192, j, COMPARE);
            if (j < 0 || j > 16382 || j % 2) correct += it == a+8192;
            else correct += it == a + (j / 2);
        }
        // If correct != 16384, then we have a bug somewhere.
        if (correct != 16384) return 1;
    }
    return 0;
}
```

Let's look at the code generation for the various comparison functions. I used gcc.godbolt.org with x86-64 gcc 7.2 and optimization `-O3`.

If we try `compare1`, then the binary search looks like this:

```
    ; on entry, esi is the value to search for

    lea rdi, [rsp-120]          ; rdi = first
    mov edx, 8192               ; edx = length
    jmp .L9
.L25:                           ; was greater than
    mov rdx, rax                ; length = step
    test rdx, rdx               ; while (length > 0)
    jle .L19
.L9:
    mov rax, rdx                ;
    sar rax, 1                  ; rax = step = length / 2
    lea rcx, [rdi+rax*4]        ; it = first + step

    ; result = compare(*it, key), and then test the result
    cmp dword ptr [rcx], esi    ; compare(*it, key)
    jl .L11                     ; if less than
    jne .L25                    ; if not equal (therefore if greater than)
    ... return value in rcx     ; if equal, answer is in rcx

.L11:                           ; was less than
    add rax, 1                  ; step + 1
    lea rdi, [rcx+4]            ; first = it + 1
    sub rdx, rax                ; length -= step + 1
    test rdx, rdx               ; while (length > 0)
    jg .L9
.L19:
    lea rcx, [rsp+32648]        ; rcx = last
    ... return value in rcx
```

Exercise: Why is `rsp - 120` the start of the array?

Observe that despite using the lamest, least-optimized comparison function, we got comparison-and-test code that is pretty much what we would have written if we had done it in assembly language ourselves: Compare the two values, then follow up with two branches based on the same shared flags. The comparison is still there, but the calculation and testing of the return value are gone.

In other words, not only was `compare1` optimized down to one `cmp` instruction, but it also managed to delete instructions from the `binarySearch` function too. It had a net cost of negative instructions!

What happened here? How did the compiler manage to optimize out all our code and leave us with the shortest possible assembly language equivalent?

Simple: First, the compiler did some constant propagation. After inlining the `compare1` function, the compiler saw this:

```
    int result;
    if (*it < key) result = -1;
    else if (*it > key) result = 1;
    else result = 0;

    if (result < 0) {
        ... less than ...
    } else if (result == 0) {
        ... equal to ...
    } else {
        ... greater than ...
    }
```

The compiler realized that it already knew whether constants were greater than, less than, or equal to zero, so it could remove the test against `result` and jump straight to the answer:

```
    int result;
    if (*it < key) { result = -1; goto less_than; }
    else if (*it > key) { result = 1; goto greater_than; }
    else { result = 0; goto equal_to; }

    if (result < 0) {
    less_than:
        ... less than ...
    } else if (result == 0) {
    equal_to:
        ... equal to ...
    } else {
    greater_than:
        ... greater than ...
    }
```

And then it saw that all of the tests against `result` were unreachable code, so it deleted them.

```
    int result;
    if (*it < key) { result = -1; goto less_than; }
    else if (*it > key) { result = 1; goto greater_than; }
    else { result = 0; goto equal_to; }

less_than:
    ... less than ...
    goto done;

equal_to:
    ... equal to ...
    goto done;

greater_than:
    ... greater than ...
done:
```

That then left `result` as a write-only variable, so it too could be deleted:

```
    if (*it < key) { goto less_than; }
    else if (*it > key) { goto greater_than; }
    else { goto equal_to; }

less_than:
    ... less than ...
    goto done;

equal_to:
    ... equal to ...
    goto done;

greater_than:
    ... greater than ...
done:
```

Which is equivalent to the code we wanted all along:

```
    if (*it < key) {
        ... less than ...
    } else if (*it > key) {
        ... greater than ...
    } else {
        ... equal to ...
    }
```

The last optimization is realizing that the test in the `else if` could use the flags left over by the `if`, so all that was left was the conditional jump.

Some very straightforward optimizations took our very unoptimized (but easy-to-analyze) code and turned it into something much more efficient.

On the other hand, let's look at what happens with, say, the second comparison function:

```
    ; on entry, edi is the value to search for

    lea r9, [rsp-120]           ; r9 = first
    mov ecx, 8192               ; ecx = length
    jmp .L9
.L11:                           ;
    test eax, eax               ; result == 0?
    je .L10                     ; Y: found it
                                ; was greater than
    mov rcx, rdx                ; length = step
    test rcx, rcx               ; while (length > 0)
    jle .L19
.L9:
    mov rdx, rcx
    xor eax, eax                ; return value of compare2
    sar rdx, 1                  ; rdx = step = length / 2
    lea r8, [r9+rdx*4]          ; it = first + step

    ; result = compare(*it, key), and then test the result
    mov esi, dword ptr [r8]     ; esi = *it
    cmp esi, edi                ; compare *it with key
    setl sil                    ; sil = 1 if less than
    setg al                     ; al  = 1 if greater than
                                ; eax = 1 if greater than
    movzx esi, sil              ; esi = 1 if less than
    sub eax, esi                ; result = (greater than) - (less than)
    cmp eax, -1                 ; less than zero?
    jne .L11                    ; N: Try zero or positive

    ; was less than
    add rdx, 1                  ; step + 1
    lea r9, [r8+4]              ; first = it + 1
    sub rcx, rdx                ; length -= step + 1
    test rcx, rcx               ; while (length > 0)
    jg .L9
.L19:
    lea r8, [rsp+32648]         ; r8 = last
.L10:
    ... return value in r8
```

The second comparison function `compare2` uses the relational comparison operators to generate exactly 0 or 1. This is a clever way of generating `-1`, `0`, or `+1`, but unfortunately, that was not our goal in the grand scheme of things. It was merely a step toward that goal. The way that `compare2` calculates the result is too complicated for the optimizer to understand, so it just does its best at calculating the formal return value from `compare2` and testing its sign. (The compiler does realize that the only possible negative value is `-1`, but that's not enough insight to let it optimize the entire expression away.)

If we try `compare3`, we get this:

```
    ; on entry, esi is the value to search for

    lea rdi, [rsp-120]          ; rdi = first
    mov ecx, 8192               ; ecx = length
    jmp .L12
.L28:                           ; was greater than
    mov rcx, rax                ; length = step
.L12:
    mov rax, rcx
    sar rax, 1                  ; rax = step = length / 2
    lea rdx, [rdi+rax*4]        ; it = first + step

    ; result = compare(*it, key), and then test the result
    cmp dword ptr [rdx], esi    ; compare(*it, key)
    jl .L14                     ; if less than
    jle .L13                    ; if less than or equal (therefore equal)

    ; "length" is in eax now
.L15:                           ; was greater than
    test eax, eax               ; length == 0?
    jg .L28                     ; N: continue looping
    lea rdx, [rsp+32648]        ; rdx = last
.L13:
    ... return value in rdx

.L14:                           ; was less than
    add rax, 1                  ; step + 1
    lea rdi, [rdx+4]            ; first = it + 1
    sub rcx, rax                ; length -= step + 1
    mov rax, rcx                ; rax = length
    jmp .L15
```

The compiler was able to understand this version of the comparison function: It observed that if `a < b`, then the result of `compare3` is always negative, so it jumped straight to the less-than case. Otherwise, it observed that the result was zero if `a` is not greater than `b` and jumped straight to that case too. The compiler did have some room for improvement with the placement of the basic blocks, since there is an unconditional jump in the inner loop, but overall it did a pretty good job.

The last case is the inline assembly with `compare4`. As you might expect, the compiler had the most trouble with this.

```
    ; on entry, edi is the value to search for

    lea r8, [rsp-120]           ; r8 = first
    mov ecx, 8192               ; ecx = length
    jmp .L12
.L14:                           ; zero or positive
    je .L13                     ; zero - done
                                ; was greater than
    mov rcx, rdx                ; length = step
    test rcx, rcx               ; while (length > 0)
    jle .L22
.L12:
    mov rdx, rcx
    sar rdx, 1                  ; rdx = step = length / 2
    lea rsi, [r8+rdx*4]         ; it = first + step

    ; result = compare(*it, key), and then test the result
    mov eax, dword ptr [rsi]    ; eax = *it
    sub eax, edi
    jno 1f
    cmc
    rcr eax, 1
1:
    test eax, eax               ; less than zero?
    jns .L14                    ; N: Try zero or positive

    ; was less than
    add rdx, 1                  ; step + 1
    lea r8, [rsi+4]             ; first = it + 1
    sub rcx, rdx                ; length -= step + 1
    test rcx, rcx               ; while (length > 0)
    jg .L12
.L22:
    lea rsi, [rsp+32648]        ; rsi = last
.L13:
    ... return value in rsi
```

This is pretty much the same as `compare2`: The compiler has no insight at all into the inline assembly, so it just dumps it into the code like a black box, and then once control exits the black box, it checks the sign in a fairly efficient way. But it had no real optimization opportunities because you can't really optimize inline assembly.

The conclusion of all this is that optimizing the instruction count in your finely-tuned comparison function is a fun little exercise, but it doesn't necessarily translate into real-world improvements. In our case, we focused on optimizing the code that encodes the result of the comparison without regard for how the caller is going to decode that result. The contract between the two functions is that one function needs to package some result, and the other function needs to unpack it. But we discovered that the more obtusely we wrote the code for the packing side, the less likely the compiler would be able to see how to optimize out the entire hassle of packing and unpacking in the first place. In the specific case of comparison functions, it means that you may want to return `+1`, `0`, and `-1` explicitly rather than calculating those values in a fancy way, because it turns out compilers are really good at optimizing "compare a constant with zero".

You have to see how your attempted optimizations fit into the bigger picture because you may have hyper-optimized one part of the solution to the point that it prevents deeper optimizations in other parts of the solution.

Bonus chatter: If the comparison function is not inlined, then all of these optimization opportunities disappear. But I personally wouldn't worry about it too much, because if the comparison function is not inlined, then the entire operation is going to be dominated by the function call overhead: Setting up the registers for the call, making the call, returning from the call, testing the result, and most importantly, the lost register optimization opportunities not only because the compiler loses opportunities to enregister values across the call, but also because the compiler has to protect against the possibility that the comparison function will mutate global state and consequently create aliasing issues.


Once again, it turns out that readable code means good (and efficient) code. While I have no insight into how compilers evolved (nor have I used any pre-2010 compilers), this kind of optimization might have made sense 10 years ago, back when optimizers weren't as smart as they are today?

2. Mr Cranky says:

Seems to me the obvious way to code this is:

int compare0(int a, int b)
{
return (a-b);
}

Awfully long discussion that boils down to Dr. Knuth’s long-standing axiom that the root of all evil is premature optimization.

1. And then it returns the wrong value for `compare0(-1610612736, 1610612736)`.

1. henke37 says:

Given that `int` can be as small as a `short int`, the call is wrong to begin with. Really, the function is only safe if you know the range limitations.

1. Okay, let me rephrase it then: It returns the wrong value for `compare0(INT_MIN, INT_MAX)`.

2. Viila says:

I wonder how the corrected version
int64_t compare64(int32_t a, int32_t b)
{
return int64_t(a) - int64_t(b);
}
compares to the others, especially if compiled as x64 instead of x86.

1. Joshua says:

It probably doesn’t work. The prototype returns int.

I can make it execute in zero cycles if it doesn’t have to work.

2. No need to ask. Type it into gcc.godbolt.org and find out! (Note that gcc.godbolt.org doesn’t support 32-bit compilers so the extra cost of 64-bit arithmetic on 32-bit systems is not visible there.)

1. Viila says:

I can, but I’m afraid I don’t know how to properly interpret the results because I don’t speak x86/x64 assembly.

(Joshua, the prototype was templated and the return value was assigned into an ‘auto’ variable, so it wouldn’t be truncated either. It should work.)

2. ErikF says:

If you look at the assembly results, you can hover over the instructions (or right-click and choose “View asm doc”) and you’ll get a brief synopsis of what the instruction does. Also, you can click on “Show optimization output (clang only)” to see what choices the optimizer made for each line of code (gcc has a similar option.) If you grok ARM(32/64), AVR, MIPS(32/64), MSP430, or PPC better, those compilers are available too, as well as a couple of other compilers (MSVC++, elicc, icc, and Zapcc). It’s pretty cool: I wish I had this when I first started out learning programming!

3. Rick C says:

I had a friend in college who wrote, as part of his first programming course, a version of a factorial program. If the caller passed in a number greater than whatever would result in integer overflow, it would return 0 and print out "I only work with small, non-negative integers."

“premature optimization is the root of all evil”

Is very out of context; full quote is:

“Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”

3. creaothceann says:

“@Raymond I’m forced to agree.”

Heh.

4. The dangers of being too clever.

5. pmbAustin says:

As compilers get smarter, the sins of “premature optimization” have worse and worse consequences.

Always write the code to be understood by some human being that comes after you and has to maintain it.

1. yukkuri says:

Seriously. Let the compiler do its job and you do -your- job: writing maintainable code.

6. Ted Spence says:

Somebody once had a quote about “Premature Optimization”. I’m not sure who it was, maybe someone famous. /s

Also: we tend to spend many times more hours reading our code rather than writing it. If you optimize for readability, chances are you have a simpler bit of code, and the compiler will be able to do more optimizations.

If you write the world’s most clever code, chances are the poor intern who tries to fix a bug six years later will have trouble, and the compiler won’t like it either.

Also: I had not seen this compiler explorer website before ( https://gcc.godbolt.org/ ). This is the greatest thing since compiled bread. If only I still wrote c++! I sometimes regret switching to C# and leaving the world of pointer math and insane templates behind.


2. smf says:

Premature optimisation is subjective: one person thought it was right, and someone else said it was done too early. We assume the person who comes along later and calls it premature optimisation is correct, but maybe at the time it was the right thing to do. Or maybe not; who knows.

I’ve been optimising some 30 year old 6502 code, does that count as premature or not?

1. Darran Rowe says:

I always see the term premature optimisation as quite objective.
What it means is an attempt at manual optimisation before you know if what you are intending to optimise is really the bottle neck. So any time you look at code and think “that looks slow” then it is likely premature.
So basically if you have tested and timed and after all this you find that you are in a certain code path a lot and that is causing you a lot of slow down, then that is the code that you should optimise.

1. Ted Spence says:

Agreed – everything is subjective until you measure it!

Shorthand: Write code that is easy to read until you have a measurement.

7. Kevin says:

> But it had no real optimization opportunities because you can’t really optimize inline assembly.

At first I found this confusing, but then I realized why: Inline assembly cannot be decompiled into an intermediate representation, because it does not follow the semantics of the C abstract machine. Even assuming you have a really good decompiler available, you cannot apply as-if rule transformations to decompiled inline assembly. You might end up altering the meaning of the assembly in ways that the C standard does not consider “observable behavior,” but which nonetheless leave the CPU or other components in a different state from what the programmer expected. While this might be legal because asm is implementation-defined, it is obviously unhelpful to the point that the average user would call it a bug. So the compiler is obligated to treat the assembly as a black box, as Raymond says.

1. Paul Topping says:

It is possible to optimize inline assembler but there’s not much reason a C/C++ compiler would provide that functionality. The inline assembler feature exists solely to allow the programmer to specify an instruction sequence exactly and expects the compiler not to change it.

1. Joshua says:

Correct. The only purpose left for asm blocks is to give the compiler information it cannot possibly have.

About half the asm blocks I see don’t have any assembler code in them, just compiler hints (hey compiler, this empty asm block modifies this local variable).

2. Jon says:

The whole point of inline assembly is for when you want to do something the compiler doesn’t allow.

Perhaps you are using an assembly instruction the compiler doesn’t allow. For example: CPUID, when it first came out.

Or perhaps you have some heavily-optimized code that you’ve written taking into account caches and things that the compiler isn’t smart enough to optimize on its own.

Or perhaps you’re doing something really weird where you need exact control over the instructions. For example, a popular open-source kernel has a copy-from-userspace macro that uses inline assembly, and emits special exception-handling instructions so that a segfault during the copy can be handled specially. Another example: detecting 386 vs 486 processor by emitting instructions that get processed differently on the two processors. A third example: An open source memory debugging tool runs a program in a sort-of-VM, with JIT compilation to add checks to all memory accesses, and it has a magic sequence of instructions that do nothing useful on a normal CPU but that precise sequence of instructions are detected by the JIT and allow you to pass information to the memory debugger.

In ALL of those cases, there’s nothing sensible or helpful that the compiler’s optimizer can do, and if it touches the code at all then it’s likely to break it. So compilers do not try to “optimize” inline assembly, and this is the correct behaviour.

In the cases where the compiler’s optimizer could do something useful, such as writing SSE code, then the opcodes are understood and supported by the compiler. In those cases, if you want the compiler to optimize for you, you could and should be using compiler intrinsics instead.

8. Paul Topping says:

Good post! As you point out at the end, while modern compilers do a wonderful job of optimization, far beyond what most programmers would come up with given limited time and imagination, there are still reasons for programmers to look at what the compiler generated in order to do further optimization but the task has changed. Instead of trying to improve on the compiler’s code generation, the programmer should look for things that the source code implicitly asks to be computed and remove those that are not needed to achieve the program’s purpose. Perhaps some future compiler could output suggestions along this line.

9. Roman says:

I tried pasting your code into Compiler Explorer, but got different disassembly. Did you use any options other than -O3?

1. Hm, the output is slightly different for `compare1` now. The conclusions still stand, though.

10. pm100 says:

This discussion completely ignores what happens once the CPU gets hold of it. CPU effectively runs another optimizer on the code , but at run time. I was surprised that you didnt actually time the execution, rather than just reading the generated code. I totally agree that people are fixated on ‘less lines means faster’.

1. Antonio Rodríguez says:

The object code for the different methods is very different. You can assume that three instructions will execute faster than nine instructions, so compare1 and compare3 will be faster without doubt. This can be false if something goes wrong (e.g., an unexpected cache miss or pipeline flush), but in that case, any optimization is futile.

1. Dmitry says:

You can’t. You once could but not these days. Modern processors do a lot of runtime optimizations. Three instructions might generally show worse results than nine equivalent instructions due to pipelining issues (instructions pairing, for example). Besides, I once saw a (synthetic) test where a floating-point instruction in a separate procedure turned out to be faster when timed than the same instruction inlined (the call cost turned out to be negative) with processors of one major CPU vendor and slightly slower on processors of another one, even though the name of the second vendor was shorter than that of the first :-)

2. smf says:

>You can assume that three instructions will execute faster than nine instructions

You can assume that, but that might not be a correct assumption. It would be nice if you could see the timing for each different cpu, but it’s a huge amount of work.

3. Antonio Rodríguez says:

I’d agree in case of a smaller difference (i.e., four/five instructions versus three). But in the case of nine to three, you’d have to go to the most extreme case for the nine instructions to be faster (nine simple register-to-register instructions without memory access which translate into a single micro-op each versus three register-to-memory with complex addressing modes which can translate to three or four micro-ops each).

Both compare1 and compare3 are a memory (register indirect, without indexing) to register compare (two micro-ops) and two conditional branches (one micro-op each). A total of four micro-ops. I can’t imagine how a processor can optimize at run time nine instructions in less than four micro-ops, when the compiler hasn’t been able to do the same at compile time.

You can argue about out-of-order execution, pipeline stalls and the like. But when you take into account that compare1 is a strict subset of compare2 (in other words, compare2 contains all three compare1 instructions plus six additional ones), it’s difficult to imagine a case where the longer sequence is less probable to get stalled.

4. Patrick says:

You can most definitely not assume that.

Comparisions actually being one of the most common cases nowadays since conditional jumps can really screw with branch prediction if the pattern isn’t, well, predictable. In this case I don’t think it really matters unless you can do the entire operation the comparision is used for without conditional branching, but there are some cases where it definitely does.

For example, binary searches perform much worse than you’d expect from just looking at the instructions themselves, since by definition they branch in the worst possible pattern (0.5/0.5 taken/not taken).
A linear search is actually faster for N less than some surprisingly large number; the rule of thumb says around 8 or 10, depending on who you ask.

And for some other, more traditional, examples: x86 has quite a few 'big' instructions that are actually slower than just doing the same thing on your own. The real classic here being LOOP, which no performance-aware assembly language programmer or compiler has ever used. Even last time I benched it on a modern system it was significantly slower than the equivalent DEC + JNZ (roughly 2x). I suppose the modern CPUs could very well optimize it to be just as fast but there's no need to because of this.
(And DEC is in turn slower than SUB reg,1 on modern CPUs because of a historical design mistake which gives DEC a dependency on a flag which SUB lacks…)

11. Bender says:

For future posts you can use http://quick-bench.com/ for quick comparisons

12. Bram Speeckaert says:

The -120 comes from GCC taking advantage of the 128-byte red zone. The AMD64 ABI guarantees that the 128 bytes beyond whatever rsp points to will never get modified by interrupt handlers, so it’s okay to grow the stack pointer by slightly less than you actually need and just make the array start in that memory region.

13. smf says:

“with x86-64 gcc 7.2 and optimization -O3.”

wait, what? godbolt supports msvc too, so why did you show us the result of gcc????

1. Yeesh, I poke my head out of the bubble and you tell me to get back in the bubble. It’s not like this issue applies only to Microsoft compilers.

1. Joshua says:

I agree. Yeesh.

2. smf says:

My faux surprise was a fishing attempt in case there was some interesting news that you wouldn’t answer a direct question about.

You can stay out of the bubble :D

3. xcomcmdr says:

“Breaking news – MSVC might be replaced by gcc, a Microsoft lead architect says.”

(Oh God, what have I done…)

2. ErikF says:

Why not? Almost every compiler does things differently, and it’s instructive to see what the differences are (and why they’re done that way.) For example, it appears that clang prefers cmov instructions, which is a potential speed boost but is a problem if you have to support pre-P6 processors.

14. Peter Doubleday says:

My immediate thought was “if it isn’t inlined, the cost is dwarfed by stack operations.” I’m not specifically (C++ or C99, and let’s be honest, that’s the level of language we’re talking here) bothered about offering up a comparison function that “possibly” mutates global state. Well, I would be, if the compiler in question doesn’t respect the const qualifier. But then I’d have other worries…

I love this post. I like the argument that “if you’re going to think about optimisation, think very carefully about it,” which is actually Knuth’s corollary. May I also recommend Joe Duffy’s thoughts on the matter to your readers?

joeduffyblog.com/2010/09/06/the-premature-optimization-is-evil-myth/