The wrong way of benchmarking the most efficient integer comparison function


On StackOverflow, there's a question about the most efficient way to compare two integers and produce a result suitable for a comparison function, where a negative value means that the first value is smaller than the second, a positive value means that the first value is greater than the second, and zero means that they are equal.

There was much microbenchmarking of various options, ranging from the straightforward

int compare1(int a, int b)
{
    if (a < b) return -1;
    if (a > b) return 1;
    return 0;
}

to the clever

int compare2(int a, int b)
{
    return (a > b) - (a < b);
}

to the hybrid

int compare3(int a, int b)
{
    return (a < b) ? -1 : (a > b);
}

to inline assembly

int compare4(int a, int b)
{
    __asm__ __volatile__ (
        "sub %1, %0 \n\t"
        "jno 1f \n\t"
        "cmc \n\t"
        "rcr %0 \n\t"
        "1: "
    : "+r"(a)
    : "r"(b)
    : "cc");
    return a;
}

The benchmark pitted the comparison functions against each other by comparing random pairs of numbers and adding up the results to prevent the code from being optimized out.
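
To make that concrete, here is a minimal sketch of that style of benchmark (my reconstruction, not the actual code from the question; the function name and the use of rand() are assumptions):

#include <cstdlib>

// A sketch of the "add up the results" style of benchmark: only the sign
// of the comparison is meaningful, yet the harness sums the raw return
// values just to keep the optimizer from deleting the calls.
long long sumOfComparisons(int iterations)
{
    long long sum = 0;
    for (int i = 0; i < iterations; i++) {
        int a = std::rand();
        int b = std::rand();
        sum += compare1(a, b); // or compare2, compare3, compare4
    }
    return sum;
}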

But here's the thing: Adding up the results is completely unrealistic.

There are no meaningful semantics that could be applied to a sum of numbers for which only the sign is significant. No program that uses a comparison function will add the results. The only thing you can do with the result is compare it against zero and take one of three actions based on the sign.

Adding up all the results means that you're not using the function in a realistic way, which means that your benchmark isn't realistic.

Let's try to fix that. Here's my alternative test:

// Looks for "key" in sorted range [first, last) using the
// specified comparison function. Returns iterator to found item,
// or last if not found.

template<typename It, typename T, typename Comp>
It binarySearch(It first, It last, const T& key, Comp compare)
{
 // invariant: if key exists, it is in the range [first, first+length)
 // This binary search avoids the integer overflow problem
 // by operating on lengths rather than ranges.
 auto length = last - first;
 while (length > 0) {
  auto step = length / 2;
  It it = first + step;
  auto result = compare(*it, key);
  if (result < 0) {
   first = it + 1;
   length -= step + 1;
  } else if (result == 0) {
   return it;
  } else {
   length = step;
  }
 }
 return last;
}

int main(int argc, char **argv)
{
 // initialize the array with sorted even numbers
 int a[8192];
 for (int i = 0; i < 8192; i++) a[i] = i * 2;

 for (int iterations = 0; iterations < 1000; iterations++) {
  int correct = 0;
  for (int j = -1; j < 16383; j++) {
   auto it = binarySearch(a, a+8192, j, COMPARE);
   if (j < 0 || j > 16382 || j % 2) correct += it == a+8192;
   else correct += it == a + (j / 2);
  }
  // if correct != 16384, then we have a bug somewhere
  if (correct != 16384) return 1;
 }
 return 0;
}
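
(In case you're wondering about COMPARE: it stands for whichever comparison function is under test. One way to select it, assuming the file is saved as bench.cpp, is to define the macro on the command line, for example g++ -O3 -DCOMPARE=compare1 bench.cpp.)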

Let's look at the code generation for the various comparison functions. I used gcc.godbolt.org with x86-64 gcc 7.2 and optimization -O3.

If we try compare1, then the binary search looks like this:

    ; on entry, esi is the value to search for

    lea rdi, [rsp-120]          ; rdi = first
    mov edx, 8192               ; edx = length
    jmp .L9
.L25:                           ; was greater than
    mov rdx, rax                ; length = step
    test rdx, rdx               ; while (length > 0)
    jle .L19
.L9:
    mov rax, rdx                ;
    sar rax, 1                  ; eax = step = length / 2
    lea rcx, [rdi+rax*4]        ; it = first + step

    ; result = compare(*it, key), and then test the result
    cmp dword ptr [rcx], esi    ; compare(*it, key)
    jl .L11                     ; if less than
    jne .L25                    ; if not equal (therefore if greater than)
    ... return value in rcx     ; if equal, answer is in rcx

.L11:                           ; was less than
    add rax, 1                  ; step + 1
    lea rdi, [rcx+4]            ; first = it + 1
    sub rdx, rax                ; length -= step + 1
    test rdx, rdx               ; while (length > 0)
    jg .L9
.L19:
    lea rcx, [rsp+32648]        ; rcx = last
    ... return value in rcx

Exercise: Why is rsp - 120 the start of the array?

Observe that despite using the lamest, least-optimized comparison function, we got comparison-and-test code that is pretty much what we would have written if we had done it in assembly language ourselves: We compare the two values and then follow up with two branches based on the same shared flags. The comparison is still there, but the calculation and testing of the return value are gone.

In other words, not only was compare1 optimized down to one cmp instruction, but it also managed to delete instructions from the binarySearch function too. It had a net cost of negative instructions!

What happened here? How did the compiler manage to optimize out all our code and leave us with the shortest possible assembly language equivalent?

Simple: First, the compiler did some constant propagation. After inlining the compare1 function, the compiler saw this:

    int result;
    if (*it < key) result = -1;
    else if (*it > key) result = 1;
    else result = 0;
    if (result < 0) {
      ... less than ...
    } else if (result == 0) {
      ... equal to ...
    } else {
      ... greater than ...
    }

The compiler realized that it already knew whether each of the constants -1, 1, and 0 was less than, equal to, or greater than zero, so it could skip the later tests against result and jump straight to the answer:

    int result;
    if (*it < key) { result = -1; goto less_than; }
    else if (*it > key) { result = 1; goto greater_than; }
    else { result = 0; goto equal_to; }
    if (result < 0) {
less_than:
      ... less than ...
    } else if (result == 0) {
equal_to:
      ... equal to ...
    } else {
greater_than:
      ... greater than ...
    }

And then it saw that all of the tests against result were unreachable code, so it deleted them.

    int result;
    if (*it < key) { result = -1; goto less_than; }
    else if (*it > key) { result = 1; goto greater_than; }
    else { result = 0; goto equal_to; }

less_than:
      ... less than ...
      goto done;

equal_to:
      ... equal to ...
      goto done;

greater_than:
      ... greater than ...
done:

That then left result as a write-only variable, so it too could be deleted:

    if (*it < key) { goto less_than; }
    else if (*it > key) { goto greater_than; }
    else { goto equal_to; }

less_than:
      ... less than ...
      goto done;

equal_to:
      ... equal to ...
      goto done;

greater_than:
      ... greater than ...
done:

Which is equivalent to the code we wanted all along:

    if (*it < key) {
      ... less than ...
    } else if (*it > key) {
      ... greater than ...
    } else {
      ... equal to ...
    }

The last optimization is realizing that the test in the else if could use the flags left over by the if, so all that was left was the conditional jump.

Some very straightforward optimizations took our very unoptimized (but easy-to-analyze) code and turned it into something much more efficient.

On the other hand, let's look at what happens with, say, the second comparison function:

    ; on entry, edi is the value to search for

    lea r9, [rsp-120]           ; r9 = first
    mov ecx, 8192               ; ecx = length
    jmp .L9
.L11:                           ;
    test eax, eax               ; result == 0?
    je .L10                     ; Y: found it
                                ; was greater than
    mov rcx, rdx                ; length = step
    test rcx, rcx               ; while (length > 0)
    jle .L19
.L9:
    mov rdx, rcx
    xor eax, eax                ; return value of compare2
    sar rdx, 1                  ; rdx = step = length / 2
    lea r8, [r9+rdx*4]          ; it = first + step

    ; result = compare(*it, key), and then test the result
    mov esi, dword ptr [r8]     ; esi = *it
    cmp esi, edi                ; compare *it with key
    setl sil                    ; sil = 1 if less than
    setg al                     ; al  = 1 if greater than
                                ; eax = 1 if greater than
    movzx esi, sil              ; esi = 1 if less than
    sub eax, esi                ; result = (greater than) - (less than)
    cmp eax, -1                 ; less than zero?
    jne .L11                    ; N: Try zero or positive

                                ; was less than
    add rdx, 1                  ; step + 1
    lea r9, [r8+4]              ; first = it + 1
    sub rcx, rdx                ; length -= step + 1
    test rcx, rcx               ; while (length > 0)
    jg .L9
.L19:
    lea r8, [rsp+32648]         ; r8 = last
.L10:
    ... return value in r8

The second comparison function compare2 uses the relational comparison operators to generate exactly 0 or 1. This is a clever way of generating -1, 0, or +1, but unfortunately, that was not our goal in the grand scheme of things. It was merely a step toward that goal. The way that compare2 calculates the result is too complicated for the optimizer to understand, so it just does its best at calculating the formal return value from compare2 and testing its sign. (The compiler does realize that the only possible negative value is -1, but that's not enough insight to let it optimize the entire expression away.)
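
For comparison with the compare1 walkthrough above, here (in the same schematic style) is roughly what the compiler is left with after inlining compare2: the difference of the two comparison results has to be materialized into result and then tested all over again.

    int result = (*it > key) - (*it < key);
    if (result < 0) {
      ... less than ...
    } else if (result == 0) {
      ... equal to ...
    } else {
      ... greater than ...
    }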

If we try compare3, we get this:

    ; on entry, esi is the value to search for

    lea rdi, [rsp-120]          ; rdi = first
    mov ecx, 8192               ; ecx = length
    jmp .L12
.L28:                           ; was greater than
    mov rcx, rax                ; length = step
.L12:
    mov rax, rcx
    sar rax, 1                  ; rax = step = length / 2
    lea rdx, [rdi+rax*4]        ; it = first + step

    ; result = compare(*it, key), and then test the result
    cmp dword ptr [rdx], esi    ; compare(*it, key)
    jl .L14                     ; if less than
    jle .L13                    ; if less than or equal (therefore equal)

    ; "length" is in eax now
.L15:                           ; was greater than
    test eax, eax               ; length == 0?
    jg .L28                     ; N: continue looping
    lea rdx, [rsp+32648]        ; rdx = last
.L13:
    ... return value in rdx

.L14:                           ; was less than
    add rax, 1                  ; step + 1
    lea rdi, [rdx+4]            ; first = it + 1
    sub rcx, rax                ; length -= step + 1
    mov rax, rcx                ; rax = length
    jmp .L15

The compiler was able to understand this version of the comparison function: It observed that if a < b, then the result of compare3 is always negative, so it jumped straight to the less-than case. Otherwise, it observed that the result was zero if a is not greater than b and jumped straight to that case too. The compiler did have some room for improvement with the placement of the basic blocks, since there is an unconditional jump in the inner loop, but overall it did a pretty good job.

The last case is the inline assembly with compare4. As you might expect, the compiler had the most trouble with this.

    ; on entry, edi is the value to search for

    lea r8, [rsp-120]           ; r8 = first
    mov ecx, 8192               ; ecx = length
    jmp .L12
.L14:                           ; zero or positive
    je .L13                     ; zero - done
                                ; was greater than
    mov rcx, rdx                ; length = step
    test rcx, rcx               ; while (length > 0)
    jle .L22
.L12:
    mov rdx, rcx
    sar rdx, 1                  ; rdx = step = length / 2
    lea rsi, [r8+rdx*4]         ; it = first + step

    ; result = compare(*it, key), and then test the result
    mov eax, dword ptr [rsi]    ; eax = *it
    sub eax, edi
    jno 1f
    cmc
    rcr eax, 1
1:
    test eax, eax               ; less than zero?
    jne .L14                    ; N: Try zero or positive

                                ; was less than
    add rdx, 1                  ; step + 1
    lea r8, [rsi+4]             ; first = it + 1
    sub rcx, rdx                ; length -= step + 1
    test rcx, rcx               ; while (length > 0)
    jg .L12
.L22:
    lea rsi, [rsp+32648]        ; rsi = last
.L13:
    ... return value in rsi

This is pretty much the same as compare2: The compiler has no insight at all into the inline assembly, so it just dumps it into the code like a black box, and then once control exits the black box, it checks the sign in a fairly efficient way. But it had no real optimization opportunities because you can't really optimize inline assembly.

The conclusion of all this is that optimizing the instruction count in your finely-tuned comparison function is a fun little exercise, but it doesn't necessarily translate into real-world improvements. In our case, we focused on optimizing the code that encodes the result of the comparison without regard for how the caller is going to decode that result. The contract between the two functions is that one function needs to package some result, and the other function needs to unpack it. But we discovered that the more obtusely we wrote the code for the packing side, the less likely the compiler would be able to see how to optimize out the entire hassle of packing and unpacking in the first place. In the specific case of comparison functions, it means that you may want to return +1, 0, and -1 explicitly rather than calculating those values in a fancy way, because it turns out compilers are really good at optimizing "compare a constant with zero".

You have to see how your attempted optimizations fit into the bigger picture because you may have hyper-optimized one part of the solution to the point that it prevents deeper optimizations in other parts of the solution.

Bonus chatter: If the comparison function is not inlined, then all of these optimization opportunities disappear. But I personally wouldn't worry about it too much, because if the comparison function is not inlined, then the entire operation is going to be dominated by the function call overhead: setting up the registers for the call, making the call, returning from the call, testing the result, and most importantly, the lost register optimizations, both because the compiler can no longer enregister values across the call and because it has to protect against the possibility that the comparison function will mutate global state and thereby create aliasing issues.
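
To illustrate the inlining point with a small sketch of my own (not from the original benchmark): passing the comparison as a template parameter or lambda, the way binarySearch above does, gives the compiler a chance to inline it, whereas handing it over through a function pointer, as qsort does, generally does not.

#include <algorithm>
#include <cstdlib>
#include <vector>

// Comparison passed through a function pointer: each call is indirect,
// so the inlining-based optimizations discussed above are unavailable.
static int compareInts(const void* pa, const void* pb)
{
    int a = *static_cast<const int*>(pa);
    int b = *static_cast<const int*>(pb);
    return (a > b) - (a < b);
}

void sortBothWays(std::vector<int>& v)
{
    // Lambda: its type is visible at the call site, so the comparison
    // can be inlined into the instantiated sort.
    std::sort(v.begin(), v.end(), [](int a, int b) { return a < b; });

    // Function pointer via qsort: typically an opaque indirect call.
    std::qsort(v.data(), v.size(), sizeof(int), compareInts);
}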

Comments (43)
  1. Adrian says:

    Once again, it turns out readable code means good (and efficient) code. While I have no insight into how compilers evolved (nor have I used any pre-2010 compilers), this kind of optimization might have made sense 10 years ago, back when optimizers weren't as smart as they are nowadays?

  2. Mr Cranky says:

    Seems to me the obvious way to code this is:

    int compare0(int a, int b)
    {
    return (a-b);
    }

    Awfully long discussion that boils down to Dr. Knuth’s long-standing axiom that the root of all evil is premature optimization.

    1. And then it returns the wrong value for compare0(-1610612736, 1610612736).
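
       (Worked out, assuming 32-bit wraparound: -1610612736 - 1610612736 = -3221225472, which doesn't fit in an int and wraps to +1073741824, so the smaller number compares as "greater." Strictly speaking the overflow is undefined behavior.)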

      1. henke37 says:

        Given that int can be a short int, the call is wrong to begin with. Really, the function is safe if you know of the range limitations.

        1. Okay, let me rephrase it then: It returns the wrong value for compare0(INT_MIN, INT_MAX).

      2. Viila says:

        I wonder how the corrected version
        int64_t compare64(int32_t a, int32_t b)
        {
        return int64_t(a) - int64_t(b);
        }
        compares to the others, especially if compiled as x64 instead of x86.

        1. Joshua says:

          It probably doesn’t work. The prototype returns int.

          I can make it execute in zero cycles if it doesn’t have to work.

        2. No need to ask. Type it into gcc.godbolt.org and find out! (Note that gcc.godbolt.org doesn’t support 32-bit compilers so the extra cost of 64-bit arithmetic on 32-bit systems is not visible there.)

          1. Viila says:

            I can, but I’m afraid I don’t know how to properly interpret the results because I don’t speak x86/x64 assembly.

            (Joshua, the prototype was templated and the return value was assigned into an ‘auto’ variable, so it wouldn’t be truncated either. It should work.)

          2. ErikF says:

            If you look at the assembly results, you can hover over the instructions (or right-click and choose “View asm doc”) and you’ll get a brief synopsis of what the instruction does. Also, you can click on “Show optimization output (clang only)” to see what choices the optimizer made for each line of code (gcc has a similar option.) If you grok ARM(32/64), AVR, MIPS(32/64), MSP430, or PPC better, those compilers are available too, as well as a couple of other compilers (MSVC++, elicc, icc, and Zapcc). It’s pretty cool: I wish I had this when I first started out learning programming!

      3. Rick C says:

        I had a friend in college who wrote, as part of his first programming course, a version of a factorial program. If the caller passed in a number greater than whatever would result in integer overflow, it would return 0 and print out “I only work with small, non-negative integers.”

    2. Ben Adams says:

      “premature optimization is the root of all evil”

      Is very out of context; full quote is:

      “Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”

  3. creaothceann says:

    “@Raymond I’m forced to agree.”

    Heh.

  4. The dangers of being too clever.

  5. pmbAustin says:

    As compilers get smarter, the sins of “premature optimization” have worse and worse consequences.

    Always write the code to be understood by some human being that comes after you and has to maintain it.

    1. yukkuri says:

      Seriously. Let the compiler do its job and you do -your- job: writing maintainable code.

  6. Ted Spence says:

    Somebody once had a quote about “Premature Optimization”. I’m not sure who it was, maybe someone famous. /s

    Also: we tend to spend many times more hours reading our code rather than writing it. If you optimize for readability, chances are you have a simpler bit of code, and the compiler will be able to do more optimizations.

    If you write the world’s most clever code, chances are the poor intern who tries to fix a bug six years later will have trouble, and the compiler won’t like it either.

    Also: I had not seen this compiler explorer website before ( https://gcc.godbolt.org/ ). This is the greatest thing since compiled bread. If only I still wrote c++! I sometimes regret switching to C# and leaving the world of pointer math and insane templates behind.

    1. smf says:

      Premature optimisation is subjective: one person thought it was the right thing to do, and someone else said it was done too early. We assume the person who comes along later and calls it premature optimisation is right, but maybe at the time it was the right thing to do. Or maybe it wasn't, who knows.

      I’ve been optimising some 30 year old 6502 code, does that count as premature or not?

      1. Darran Rowe says:

        I always see the term premature optimisation as quite objective.
        What it means is an attempt at manual optimisation before you know whether what you are intending to optimise is really the bottleneck. So any time you look at code and think “that looks slow”, it is likely premature.
        So basically if you have tested and timed and after all this you find that you are in a certain code path a lot and that is causing you a lot of slow down, then that is the code that you should optimise.

        1. Ted Spence says:

          Agreed – everything is subjective until you measure it!

          Shorthand: Write code that is easy to read until you have a measurement.

  7. Kevin says:

    > But it had no real optimization opportunities because you can’t really optimize inline assembly.

    At first I found this confusing, but then I realized why: Inline assembly cannot be decompiled into an intermediate representation, because it does not follow the semantics of the C abstract machine. Even assuming you have a really good decompiler available, you cannot apply as-if rule transformations to decompiled inline assembly. You might end up altering the meaning of the assembly in ways that the C standard does not consider “observable behavior,” but which nonetheless leave the CPU or other components in a different state from what the programmer expected. While this might be legal because asm is implementation-defined, it is obviously unhelpful to the point that the average user would call it a bug. So the compiler is obligated to treat the assembly as a black box, as Raymond says.

    1. Paul Topping says:

      It is possible to optimize inline assembler but there’s not much reason a C/C++ compiler would provide that functionality. The inline assembler feature exists solely to allow the programmer to specify an instruction sequence exactly and expects the compiler not to change it.

      1. Joshua says:

        Correct. The only purpose left for asm blocks is to give the compiler information it cannot possibly have.

        About half the asm blocks I see don’t have any assembler code in them, just compiler hints (hey compiler, this empty asm block modifies this local variable).
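
        For example, a minimal sketch of that kind of hint block (assuming GCC's extended asm syntax; computeSomething is a stand-in):

        int computeSomething();   // stand-in for some real computation

        int useHint()
        {
            int x = computeSomething();
            // The asm body is empty; the "+r"(x) constraint tells the compiler
            // that x may have been modified here, so it cannot carry assumptions
            // about x's value across this point.
            __asm__ __volatile__("" : "+r"(x));
            return x;
        }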

      2. Jon says:

        The whole point of inline assembly is for when you want to do something the compiler doesn’t allow.

        Perhaps you are using an assembly instruction the compiler doesn’t allow. For example: CPUID, when it first came out.

        Or perhaps you have some heavily-optimized code that you’ve written taking into account caches and things that the compiler isn’t smart enough to optimize on its own.

        Or perhaps you’re doing something really weird where you need exact control over the instructions. For example, a popular open-source kernel has a copy-from-userspace macro that uses inline assembly, and emits special exception-handling instructions so that a segfault during the copy can be handled specially. Another example: detecting 386 vs 486 processor by emitting instructions that get processed differently on the two processors. A third example: An open source memory debugging tool runs a program in a sort-of-VM, with JIT compilation to add checks to all memory accesses, and it has a magic sequence of instructions that do nothing useful on a normal CPU but that precise sequence of instructions are detected by the JIT and allow you to pass information to the memory debugger.

        In ALL of those cases, there’s nothing sensible or helpful that the compiler’s optimizer can do, and if it touches the code at all then it’s likely to break it. So compilers do not try to “optimize” inline assembly, and this is the correct behaviour.

        In the cases where the compiler’s optimizer could do something useful, such as writing SSE code, then the opcodes are understood and supported by the compiler. In those cases, if you want the compiler to optimize for you, you could and should be using compiler intrinsics instead.
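
        For instance, a minimal sketch (assuming SSE and <xmmintrin.h>); because the compiler understands the intrinsic, it can schedule and register-allocate around it in a way it never could with an opaque asm block:

        #include <xmmintrin.h>

        // Adds four packed floats; _mm_add_ps compiles to ADDPS, but the
        // operation stays visible to the optimizer.
        __m128 addFour(__m128 a, __m128 b)
        {
            return _mm_add_ps(a, b);
        }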

  8. Paul Topping says:

    Good post! As you point out at the end, while modern compilers do a wonderful job of optimization, far beyond what most programmers would come up with given limited time and imagination, there are still reasons for programmers to look at what the compiler generated in order to do further optimization but the task has changed. Instead of trying to improve on the compiler’s code generation, the programmer should look for things that the source code implicitly asks to be computed and remove those that are not needed to achieve the program’s purpose. Perhaps some future compiler could output suggestions along this line.

  9. Roman says:

    I tried pasting your code into Compiler Explorer, but got different disassembly. Did you use any options other than -O3?

    1. Hm, the output is slightly different for compare1 now. The conclusions still stand, though.

  10. pm100 says:

    This discussion completely ignores what happens once the CPU gets hold of it. The CPU effectively runs another optimizer on the code, but at run time. I was surprised that you didn't actually time the execution rather than just reading the generated code. I totally agree that people are fixated on 'less lines means faster'.

    1. Antonio Rodríguez says:

      The object code for the different methods is very different. You can assume that three instructions will execute faster than nine instructions, so compare1 and compare3 will be faster without doubt. This can be false if something goes wrong (i.e., unexpected cache fail or pipeline flush), but in that case, any optimization is futile.

      1. Dmitry says:

        You can’t. You once could but not these days. Modern processors do a lot of runtime optimizations. Three instructions might generally show worse results than nine equivalent instructions due to pipelining issues (instructions pairing, for example). Besides, I once saw a (synthetic) test where a floating-point instruction in a separate procedure turned out to be faster when timed than the same instruction inlined (the call cost turned out to be negative) with processors of one major CPU vendor and slightly slower on processors of another one, even though the name of the second vendor was shorter than that of the first :-)

      2. smf says:

        >You can assume that three instructions will execute faster than nine instructions

        You can assume that, but that might not be a correct assumption. It would be nice if you could see the timing for each different cpu, but it’s a huge amount of work.

      3. Antonio Rodríguez says:

        I’d agree in case of a smaller difference (i.e., four/five instructions versus three). But in the case of nine to three, you’d have to go to the most extreme case for the nine instructions to be faster (nine simple register-to-register instructions without memory access which translate into a single micro-op each versus three register-to-memory with complex addressing modes which can translate to three or four micro-ops each).

        Both compare1 and compare3 are a memory (register indirect, without indexing) to register compare (two micro-ops) and two conditional branches (one micro-op each). A total of four micro-ops. I can’t imagine how a processor can optimize at run time nine instructions in less than four micro-ops, when the compiler hasn’t been able to do the same at compile time.

        You can argue about out-of-order execution, pipeline stalls and the like. But when you take into account that compare1 is a strict subset of compare2 (in other words, compare2 contains all three compare1 instructions plus six additional ones), it’s difficult to imagine a case where the longer sequence is less probable to get stalled.

      4. Patrick says:

        You can most definitely not assume that.

        Comparisons are actually one of the most common cases nowadays, since conditional jumps can really screw with branch prediction if the pattern isn't, well, predictable. In this case I don't think it really matters unless you can do the entire operation the comparison is used for without conditional branching, but there are some cases where it definitely does.

        For example, binary searches perform much worse than you’d expect from just looking at the instructions themselves, since by definition they branch in the worst possible pattern (0.5/0.5 taken/not taken).
        A linear search is actually faster for N < some surprisingly large number; the rule of thumb says around 8 or 10, depending on who you ask.

        And for some other, more traditional, examples: x86 has quite a few 'big' instructions that are actually slower than just doing the same thing on your own. The real classic here being LOOP, which no performance-aware assembly language programmer or compiler has ever used. Even last time I benched it on a modern system it was significantly slower than the equivalent DEC + JNZ (roughly 2x). I suppose the modern CPUs could very well optimize it to be just as fast but there's no need to because of this.
        (And DEC is in turn slower than SUB reg,1 on modern CPUs because of a historical design mistake which gives DEC a dependency on a flag which SUB lacks…)

  11. Bender says:

    For future posts you can use http://quick-bench.com/ for quick comparisons

  12. Bram Speeckaert says:

    The -120 comes from GCC taking advantage of the 128-byte red zone. The AMD64 ABI guarantees that the 128 bytes beyond whatever rsp points to will never get modified by interrupt handlers, so it’s okay to grow the stack pointer by slightly less than you actually need and just make the array start in that memory region.

  13. smf says:

    “with x86-64 gcc 7.2 and optimization -O3.”

    wait, what? godbolt supports msvc too, so why did you show us the result of gcc????

    1. Yeesh, I poke my head out of the bubble and you tell me to get back in the bubble. It’s not like this issue applies only to Microsoft compilers.

      1. Joshua says:

        I agree. Yeesh.

      2. smf says:

        My faux surprise was a fishing attempt in case there was some interesting news that you wouldn’t answer a direct question about.

        You can stay out of the bubble :D

      3. xcomcmdr says:

        I can already see the clickbait headline :
        “Breaking news – MSVC might be replaced by gcc, a Microsoft lead architect says.”

        (Oh God, what have I done…)

    2. ErikF says:

      Why not? Almost every compiler does things differently, and it’s instructive to see what the differences are (and why they’re done that way.) For example, it appears that clang prefers cmov instructions, which is a potential speed boost but is a problem if you have to support pre-P6 processors.

  14. Peter Doubleday says:

    My immediate thought was “if it isn’t inlined, the cost is dwarfed by stack operations.” I’m not specifically (C++ or C99, and let’s be honest, that’s the level of language we’re talking here) bothered about offering up a comparison function that “possibly” mutates global state. Well, I would be, if the compiler in question doesn’t respect the const qualifier. But then I’d have other worries…

    I love this post. I like the argument that “if you’re going to think about optimisation, think very carefully about it,” which is actually Knuth’s corollary. May I also recommend Joe Duffy’s thoughts on the matter to your readers?

    joeduffyblog.com/2010/09/06/the-premature-optimization-is-evil-myth/

Comments are closed.
