Non-classical processor behavior: How doing something can be faster than not doing it


Consider the following program:

#include <windows.h>
#include <stdlib.h>
#include <stdlib.h>
#include <stdio.h>

int array[10000];

int countthem(int boundary)
{
 int count = 0;
 for (int i = 0; i < 10000; i++) {
  if (array[i] < boundary) count++;
 }
 return count;
}

int __cdecl wmain(int, wchar_t **)
{
 for (int i = 0; i < 10000; i++) array[i] = rand() % 10;

 for (int boundary = 0; boundary <= 10; boundary++) {
  LARGE_INTEGER liStart, liEnd;
  QueryPerformanceCounter(&liStart);

  int count = 0;
  for (int iterations = 0; iterations < 100; iterations++) {
   count += countthem(boundary);
  }

  QueryPerformanceCounter(&liEnd);
  printf("count=%7d, time = %I64d\n",
         count, liEnd.QuadPart - liStart.QuadPart);
 }
 return 0;
}

The program generates a lot of random integers in the range 0..9 and then counts how many are less than 0, less than 1, less than 2, and so on. It also prints how long the operation took in QPC units. We don't really care how big a QPC unit is; we're just interested in the relative values. (We print the number of items found merely to verify that the result is close to the expected value of boundary * 100000.)

Here are the results:

boundary count time
0 0 1869  
1 100000 5482  
2 200800 8152  
3 300200 10180  
4 403100 11982  
5 497400 12092  
6 602900 11029  
7 700700 9235  
8 797500 7051  
9 902500 4537  
10 1000000 1864  

To the untrained eye, this chart is strange. Here's the naïve analysis:

When the boundary is zero, there is no incrementing at all, so the entire running time is just loop overhead. You can think of this as our control group. We can subtract 1869 from the running time of every column to remove the loop overhead costs. What remains is the cost of running count increment instructions.

The cost of a single increment operation is highly variable. At low boundary values, it is around 0.03 time units per increment. But at high boundary values, the cost drops to one tenth that.

What's even weirder is that once the count crosses 600,000, each addition of another 100,000 increment operations makes the code run faster, with the extreme case when the boundary value reaches 10, where we run faster than if we hadn't done any incrementing at all!

How can the running time of an increment instruction be negative?

The explanation for all this is that CPUs are more complicated than the naïve analysis realizes. We saw earlier that modern CPUs contain all sorts of hidden variables. Today's hidden variable is the branch predictor.

Executing a single CPU instruction takes multiple steps, and modern CPUs kick off multiple instructions in parallel, with each instruction at a different stage of execution, a technique known as pipelining.

Conditional branch instructions are bad for pipelining. Think about it: When a conditional branch instruction enters the pipeline, the CPU doesn't know whether the condition will be true when the instruction reaches the end of the pipeline. Therefore, it doesn't know what instruction to feed into the pipeline next.

Now, it could just sit there and let the pipeline sit idle until the branch/no-branch decision is made, at which point it now knows which instruction to feed into the pipeline next. But that wastes a lot of pipeline capacity, because it will take time for those new instructions to make it all the way through the pipeline and start doing productive work.

To avoid wasting time, the processor has an internal branch predictor which remembers the recent history of which conditional branches were taken and which were not taken. The fanciness of the branch predictor varies. Some processors merely assume that a branch will go the same way that it did the last time it was countered. Others keep complicated branch history and try to infer patterns (such as "the branch is taken every other time").

When a conditional branch is encountered, the branch predictor tells the processor which instructions to feed into the pipeline. If the branch prediction turns out to be correct, then we win! Execution continues without a pipeline stall.

But if the branch prediction turns out to be incorrect, then we lose! All of the instructions that were fed into the pipeline need to be recalled and their effects undone, and the processor has to go find the correct instructions and start feeding them into the pipeline.

Let's look at our little program again. When the boundary is 0, the result of the comparison is always false. Similarly, when the boundary is 10, the result is always true. In those cases, the branch predictor can reach 100% accuracy.

The worst case is when the boundary is 5. In that case, half of the time the comparison is true and half of the time the comparison is false. And since we have random data, fancy historical analysis doesn't help any. The predictor is going to be wrong half the time.

Here's a tweak to the program: Change the line

     if (array[i] < boundary) count++;

to

     count += (array[i] < boundary) ? 1 : 0;

This time, the results look like this:

boundary count time
0 0 2932  
1 100000 2931  
2 200800 2941  
3 300200 2931  
4 403100 2932  
5 497400 2932  
6 602900 2932  
7 700700 2999  
8 797500 2931  
9 902500 2932  
10 1000000 2931  

The execution time is now independent of the boundary value. That's because the optimizer was able to remove the branch from the ternary expression:

; on entry to the loop, ebx = boundary

    mov edx, offset array ; start at the beginning of the array
$LL3:
    xor ecx, ecx    ; start with zero
    cmp [edx], ebx  ; compare array[i] with boundary
    setl cl         ; if less than boundary, then set al = 1
    add eax, ecx    ; accumulate result in eax

    add edx, 4      ; loop until end of array
    cmp edx, offset array + 40000
    jl $LL3

Since there are no branching decisions in the inner loop aside from the loop counter, there is no need for a branch predictor to decide which way the comparison goes. The same code executes either way.

Exercise: Why are the counts exactly the same for both runs, even though the dataset is random?

Comments (53)
  1. Joshua says:

    <Insert babble about crypto RNG or lack thereof>

    Nah, it's deliberate to make program runs repeatable.

  2. Dave says:

    The RNG isn't initialized in your program, so it just uses a hard-coded default somewhere (in the C runtime?) for it's seed.  This results in the same output from the RNG for each run of the program.

  3. VinDuv says:

    The dataset stays the same between runs, because the rand seed is not initialized (srand is not called), thus it has the same (default) value at each run.

    Couldn’t the optimizer change the “if (…) x++;” into “x += …” automatically ?

  4. Henke37 says:

    VinDuv, the optimizer could do that if x is not volatile and it can deal with all side effects of the condition.

  5. Gabe says:

    What I find impressive is that it takes 1869 units to count 0 items and 1864 units to count 1000000 items. In other words, it took -4 time units to execute count++ 1000000 times!

    Of course it's not that hard to explain, but I'll avoid giving the answer in case you're discussing it in the next episode of "How doing something can be faster than not doing it".

    [I already pointed out that the running time of the increment instruction is negative in case 10. -Raymond]
  6. Joker_vD says:

    How does context switches affect results of QPC? It's not like whole wmain execution will fit into one quantum, so… do we actually measure the CPU time spent on wmain? Or the wall time?

  7. Antonio &#39;Grijan&#39; says:

    The counts are the same in both runs because the dataset isn't actually random. Calling the rand() function without setup will give you the same sequence of numbers every run, because it's based in a fixed seed (which, by the way, can be interesting for debugging and testing, because it reduces variance and produces predictable results). To get a truly pseudo-random sequence you must set a different seed every run by calling srand() with some kind of variable argument (such as the current time) at program startup.

  8. Mordachai says:

    Excellent article!  It's obvious in hind-sight, and I realized immediately it had to be an effect of the branch-predictor, but having the explanation removed the clouds & cobwebs.  So, thanks!

    Also – surprised at how much better the code is overall with the ternary instead of the branch.  I'll have to think about using such algorithms more often, if this pattern reappears in my code (as it definitely does sometimes).

  9. Engywuck says:

    Without reading your explanation I'd at first thought that the compiler was intelligent enough to notice that "all between 0 and x" is the same as "all minus all those between x+1 and 9 (always including the boundaries)" and therefore having the mirrored result. After reading your explanation I slapped my head because it should've been obvious…

  10. bcs says:

    Any compiler worth it's bytes will optimize the first version into the branch-less version.

  11. Adam Rosenfield says:

    @bcs: Agreed.  With optimizations turned on, any decent compiler will treat the two code snippets identically.  I tested with Clang and got these results for the generated assembly:

    At -O0 (no optimization), Clang generates a branch for the if/then version and a branchless cmovl instruction (conditional move if less) instruction for the ternary operator version.

    At -O1, it generates branchless bit twiddling code (cmpl, setl, movzbl, addl) for both versions of the source code.

    At -O2 and -O3, it generates a branchless version using vectorized SSE instructions.

  12. Jeffrey Bosboom says:

    There's an excellent diagnosis of the same effects in this Stack Overflow answer stackoverflow.com/…/11227902 , this time having to do with presorting the data increasing the branch predictor's accuracy.

  13. Raphael says:

    Notice how the second, branch-less version is more efficient… 8 times out of 10.

    If the developer know with a high enough degree of certainty that the branch will be taken either almost always or almost never, the version with a branch is actually a better bet.

  14. MItaly says:

    About the non initialized RNG, the C standard substantially says that there's always an "implicit" srand(1) before the start of the program.

  15. Evan says:

    Count me among the people who are surprised by the optimizer not transforming it into the branchless version (assuming optimization was enabled). Even many years ago I noticed that

      if (y == 0)

         x = 10;

      else

         x = 30;

    is compiled into a branchless version by MSVC & other compilers.

    The compiled code varied a bit by compiler of course, but it was in the spirit of

      cmp eax, 0  ; assume y is in eax; x is in ebx

      seteq bl    ; bl  == (y == 0) ?  1 : 0

      sub ebx, 1  ; ebx == (y == 0) ?  0 : -1 (0xFF..F)

      and ebx, 20 ; ebx == (y == 0) ?  0 : 20

      add ebx, 10 ; ebx == (y == 0) ? 10 : 30

    GCC now uses a cmov instead.

  16. Antonio 'Grijan' says:

    @Gabe: actually, both cases are running almost the same code. In a perfect world, they should give approximately the same units, yes. But we don't live in a perfect world (yet!), and profiling results are semi-random in nature. For example, even if the system does not count time spent in other threads, they can dirt the processor cache or page out some of our process' memory – and cache or paging faults in out thread *will* affect profiling, even when the cause lies outside.

  17. Joshua says:

    Hmm I don't know what setqu instruction is and a Google search for "x86 seteq" yielded no good hits. Care to enlighten us?

  18. Adam Rosenfield says:

    @Joshua: I suspect that Evan meant to write "sete" instead of "seteq".  "sete" is one of the class of setcc (set on condition code) instructions (pdos.csail.mit.edu/…/SETcc.htm) which sets the destination to a constant 0 or 1 depending on the values in the (e)flags register from the most recent comparison or arithmetic operation.  In the case of sete, it sets the destination to 1 if the last comparison resulted in equality.

  19. Daniel Neely says:

    Does MSVC's optimizer fail to figure out it could fix v1 to work like v2; or did you do a non-optimized build to create an interesting result with an easy to understand micro-bench?

    ["Compiler optimizations may have been carefully selected for expository purposes." -Raymond]
  20. Sebastien Lorion says:

    I was curious to see if .NET (4.5, VS 2013) was able to optimize for the second case and the answer is no, both versions have the behavior of the branching version. Here is the code:

    using System;

    using System.Diagnostics;

    namespace ConsoleApplication1

    {

     class Program

     {

       static void Main(string[] args)

       {

         var rnd = new Random();

         for (int i = 0; i < 10000; i++)

           array[i] = rnd.Next(10);

         var sw = new Stopwatch();

         for (int boundary = 0; boundary <= 10; boundary++)

         {

           sw.Restart();

           int count = 0;

           for (int iterations = 0; iterations < 100; iterations++)

           {

             count += countthem1(boundary);

           }

           sw.Stop();

           Console.WriteLine("1- count={0,-7}, time = {1}", count, sw.ElapsedTicks);

           sw.Restart();

           count = 0;

           for (int iterations = 0; iterations < 100; iterations++)

           {

             count += countthem2(boundary);

           }

           sw.Stop();

           Console.WriteLine("2- count={0,-7}, time = {1}", count, sw.ElapsedTicks);

           Console.WriteLine("———————————–");

         }

         Console.ReadLine();

       }

       static int[] array = new int[10000];

       static int countthem1(int boundary)

       {

         int count = 0;

         for (int i = 0; i < 10000; i++)

         {

           if (array[i] < boundary) count++;

         }

         return count;

       }

       static int countthem2(int boundary)

       {

         int count = 0;

         for (int i = 0; i < 10000; i++)

         {

           count += (array[i] < boundary) ? 1 : 0;

         }

         return count;

       }

     }

    }

  21. Muzer_ says:

    I've read enough of this type of article such that as soon as I saw the title, I thought "branch predictor". Glad I wasn't wrong ;)

  22. Joshua says:

    @Josh: It is standard. It's usually considered bad form to depend on 1 rather than not 0 but hey.

  23. Myria says:

    I'm wondering whether x86-32 platforms would benefit from using self-modifying code for resolving import thunks due to this problem.  Every call to an imported function is an indirect call, which branch predictors just love so much.  This is because import thunks are just a pointer variable that gets filled in with the actual address of the imported function or variable.  These pointers have names like __imp__SendMessageW@16.

    Calling with __declspec(dllimport) set up correctly looks like this:

    call dword ptr [__imp__SendMessageW@16]

    Calling without using __declspec(dllimport) being used results in the linker generating a stub to fix the problem, at a performance cost:

    call near _SendMessageW@16

    _SendMessageW@16:

    jmp dword ptr [__imp__SendMessageW@16]

    Either way, you end up with an indirect branch.

    However, what if we compiled as in the second case, without using __declspec(dllimport), but then at runtime did a series of machine code patches?  What if we replaced each of these linker-generated stubs with immediate branches like "jmp near _SendMessageW@16"?  x86-32 can do immediate branching to its entire address space.  (Unlike x86-64, which has the same range as x86-32 for direct branches; this means that it's limited to ±2 GB.)

    I don't have enough data to determine whether this is any faster.  You do two branches instead of one (comparing to dllimport), but they're both predictable because they're unconditional.  A solution in which there were a single call instead of a call plus a jmp is possible, but would require a custom linker, and moreover would guarantee that your executable's code pages would never be shared among processes.

    [I think the loss of code page sharing is the killer blow. Without code page sharing, memory usage would skyrocket. -Raymond]
  24. amroamroamro says:

    Looking at what looks like a normal distribution in the if/then version, can we say anything about the "fanciness" of the branch predictor on your particular processor? If it's a sophisticated one, can we come up with a specially crafted code to measure the length of the internal branch-history kept for making inferences?

  25. Josh says:

    I've been doing this for years, though I don't even bother with the ternary, I just do:

    count += array[i] < boundary;

    I *think* that's portable under the C standard, and I've never seen it fail, but I'm willing to be contradicted. I doubt it compiles any differently; the mildly optimized assembly from the example ternary is the direct assembly translation of my ternary free approach.

    Short summary of article: Branches suck. Try to avoid them in huge loops if you can get the correct behavior using branchless non-boolean binary operators.

  26. Klimax says:

    @Myria

    On modern(since Pentium II?) self-modifying code is considered very bad idea and is severely penalized. (pipelines are flushed for example) also it can't be used because code pages are marked as read-only + executable. (NX bit, would require explicit change in permissions; almost nobody does it)

  27. IanBoyd says:

    Eric Brumer [MSFT] gave an excellent presentation at Going Native 2013 that talks about optimizing for the branch predictor:

    September 6, 2013

    Compiler Confidential  

    channel9.msdn.com/…/Compiler-Confidential

    The amazing part, to me, is that he actually busts out a micro-architecture diagram of a single core of a Haswell CPU, and traces the execution of the assembly. (23m mark)

    He has four talks on Channel 9. His talks on automatic vectorization, cache misses, store buffers, hot memory, really taught me a lot. I really like his speaking and presentation style. And he doesn't cop-out with some hand-waving arguments; he actually explains things all the way down, and backs it all up with numbers.

  28. Myria says:

    @Klimax: This would occur exactly once, at module load time, not every time an imported function is called.  It would be no worse than the code patches that already happen due to relocation fixups, which, by the way, already have to disable write protection on executable pages.  (This is why Windows Store applications can still get away with self-modifying code if they want, though Microsoft wouldn't approve them—relocations require the ability to patch the loaded executable.  iOS can block this only because their module loading system is based upon position-independent code rather than code patching.)

  29. Neil says:

    Some more information on branch prediction in processors can be found here (pp11-36):

    http://www.agner.org/…/microarchitecture.pdf

    It also has a section on register renaming where it mentions that the xor cx, cx only costs a decode cycle, so moving it outside of the loop probably makes no difference.

  30. Tringi says:

    @IanBoyd

    Eric also did an excellent talk at Build 2014 on FMA, AVX and store buffer performance: channel9.msdn.com/…/4-587

    I highly recommend it.

  31. Adam says:

    You can persuade VS 2012 to vectorize this by tweaking the code. It's not ideal though because it obfuscates the code significantly, and the subtract could overflow. However the performance gains compared to not vectorizing it are significant.

    count -= (array[i] – boundary) >> 31;

  32. Thomas says:

    For me, GCC 4.8.1 on Windows produces the fastest code with -O1. -O1 emits setl/addl/movzbl whereas -O3 uses leal/cmovl/addq. The code with setl is about 110 times faster on my Core i7. (24 QPC vs. 2848 QPC)

  33. Crescens2k says:

    @Myria:

    Could Window Store applications really get away with code patching? Remember part of the executable is the relocations section.

    This gives Windows enough information it needs to detect what areas are not supposed to change, and what areas are supposed to change and by how much. I.e. the relocations are supposed to change by the difference between the original base address and the current address that the executable is loaded in at.

    For the whole thing of using machine code to patch the import thunks, I doubt it would make much of a difference. The reasoning for this is special, that is most of the execution will be done either in your executable module or in the library itself. You could probably find cases where the call is in a tight loop and it makes a difference, but in general you are more likely to be CPU bound in a tight loop doing stuff on a memory address, IO bound, or even waiting for user input. This of course is personal opinion, and I could very well be totally wrong.

  34. Tringi says:

    @Thomas

    Make sure you use also at least -march=nehalem parameter, or better, set precisely which i7 architecture you have. It very much affects choice of instructions being emitted.

  35. Evan says:

    @Klimax: "On modern(since Pentium II?) self-modifying code is considered very bad idea and is severely penalized. (pipelines are flushed for example) also it can't be used because code pages are marked as read-only + executable. (NX bit, would require explicit change in permissions; almost nobody does it)"

    But almost every program does it (actually pretty close to Myria's suggestion) on ELF systems. The ELF equivalent to the __imp__foo functions on x86 consist of the following sequence conceptually:

     L1: jmp L2

     L2: push <index>

         jmp resolver

    The resolver function uses the index pushed in L2 to determine what function is being called and its address in the shared object, and then patch the jump target at L1 to point there. Future calls to foo still transfer to L1, but then immediately jump to the target function. (That also explains the effective nop at L1; it makes it "easier" to patch.)

    That means that if you're using a shared library on Linux for example, you're using self-modifying code.

    Similar to Myria's suggestion, this is a one-time cost for each function. (Myria's was one time at load time.) Subsequent calls are probably faster than Windows-style indirect calls.

    JITing runtimes are another case of code where you have to deal with changing page permissions. It's not a big deal in that application because it's used to bring performance benefits through having a JIT.

    In short, yes, one should think carefully before doing something with self-modifying or dynamically generated code. But it's also applicable much more than you seem to think.

  36. Klimax says:

    Wrong. Not a self modifying code. It is more of bunch of jumps with runtime dependency, but it still may explain why PIC is so slow… Classical self-modifying code replaces whole instructions not a variable. What you are posting is classical regular code. (Just few more indirections)

    See Intel Optimization guide section 3.6.9 Mixing code and data

    "The aggressive prefetching and pre-decoding of instructions by Intel processors have two related effects:

    • Self-modifying code works correctly, according to the Intel architecture processor requirements, but

    incurs a significant performance penalty. Avoid self-modifying code if possible."

    So, not a good counterexample. Not what I was talking about. Completely different things… Next attempt at example?

  37. Klimax says:

    Hm, section 3.6.9.1:

    "3.6.9.1 Self-modifying Code

    Self-modifying code (SMC) that ran correctly on Pentium III processors and prior implementations will run

    correctly on subsequent implementations. SMC and cross-modifying code (when multiple processors in a

    multiprocessor system are writing to a code page) should be avoided when high performance is desired.

    Software should avoid writing to a code page in the same 1-KByte subpage that is being executed or

    fetching code in the same 2-KByte subpage of that is being written. In addition, sharing a page

    containing directly or speculatively executed code with another processor as a data page can trigger an

    SMC condition that causes the entire pipeline of the machine and the trace cache to be cleared. This is

    due to the self-modifying code condition.

    Dynamic code need not cause the SMC condition if the code written fills up a data page before that page

    is accessed as code. Dynamically-modified code (for example, from target fix-ups) is likely to suffer from

    the SMC condition and should be avoided where possible. Avoid the condition by introducing indirect

    branches and using data tables on data pages (not code pages) using register-indirect calls."

    Looks like I was too strict and implementation as you both refer to may satisfy SMC condition.

    Sorry, I may have been wrong on labeling code, but I was right about implications…

  38. Evan says:

    I'm not arguing that the fixup won't cause a performance hit when it occurs. What I'm saying is that the hit that the fixup code causes is a one-time cost (per function) and, once you incur that cost, the fixup-ed code may run faster than the Windows-style based on indirect jumps.

    How much faster? I don't know. Maybe it doesn't even run faster, though I'd bet a small amount that it would.) Is it enough to make up for the SMC hit? I don't know. But it's totally believable that, on the whole, it is. It's also believable that it isn't. Or that it depends on the CPU. All I'm arguing against is the categorical statement "SMC is bad don't use it." If the SMC buys you larger wins later in execution, then it's worthwhile.

    (Also, I think this choice of cross-module call implementation strategy is independent of PIC; off the top of my head, either the Windows-style indirect-jump strategy or the Linux-style fixup strategy could be used for both Windows-style non-PIC DLLs and Linux-style PIC SOs.)

  39. Sebastien Lorion says:

    @Adam: That's interesting! Do you have any idea why that version works and not the ones suggested by Raymond or Josh ?

  40. Thomas says:

    @Jan Ringoš: I tested both with and without -march=native -mtune=native. Didn't change the results.

  41. John Doe says:

    > count += (array[i] < boundary) ? 1 : 0;

    > … the optimizer was able to remove the branch from the ternary expression

    This is cheating.  Had it been something other than "1 : 0" or "0 : 1", the compiler would branch.  So, it's not branching itself you're avoiding, ? : is a branching operator in general.  

    The only portable way it wouldn't (shouldn't, it still may) branch would be relying on C boolean expressions yielding 0 for false and 1 for true, leaving only the comparison expression.

    You're relying on a compiler optimization.  The use of ? : is not a generalizable solution, it's actually very specific and unnecessary.

  42. Cheong says:

    I think this could affect SQL stored procedures as well.

    Recently, I was able to optimize a query using "cursor + IF statements to insert into temp table each line" around a few moderately large tables to version that uses subqueries with a few "CASE WHEN … ELSE … END" statements on select clause. The execution time reduced from around 1.5 minutes to 1 – 2 seconds, even if the execution plan show that my version actually created a large intermediate result set before the filtering in the end.

  43. Magnus says:

    I'm surprised that while many people have pointed out that the numbers aren't random because rand() starts with the same seed every time, no-one has pointed out that using rand() % 10 is bad style because the results aren't randomly distributed.  rand() is defined to return a number between 0 and RAND_MAX, which is usually 32767.  So a number from 0 to 7 is returned slightly more frequently than 8 and 9.  Yes it's a toy example but this is worth pointing out whenever you use random numbers in an educational article.

    When full speed optimisations are turned on, and you use C++ instead of C, the standard algorithms (std::count_if or std::accumulate) should always create code at least as good as an explicit loop.  It would be interesting to see how different compilers perform with this example.

    Finally, Josh at 13 Jun 2014 7:32 PM and Joshua at 13 Jun 2014 9:52 PM, I don't see why Josh's code is "bad form".  A 'bool' (in C++, and in C if it has it) is defined to be either 0 or 1, while a numeric is defined to return true if it is non-zero.  In fact there is a common idiom to convert a value into an explicit 1 or 0: { int b = !!x; }.  If x is 0, this will return 0.  If x is non-zero, it will return 1.

  44. Csaboka says:

    John Doe:

    > This is cheating.  Had it been something other than "1 : 0" or "0 : 1", the compiler would branch.

    Not necessarily. It is possible to compile a branchless evaluation of the ternary operator for any two constants around the colon. For example, this is how I would write assembly code for:

    count += (array[i] < boundary) ? 42 : 0;

    (using the assembly in the article as a starting point)

       mov edx, offset array ; start at the beginning of the array

    $LL3:

       xor ecx, ecx    ; start with zero

       cmp [edx], ebx  ; compare array[i] with boundary

       setnl cl        ; if less than boundary, then set ecx = 0, otherwise ecx=1

       dec ecx         ; if less than boundary, then ecx = -1, otherwise ecx=0

       or ecx, 42      ; if less than boundary, then ecx = 42, otherwise ecx=0

       add eax, ecx    ; accumulate result in eax

       add edx, 4      ; loop until end of array

       cmp edx, offset array + 40000

       jl $LL3

    If one of the values isn't zero, you can still use this trick with the difference of the two values, then add the smaller value unconditionally at the end. In other words, the compiler is only forced to branch if at least one expression around the colon has side effects, and as such, they cannot both be evaluated without violating the spec.

    Does the VS compiler do the trick I just described? I don't know, but I certainly hope so. The extra two instructions compared to the "1 : 0" case are still a lot cheaper than a branch.

  45. Klimax says:

    @Evan:

    You can't use it in Windows. Windows automatically mark any page with code as read-only + executable and thus no write into it is permitted. You'd have to have code with sufficient privilege to change permissions. (Loader IIRC does fix ups and then changes protection) And frankly, it is security nightmare.

    In  short: Use of SMC is not only strongly discouraged, but on Windows by default not possible at all. (Wihtout explicit change by VirtualProtect)

    Details on memory protection options: msdn.microsoft.com/…/aa366786(v=vs.85).aspx

  46. acq says:

    Csaboka, I've checked the output of VC 2010: it used something very much like your trick on 32-bits (but without making the error in the operator, "and" not "or") and more than that it unrolled the loop a little. On 64-bits it simply used cmovl. Very nice.

  47. Matt says:

    @Klimax: "You'd have to have code with sufficient privilege to change permissions".

    Every Windows process implicitly has PROCESS_VM_OPERATION permission on its own process. And loader relocation fixups are done in usermode (you mark the RX pages RW to do the fixup, then mark it back RX after you're done).

    Also, you can have sections in your binary that are RWX. The permissions are decided by the binary, not by Windows.

  48. Mike Dimmick says:

    @Joker_vD: QueryPerformanceCounter time incorporates any context switches. QPC is implemented using whatever hardware timer is available: the lowest cost timer available (in terms of clock cycles required to obtain it) which will produce consistent results.

    Historically it was implemented using the rdtsc instruction, but that gave incorrect results if different CPUs ticked at different rates or had been parked, and the thread migrated between different processors. It is now usually implemented with a separate hardware timer, if rdtsc cannot be relied upon. For details, see msdn.microsoft.com/…/dn553408(v=vs.85).aspx .

  49. Csaboka says:

    acq:

    > but without making the error in the operator, "and" not "or"

    D'oh! You are absolutely correct. Looks like my assembly skills are rusty after years of Java work…

    > On 64-bits it simply used cmovl.

    Yeah, CMOVcc is nice, but AFAIK it wasn't guaranteed to be present for quite a while on x86 CPUs, so people tried not to depend on them.

  50. smf says:

    @Myria

    "which, by the way, already have to disable write protection on executable pages.  (This is why Windows Store applications can still get away with self-modifying code if they want, though Microsoft wouldn't approve them—relocations require the ability to patch the loaded executable.  iOS can block this only because their module loading system is based upon position-independent code rather than code patching.)"

    It's nothing to do with relocations. On all operating systems with write protected code the image is loaded then it's relocated and finally it's write protected. Windows Store applications use a JIT compiler, which has to allocate ram, write the program to it and then execute it all the time. It's theoretically possible for pages to me marked as write protected after they are compiled, but I don't know if Microsoft do this as it would have an overhead (if the function you compiled doesn't exactly fill a page then you're wasting ram).

    iOS blocks it because they want every piece of code that is executed to be available to them when the application is submitted (an interpreter or script is also not allowed on iOS but they can't stop that).

  51. Evan says:

    @John Doe: "This is cheating.  Had it been something other than "1 : 0" or "0 : 1", the compiler would branch.  So, it's not branching itself you're avoiding, ? : is a branching operator in general. "

    No, not true. I even gave an example upthread about the compiler optimizing "if (…) x=10; else x=20;" into a branchless version, and compilers now generate code that's even better than when I first saw that. Now, if you stick a function call into one of the branches it won't be able to do anything*, but there is a wide range of values you could put in the branches. Perhaps any constant. Perhaps anything that doesn't require actual additional computation. Perhaps even things that *do* require additional computation if the compiler feels it would be more beneficial to compute it unconditionally and ignore the result if the condition is false.

    Obviously you won't avoid branches in many cases, but you'll avoid branches in more than you think.

    (* On x86: some other architectures, e.g. ARM, support predicating lots of instructions. ARM supports predicating calls. It would be totally possible (and believable it would be beneficial!) to compile "if (x==0) foo(); else bar();" to branchless object code (that would be "sub r0, 0; bleq foo; blne bar" if x is in r0), though at least the version of GCC I tried did not actually do so.)

    [What if foo returns with Z clear? -Raymond]

    @Klimax: "You can't use it in Windows. Windows automatically mark any page with code as read-only + executable and thus no write into it is permitted. You'd have to have code with sufficient privilege to change permissions. (Loader IIRC does fix ups and then changes protection)"

    Here's my summary of that post: "You can't do it. Here's how you do it."

    Remember: mucking about with page permissions to support SMC or other dynamically-generated code isn't rare. Every .NET program will do it. Every Java program will do it. Every program in any of the other several JITed languages will do it. *Every* program that loads DLLs is at least prepared to do it in case there is an address space conflict and the loader has to rebase the DLL. On many *nix systems, basically *every* program does it at function-call time in a fixup.

    Changing page protections isn't particularly rare, or an edge case. It's pretty common.

  52. carbon twelve says:

    @Sebastien Lorion:

    The way .NET works means that you can't be rid of branching by using this transformation.

    In C++, the array is static and of fixed size from compile-time and throughout the app's entire execution; whereas the C# array can be swapped out for another array of different size at any point, for example — on top of that, unlike in C++, the behaviour of going beyond the range of the array is defined, so whatever you do you end up with a few jumps to verify that you're within the array bounds when you're using it.

    I think that .NET Native makes some promises to make inroads into this problem with better static analysis.

    Also, I note that when I run your code on my machine with the x64 JIT, the "branchless" expression is implemented with a branch anyway — which now that I read your comment again is exactly what you found so sorry for being redundant.

  53. John Doe says:

    @Evan, point taken.

    On ARM, a mispredicted or not predicted branch, be it a conditional call or whatever, will still flush the pipeline, there's no silver bullet here.

Comments are closed.