Optimization is often counter-intuitive


Anybody who’s done intensive optimization knows that optimization is often counter-intuitive. Things you think would be faster often aren’t.

Consider, for example, the exercise of obtaining the current instruction pointer. There’s the naïve solution:

#include <intrin.h> // for _ReturnAddress

__declspec(noinline)
void *GetCurrentAddress()
{
  return _ReturnAddress();
}

...
void *currentInstruction = GetCurrentAddress();

If you look at the disassembly, you’ll get something like this:

GetCurrentAddress:
    mov eax, [esp]
    ret

...
    call GetCurrentAddress
    mov [currentInstruction], eax

“Pah,” you say to yourself, “look at how inefficient that is. I can reduce that to two instructions. Watch:

void *currentInstruction;
__asm {
call L1                    ; pushes the address of L1
L1: pop currentInstruction ; and pops it right back off
}

That’s half the instruction count of your bloated girly-code.”

But if you sit down and race the two code sequences, you’ll find that the function-call version is faster by a factor of two! How can that be?

The reason is the “hidden variables” inside the processor. All modern processors contain much more state than you can see from the instruction sequence. There are TLBs, L1 and L2 caches, all sorts of stuff that you can’t see. The hidden variable that is important here is the return address predictor.

The more recent Pentium (and I believe also Athlon) processors maintain an internal stack that is updated by each CALL and RET instruction. When a CALL is executed, the return address is pushed both onto the real stack (the one that the ESP register points to) as well as to the internal return address predictor stack; a RET instruction pops the top address of the return address predictor stack as well as the real stack.

The return address predictor stack is used when the processor decodes a RET instruction. It looks at the top of the return address predictor stack and says, “I bet that RET instruction is going to return to that address.” It then speculatively executes the instructions at that address. Since programs rarely fiddle with return addresses on the stack, these predictions tend to be highly accurate.

That’s why the “optimization” turns out to be slower. Let’s say that at the point of the CALL L1 instruction, the return address predictor stack looks like this:

Return address
predictor stack:
  caller1 -> caller2 -> caller3 -> ...
Actual stack:   caller1 -> caller2 -> caller3 -> ...

Here, caller1 is the function’s caller, caller2 is the function’s caller’s caller, and so on. So far, the return address predictor stack is right on target. (I’ve drawn the actual stack below the return address predictor stack so you can see that they match.)

Now you execute the CALL instruction. The return address predictor stack and the actual stack now look like this:

Return address
predictor stack:
  L1 -> caller1 -> caller2 -> caller3 -> ...
Actual stack:   L1 -> caller1 -> caller2 -> caller3 -> ...

But instead of executing a RET instruction, you pop off the return address. This removes it from the actual stack, but doesn’t remove it from the return address predictor stack.

Return address
predictor stack:
  L1 -> caller1 -> caller2 -> caller3 -> ...
Actual stack:   caller1 -> caller2 -> caller3 -> caller4 -> ...

I think you can see where this is going.

Eventually your function returns. The processor decodes your RET instruction and looks at the return address predictor stack and says, “My predictor stack says that this RET is going to return to L1. I will begin speculatively executing there.”

But oh no, the value on the top of the real stack isn’t L1 at all. It’s caller1. The processor’s return address predictor predicted incorrectly, and it ended up wasting its time studying the wrong code!

The effects of this bad guess don’t end there. After the RET instruction, the return address predictor stack looks like this:

Return address
predictor stack:
  caller1 -> caller2 -> caller3 -> ...
Actual stack:   caller2 -> caller3 -> caller4 -> ...

Eventually your caller returns. Again, the processor consults its return address predictor stack and speculatively executes at caller1. But that’s not where you’re returning to. You’re really returning to caller2.

And so on. By mismatching the CALL and RET instructions, you managed to cause every single return address prediction on the stack to be wrong. Notice in the diagram that (barring further games with the return address predictor stack of the sort that created the problem in the first place) not a single predicted return address matches up with an actual return address.

Your peephole optimization has proven to be shortsighted.

Some processors expose this predictor more explicitly. The Alpha AXP, for example, has several types of control flow instructions, all of which have the same logical effect but which hint to the processor how it should maintain its internal predictor stack. For example, the BR instruction says, “Jump to this address, but do not push the old address onto the predictor stack.” The JSR instruction, on the other hand, says, “Jump to this address, and push the old address onto the predictor stack.” There is also a RET instruction that says, “Jump to this address, and pop an address from the predictor stack.” (There’s also a fourth type that isn’t used much.)

Moral of the story: Just because something looks better doesn’t mean that it necessarily is better.

Comments (25)
  1. Eric Duran says:

    Second moral of the story: Leave optimization to Raymond Chen and his team of rocket scientists so we, developers, can focus on higher abstraction layers. Thumbs up for .NET and J2EE.

  2. Ben Hutchings says:

    Silly Intel processors. That should be just:

    lea (next,pc),a0
    next:

    Quite why you’d need to do this, though, I don’t know. Normally you can just use absolute addresses and let the linker and loader adjust them as necessary. It might be useful for debug tracing, but then you’d normally want to get the caller’s address, not your own, and it’s probably not performance-critical anyway.

  3. T says:

    Ben,

    http://www.insecure.org/stf/smashstack.txt

    Classic article, BTW. There’s a bit where they want to get the address of some data. Since the jmp and call are both pc-relative, the code is position independent, which it needs to be.

    jmp call_insn
    popl_insn:
    popl %esi
    ; esi points to the data
    call_insn:
    call popl_insn
    data:

  4. My first thought on reading this was "why not just mov currentip, eip". The answer is "you can’t do that on the x86; eip is hidden". Then I ask myself, "self, then what about mov currentip, $?" Then I realize that $ is just a cheap trick of syntax, to explicitly make it use rel form, and probably only works on the LHS of a mov, or inside of brackets.

    So, this post boils down to "x86 silly!"

  5. Eric Lippert says:

    This lesson applies even more to "virtual machine" worlds, because there are way more assumptions that clever machine implementors can make about likely code paths.

    For example, "everyone knows" that you should never do this

    for i = 1 to array.length



    next

    Because of course that recalculates collection.length every time, even if it’s unlikely to change. Instead, you should do

    mylength = array.length

    for i = 1 to mylength



    next

    But in the .NET world, the compiler that generates the IL and the jitter that jits the IL can be very smart when analyzing the original program, recognizing this common pattern, and emitting code that more optimally enumerates an array.

    By trying to prematurely optimize the loop, there are situations in which you can actually make it worse.

  6. T: I was confining my thinking to legitimate uses – and ones that are performance-critical!

  7. michaels says:

    Another reason this particular optimisation is slow is that you’ve done it in C (or C++) with embedded assembly … AFAIR the compiler won’t generate optimal code around your smart segment there, hence causing the slowdown.

    and why does your inline asm use "call" ?

    Couldn’t you just do:

    __asm {
    iwanteip:
        lea eax, iwanteip
        ret
    }

    Eric is right about optimisation in JIT’d languages though … especially ones like java’s "HotSpot".

  8. I found an interesting issue when I was working on MSLU, where I wanted to make sure that if somebody called the wrapper APIs and they were on NT that it would call right through to the operating system. My initial attempt was:

    int __stdcall
    AddFontResourceW(LPCWSTR lpszFilename)
    {
        if(IS_ON_NT())
        {
            __asm jmp AddFontResourceW

    but it turned out that 500+ APIs like this were not only slower but also much fatter in the final DLL size than

    int __stdcall
    AddFontResourceW(LPCWSTR lpszFilename)
    {
        if(IS_ON_NT())
        {
            return AddFontResourceW(lpszFilename);

    It made me almost retire from trying to be clever with assembly language, wondering how a jmp could be slower. Until I looked at the disassembly and noticed that it was being optimized to the jmp anyway, but the ASM caused other optimizations to not happen.

  9. Raymond Chen says:

    "Couldn’t you just do ‘lea eax, iwanteip’" – Sure you can do that, but the presence of "ret" suggests that you’re writing the implementation of GetReturnAddress(), which returns not the return address but rather &GetReturnAddress itself, which isn’t very useful.

    If you meant it to be inlined into the caller function, then you run into the problem of inline assembly disabling a large class of optimizations…

  10. Zahical says:

    "Hidden Variables". Very nice :-)

    Are you now into Quantum mechanics, Raymond?

  11. T says:

    On x86-64, I think you can do

    lea rax, [rip]

    Also, RIP relative is the default addressing mode – it’s shorter than encoding 64 bit absolute addresses in the instruction, so presumably the loader won’t have to do many relocations compared to x86.

    That’s actually quite an advantage, because it means that pages containing code can be shared between two processes, even if they are loaded at different addresses.

  12. Keith Sader says:

    Wow, reading this reminds me of Michael Abrash, who once said – ‘Measure first, then optimize’ – The Zen of Code Optimization

  13. Marcel says:

    Thanks a lot Raymond. I’ve written a 68k emulator which was originally targeted at the 486 and therefore was written in pure assembler with every trick in the books to save a few cycles. Many of those optimizations have been removed in due time, but the main loop still looked something like this:

    Loop: PUSH ReturnAddress
          Decode command
          JMP Command

    Commands return with RTS. Most of the time ReturnAddress pointed to the loop, but if e.g. an interrupt had to be simulated it pointed to the IRQ handling code. With your entry you got me thinking and lo and behold, the new version

    Loop: Decode command
          CALL command
          JMP ReturnAddress

    increases the overall speed on a P4 by 30%! Pretty amazing for a less than 10 bytes change…

  14. Anonymous says:

    Gerd Riesselmann » Optimization, Templates And The Rest

  15. Re: Eric Lippert:

    The problem is that the optimization isn’t just an optimization, it’s a clarification in semantics. If the compiler can’t prove that array.length won’t change in the body of the loop, it has to assume it may and cannot make the optimization you’re suggesting.

    What’s more interesting is that actually, based on GrantRi’s blog, it’s safer to always do the length capture since the newer optimizers realize when it would be better to re-read it as long as the value can’t be seen to be changed on the current thread.

    Really, this is why a "for each" type of construct is really best which makes the question unambiguous.

    (What’s interesting is when the optimizer starts to be free to /not/ capture values when asked to, implementing the necessary logic for capturing client supplied data when crossing security boundaries gets hard.)

    My first exposure to Raymond’s general principle was back at Digital working on the VAX. Too much time was being spent in the heap allocator so I was trying to optimize the search of the free list when the allocation was too large to be on one of the lookaside lists. I used assembly language to make the function shorter, use fewer registers (and therefore preserve fewer registers on calls) etc. etc. and my implementation ended up being like twice as slow as what the VAX C compiler generated.

    I gave up trying to outguess the Digital optimizers then. The MSFT optimizers are almost to that point now and between LTCG and a new generation of optimizers coming up, it’s even more important to specify the semantics you want and trust the optimizer.

    (IMO, the next generation of problem is going to be code size optimizations due to proliferation of templates/generics…)

  16. michaels says:

    > Really, this is why a "for each" type of
    > construct is really best which makes the
    > question unambiguous.

    Which question ? Whether the runtime needs to re-read the "length" variable each time ?

    Of course it’s still possible to modify the structure while in a "for each" loop. Or have I misunderstood you ?

  17. Norman Diamond says:

    12/18/2004 8:31 PM Michael Grier [MSFT]

    > I used assembly language to make the

    > function shorter, use fewer registers (and

    > therefore preserve fewer registers on calls)

    > etc. etc. and my implementation ended up

    > being like twice as slow as what the VAX C

    > compiler generated.

    Did you find out why? I’ve read that some of the emulated instructions were slower to execute using the built-in hardware emulator than using a hand-coded sequence of simpler and faster instructions. If you were coding before that discovery (if that’s possible), and if the VAX C compiler generated simpler code that just happened to be faster because of that reason, then you don’t need to worry about outguessing. Otherwise I’d be studying the generated code to see what I had done wrong.

    The second-most-recent time I had to code assembly language was log2 and exp2 functions for a device driver for x86 Linux. gcc can be told to avoid using floating point instructions, but Linux kernel experts said that would be a bizarre approach. It turns out that actual floating point operations can be used in the x86 Linux kernel if preceded by some special macro (in C/gcc/Linux kernel) and followed by some other special macro (same situation). So then I could either call C’s built-in functions for log and exp and do the appropriate scaling, or write my own assembly language. But even though I had not told gcc to avoid using floating point instructions or to use an emulator or anything like that, it generated function calls. The x86 assembly language for log2 is a single instruction, though exp2 requires a ton of scaling before and after its central instruction. So I coded asm’s. I sure hope I wasn’t slower than gcc’s function calls.

    But then we had to port it to MIPS. Floating point in the MIPS Linux kernel is never allowed. Fortunately we could precompute the results for a known set of values and dispense with run-time floating point entirely.

    I was afraid I was going to have to repeat that with trig functions, but they turned out to be unnecessary.

    (As for why that had to be done in the device driver, one famous foreign government prohibits exposing certain functionality to control by end users.)

    The third-most-recent time I had to code assembly language was also for x86. That time I grabbed a couple of crutches, unnecessary but very convenient: I looked at the assembly language generated by the compiler, figured out how to cut out about three quarters of the instructions, and the result really was four times as fast. That was about 20 years ago though.

    (The absolute most recent time I had to code assembly language was for a Hitachi chip. Hitachi’s C and C++ compilers don’t provide all of the necessary choices regarding which registers to push and pop in interrupt handlers, so I didn’t have any competition for speed on that one ^_^)

  18. Sometimes it’s all about paging.

  19. Mike Swaim says:

    for i = 1 to array.length



    next

    is only faster for arrays. If you use an ArrayList, then it’s faster to cache the length in a stack variable. (And a for… statement’s better than foreach, because foreach creates a hidden iterator.)

    Way back in the ’80s Creative Computing had a column by one of the names in game design. One of his rules was to precompute everything (especially bitmaps that’d have to be shifted) because memory was fast, and processors were slow.

  20. Optimizing C code is a bit like designing a web page – you never know what kind of browser the page will be viewed with. You may compile your C code on a different platform. Same is true for assembly code. Much of the old 8086 code ran slower on the 286 because of "optimizations" that used the clock cycle info of the processor the code was written for.

    The classic optimization problem in C can be demonstrated by comparing recursive code to iterative code. For example, write the factorial function using recursion, then compare that to the factorial function using iteration. The recursive function only has 3 or 4 lines of code. You’d think that it was WAY faster than the iterative code, which is several lines longer and also has a loop variable. What a surprise for the CS freshman!

  21. Tom Canham says:

    "Optimize late"

    Why is that so hard? I think sometimes that we coders are more eager to show off our "l33t" CS101 skills and hand-code some ASM than to trust that Someone Else (eg, compiler writers) Might Actually Know What They’re Doing.

    Of course, once you’ve written your naive version, profiled it, and determined exactly where the bottlenecks are — that’s a completely different situation. Go to town, with the blessings of all on this thread :)

  22. Converting the file as we read it is taking a lot of time.

Comments are closed.