The history of calling conventions, part 4: ia64

The ia64 architecture (Itanium) and the AMD64 architecture are comparatively new, so it is unlikely that many of you have had to deal with their calling conventions, but I include them in this series because, who knows, you may end up buying one someday.

Intel provides the Intel® Itanium® Architecture Software Developer's Manual, which you can read to get extraordinarily detailed information on the instruction set and processor architecture. I'm going to describe just enough to explain the calling convention.

The Itanium has 128 integer registers, 32 of which (r0 through r31) are global and do not participate in function calls. At entry, a function declares to the processor how many of the remaining 96 registers it wants for its own use (the "local region", whose first few registers receive its parameters) and how many it will use to pass parameters to other functions (the "output registers").

For example, suppose a function takes two parameters, requires four registers for local variables, and calls a function that takes three parameters. (If it calls more than one function, take the maximum number of parameters used by any called function.) It would then declare at function entry that it wants six registers in its local region (numbered r32 through r37) and three output registers (numbered r38, r39 and r40). Registers r41 through r127 are off-limits.

Note to pedants: This isn't actually how it works, I know. But it's much easier to explain this way.
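The register arithmetic in the example above can be sketched as a small Python function. This is a minimal sketch of the simplified model just described (not the real hardware mechanism, per the note to pedants); the function name `frame_layout` is hypothetical.

```python
FIRST_STACKED = 32   # r0-r31 are global; the stacked registers start at r32
LAST_REGISTER = 127


def frame_layout(num_inputs, num_locals, max_outgoing_args):
    """Return (local_region, output_registers) as lists of register names.

    Simplified model: the local region holds the incoming parameters first,
    then the locals; the output registers follow immediately after it.
    """
    local_size = num_inputs + num_locals
    total = local_size + max_outgoing_args
    if FIRST_STACKED + total - 1 > LAST_REGISTER:
        raise ValueError("frame needs more than 96 stacked registers")
    local_region = [f"r{FIRST_STACKED + i}" for i in range(local_size)]
    outputs = [f"r{FIRST_STACKED + local_size + i}" for i in range(max_outgoing_args)]
    return local_region, outputs


# The example from the text: two parameters, four locals, a callee taking three.
local_region, outputs = frame_layout(2, 4, 3)
print(local_region)  # ['r32', 'r33', 'r34', 'r35', 'r36', 'r37']
print(outputs)       # ['r38', 'r39', 'r40']
```

Everything past the last output register (here r41 through r127) is simply not part of the frame.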

When the function wants to call that child function, it puts the first parameter in r38, the second in r39, and the third in r40, then calls the function. The processor shifts the caller's output registers so they act as the input registers for the called function: r38 moves to r32, r39 moves to r33, and r40 moves to r34. The caller's local region, r32 through r37, is saved in a separate register stack, different from the "stack" pointed to by the sp register. (In reality, of course, these "spills" are deferred, in the same way that SPARC register windows don't spill until needed. Actually, you can look at the whole ia64 parameter passing convention as the same as SPARC register windows, just with variable-sized windows!)

When the called function returns, the registers move back to their previous positions and the original values of r32 through r37 are restored from the register stack.
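Here is a toy Python model of that call/return shuffle, assuming the same simplified picture: only the stacked registers are modeled, spills happen eagerly (the real processor defers them), and the class and method names are made up for illustration.

```python
class RegisterWindows:
    """Toy model of the simplified register windows described above."""

    def __init__(self, num_inputs=0):
        self.regs = [0] * num_inputs   # regs[0] is r32 in the current frame
        self.local_size = num_inputs
        self.register_stack = []       # saved (local values, output count)

    def alloc(self, local_size, output_count):
        """Declare the frame shape; incoming parameters stay at r32 upward."""
        needed = local_size + output_count
        self.regs = (self.regs + [0] * needed)[:needed]
        self.local_size = local_size

    def _index(self, reg):
        i = reg - 32
        if not 0 <= i < len(self.regs):
            raise IndexError(f"r{reg} is off-limits in this frame")
        return i

    def write(self, reg, value):
        self.regs[self._index(reg)] = value

    def read(self, reg):
        return self.regs[self._index(reg)]

    def call(self):
        """The caller's output registers become the callee's r32, r33, ..."""
        local_values = self.regs[:self.local_size]
        outputs = self.regs[self.local_size:]
        self.register_stack.append((local_values, len(outputs)))
        self.regs = outputs
        self.local_size = len(outputs)

    def ret(self):
        """Restore the caller's local region; outputs come back as scratch."""
        local_values, output_count = self.register_stack.pop()
        self.regs = local_values + [0] * output_count
        self.local_size = len(local_values)


rf = RegisterWindows(num_inputs=2)
rf.write(32, "a")
rf.write(33, "b")                    # the function's own two parameters
rf.alloc(local_size=6, output_count=3)
rf.write(37, "local")
rf.write(38, 1)
rf.write(39, 2)
rf.write(40, 3)                      # parameters for the child function
rf.call()
print(rf.read(32), rf.read(33), rf.read(34))  # 1 2 3
rf.ret()
print(rf.read(32), rf.read(37))               # a local
```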

This creates some surprising answers to the traditional questions about calling conventions.

What registers are preserved across calls? Everything in your local region (since it is automatically pushed and popped by the processor).

What registers contain parameters? Well, they go into the output registers of the caller, which vary depending on how many registers the caller needs in its local region, but the callee always sees them as r32, r33, etc.

Who cleans the parameters from the stack? Nobody. The parameters aren't on the stack to begin with.

What register contains the return value? Well, that's kind of tricky. Since the caller's registers aren't accessible from the called function, you'd think that it would be impossible to pass a value back! That's where the 32 global registers come in. One of the global registers (r8, as I recall) is nominated as the "return value register". Since global registers don't participate in the register window magic, a value stored there stays there across the function call transition and the function return transition.
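A toy sketch of why the return value has to go through a global register: in this model (hypothetical `Machine` class, r8 as the return-value register per the paragraph above), each call gets a private window that vanishes on return, while the globals are shared.

```python
class Machine:
    """Toy model: r0-r31 are global; each call gets private stacked registers."""

    RETURN_VALUE = 8  # r8, the return-value register mentioned above

    def __init__(self):
        self.globals = [0] * 32

    def call(self, func, *args):
        window = list(args)     # the callee sees its arguments as r32, r33, ...
        func(self, window)      # the callee may use its window and the globals
        # The callee's window vanishes here; only the globals survive.
        return self.globals[self.RETURN_VALUE]


def add(machine, window):
    window.append(window[0] + window[1])               # a local, say r34
    machine.globals[Machine.RETURN_VALUE] = window[2]  # publish via r8


m = Machine()
print(m.call(add, 40, 2))  # 42
```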

The return address is typically stored in one of the registers in the local region. This has the neat side-effect that a buffer overflow of a stack variable cannot overwrite a return address since the return address isn't kept on the stack in the first place. It's kept in the local region, which gets spilled onto the register stack, a chunk of memory separate from the stack.
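The difference can be sketched with two toy frames in Python. Both are illustrative models, not real memory layouts: the return address value is arbitrary, and the "unchecked copy" stands in for something like strcpy.

```python
import struct

RETURN_ADDRESS = 0x00401000  # arbitrary illustrative value


def conventional_frame(payload, buf_size=8):
    """x86-style sketch: the buffer sits directly below the return address."""
    stack = bytearray(buf_size) + bytearray(struct.pack("<I", RETURN_ADDRESS))
    stack[0:len(payload)] = payload        # unchecked copy, like strcpy
    return struct.unpack_from("<I", stack, buf_size)[0]


def ia64_like_frame(payload, buf_size=8):
    """ia64-style sketch: the return address lives in the register stack."""
    register_stack = [RETURN_ADDRESS]      # spilled local region, kept elsewhere
    memory_stack = bytearray(buf_size)
    # The same unchecked copy; Python just grows the bytearray instead of
    # trampling memory, but nothing reaches the register stack either way.
    memory_stack[0:len(payload)] = payload
    return register_stack[-1]


attack = b"A" * 8 + struct.pack("<I", 0xDEADBEEF)
print(hex(conventional_frame(attack)))  # 0xdeadbeef
print(hex(ia64_like_frame(attack)))     # 0x401000
```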

A function is free to subtract from the sp register to create temporary stack space (for string buffers, for example), which it of course must clean up before returning.

One curious detail of the stack convention is that the first 16 bytes on the stack (the first two quadwords) are always scratch. (Peter Lund calls it a "red zone".) So if you need some memory for a short time, you can just use the memory at the top of the stack without asking for permission. But remember that if you call out to another function, then that memory becomes scratch for the function you called! So if you need the value of this "free scratchpad" preserved across a call, you need to subtract from sp to reserve it officially.
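A toy Python model of the scratch area, under the assumptions that the scratch bytes are at [sp, sp + 16) and that a call does not move sp; alignment and the rest of the frame are ignored, and all names are hypothetical.

```python
SCRATCH = 16  # bytes at [sp, sp + 16) that are always scratch


class Stack:
    def __init__(self, size=256):
        self.mem = bytearray(size)
        self.sp = size - 64  # arbitrary starting point; the stack grows down

    def scratch_write(self, data):
        assert len(data) <= SCRATCH
        self.mem[self.sp:self.sp + len(data)] = data

    def scratch_read(self, n):
        return bytes(self.mem[self.sp:self.sp + n])


def callee(stack):
    # Whoever is running owns the scratch area at the current sp.
    stack.scratch_write(b"callee junk!")


def unsafe_caller(stack):
    stack.scratch_write(b"keep me")  # borrowed without asking
    callee(stack)                    # ...and now it belongs to the callee
    return stack.scratch_read(7)


def safe_caller(stack):
    stack.sp -= 16                   # reserve 16 bytes officially
    saved_at = stack.sp + SCRATCH    # above the callee's scratch area
    stack.mem[saved_at:saved_at + 7] = b"keep me"
    callee(stack)
    value = bytes(stack.mem[saved_at:saved_at + 7])
    stack.sp += 16                   # clean up before returning
    return value


print(unsafe_caller(Stack()))  # b'callee '
print(safe_caller(Stack()))    # b'keep me'
```

Note that after reserving, the value goes in the old scratch area, which now sits 16 bytes above the new sp and is therefore outside the callee's scratch area.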

One more curious detail about the ia64: A function pointer on the ia64 does not point to the first byte of code. Instead, it points to a structure that describes the function. The first quadword in the structure is the address of the first byte of code, and the second quadword contains the value of the so-called "gp" register. We'll learn more about the gp register in a later blog entry.

(This "function pointer actually points to a structure" trick is not new to the ia64. It's common on RISC machines. I believe the PPC used it, too.)
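A toy Python model of calling through such a descriptor. The `FunctionDescriptor` and `CPU` names are hypothetical, a Python callable stands in for the code address, and the per-module data table is invented purely to show why the same code needs its gp value alongside it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FunctionDescriptor:
    entry: Callable[..., int]  # stands in for the address of the first byte of code
    gp: int                    # value to load into the gp register


class CPU:
    def __init__(self):
        self.gp = 0

    def call_indirect(self, fptr, *args):
        # Calling through a function pointer uses both quadwords.
        self.gp = fptr.gp
        return fptr.entry(self, *args)


# Hypothetical per-module data, keyed by gp value for this toy model.
MODULE_DATA = {0x1000: 10, 0x2000: 20}


def read_counter(cpu):
    return MODULE_DATA[cpu.gp]  # the same code finds different data via gp


fp_a = FunctionDescriptor(entry=read_counter, gp=0x1000)
fp_b = FunctionDescriptor(entry=read_counter, gp=0x2000)
print(CPU().call_indirect(fp_a), CPU().call_indirect(fp_b))  # 10 20
```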

Okay, this was a really boring entry, I admit. But believe it or not, I'm going to come back to a few points in this entry, so it won't have been for naught.

Comments (35)
  1. Anonymous says:

    What is the size of the ‘red zone’ under Win32? I’ve often used this in assembly functions without even considering that it might be illegal.

  2. Anonymous says:

    The "red zone" exists only on ia64. Don’t try it on any other platform or you’ll corrupt the stack!

  3. Anonymous says:

    What is the "red zone?"

  4. Anonymous says:

    "One curious detail of the stack convention is that the first 16 bytes on the stack (the first two quadwords) are always scratch. (Peter Lund calls it a "red zone".)"

  5. Anonymous says:

    calling convention: The history of calling conventions, part 1; The history of calling conventions, part 2; The history of calling conventions, part 3; The history of calling conventions, part 4: ia64; Why do member functions need to be…

  6. Anonymous says:

    Raymond, just to make sure we’re talking about the same thing. Is this illegal:

    function entry:
        push ebx
        push esi
        mov [esp-4], eax   ; save eax temporarily
        ; do some stuff
        mov eax, [esp-4]
        ; do some more stuff
        pop esi
        pop ebx

    I believe that this should work as long as Windows saves more of the stack than just up to and including the stack pointer when doing context switches. Am I wrong?

  7. Anonymous says:

    Accessing memory at negative offsets from ESP is illegal. (The "red zone" for ia64 lives at positive offsets relative to the stack pointer.)

    If an exception is raised in "do some stuff", you will likely find that your secret hiding place for EAX got overwritten by the exception handler.
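A toy Python model of that failure. The stack is modeled in 4-byte slots, so index `esp - 1` plays the role of `[esp-4]`; `simulate_exception` is a hypothetical stand-in for exception dispatch pushing data onto the current stack and then resuming, not the real record layout.

```python
class X86Stack:
    """Toy stack in 4-byte slots: index esp - 1 plays the role of [esp-4]."""

    def __init__(self, slots=64):
        self.mem = [None] * slots
        self.esp = slots  # grows downward toward index 0

    def push(self, value):
        self.esp -= 1
        self.mem[self.esp] = value

    def pop(self):
        value = self.mem[self.esp]
        self.esp += 1
        return value


def simulate_exception(stack):
    # Hypothetical stand-in for exception dispatch pushing data onto the
    # current stack, then resuming (EXCEPTION_CONTINUE_EXECUTION).
    stack.push("EXCEPTION_RECORD")
    stack.push("CONTEXT")
    stack.pop()
    stack.pop()


s = X86Stack()
s.push("ebx")
s.push("esi")
s.mem[s.esp - 1] = "saved eax"   # mov [esp-4], eax
simulate_exception(s)
print(s.mem[s.esp - 1])          # EXCEPTION_RECORD
```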

  8. Anonymous says:

    I see. Of course I wouldn’t expect an exception handler (or any function call for that matter) to preserve this space, but neither would I set up an exception handler in an assembly function like this, so that hardly matters.

    But, before I rush off to rewrite all my assembler stuff ;-), does Windows currently preserve an area below the stack pointer when doing context switches (which is the only problem I can think of)?

  9. Anonymous says:

    Perhaps you won’t set up an exception handler, but your caller might have one, and the caller might decide to fix the exception and then return EXCEPTION_CONTINUE_EXECUTION. (For example, accessing [esp-4] might trigger a guard page exception, which is handled by the kernel.) Execution then resumes and your stack is corrupted.

    Context is processor state (registers, memory map), not memory values. A context switch changes the processor’s view of the world, but the world doesn’t change.

  10. Anonymous says:

    If a function has a float parameter, is that one still passed via the integer registers too? (Yeah, I know I really should read the Itanium manual.)

  11. Anonymous says:

    I didn’t think it was boring. (BTW, the PPC, assuming you mean PowerPC, *is* a RISC machine.)

  12. Anonymous says:

    All architectures have separate rules for floats, which I have been ignoring throughout this series since they aren’t really relevant to my point. (I have a point?)

    When I said, "It’s common on RISC machines. I believe the PPC used it, too." I meant "It’s common on RISC machines. For example, the PPC is a RISC machine and it uses this method too." Sorry about that.

  13. Anonymous says:

    If you’re really crazy you can write Win32 x86 assembly code with a "4GB red zone," by temporarily recycling ESP as an eighth general purpose register for an inner loop that doesn’t trip exceptions or access the stack. You can even stash ESP in the structured exception handling chain to make the routine reentrant. Uh, not that I’ve ever written code that did this….

  14. Anonymous says:

    Yup, I’ve seen people do this (use ESP as a general purpose register) – you’re playing with fire with this trick, since (as noted) the slightest mis-step and you’re toast. I’ve only seen it in intense graphics code which is trying to squeak that last cycle out of an image processing algorithm. Not for the faint of heart!

  15. Anonymous says:

    Thanks for your comments, Raymond. I’ve used the [esp-x] trick in an arbitrary precision floating-point library that I’ve developed that needed to be as fast as possible. So using EBP as a general purpose register was important here.

    When I get around to it, I will correct the code to adjust ESP before storing temp variables so they’re stored in a ‘safe’ area.

  16. Anonymous says:

    "This has the neat side-effect that a buffer overflow of a stack variable cannot overwrite a return address since the return address isn’t kept on the stack in the first place."

    So this means the end of buffer overflow attacks?

    I have wondered for a long time why the stack is still upside down nowadays. In the olden days this made sense: the data area and the stack area grew towards each other. But with today’s virtual memory I don’t see any sense in this anymore.

    If I remember correctly, the Intel processors now have a flag to indicate the direction of the stack.

    Why has this not been changed yet?

    I know this would cause some compatibility issues, but wouldn’t it be possible to reverse the stack direction only for new programs?

  17. Anonymous says:

    "some compatibility issues" is quite an understatement.

    Stack reversal requires that ALL code in the program (including DLLs that may be outside your control) be aware of the reversal. Changing the stack direction is a fundamental change to the ABI, since parameters are now at positive offsets, the "sub esp, nn" needs to change to "add esp, nn" etc.

    This means among other things that

    1. there would need to be two copies of every DLL in the OS, one compiled with stack-up (for stack-up programs to use) and one with stack-down (for stack-down programs to use).

    2. you couldn’t inject a stack-up DLL into a stack-down process or vice versa

    3. a program that hosts plug-ins (like Explorer or IE) would have to choose between being stack-up (new style – but old shell extensions will no longer work) or stack-down (old style – new shell extensions will not work), and once it chose it would be restricted to shell extensions that were compiled with a compatible stack direction.

    4. I can’t actually find this "stack-up" bit in my Intel docs.

  18. Anonymous says:

    These problems could be easily overcome.

    I’m sure you know how Intel code is executed on Alpha machines by converting it on the fly and then caching the resulting code.

    It would be easy to do the same.

    However, the real problem is: Is it worth the effort? The answer to that is probably "no".

    I just looked it up: the bit I remembered from long ago is the "expansion direction" bit in a segment descriptor (bit 2). This also means that a solution could be switching to a different stack when changing between stack-up and stack-down code. This could be implemented in the memory manager by putting them in different segments and generating a page fault that triggers a switch.

    Just an idea; I know it will never happen. But this could have solved the buffer overflow problem long ago.

    But it’s also quite possible that I misunderstood the meaning of this bit completely.

  19. Anonymous says:

    Um, expand-down and expand-up selectors are irrelevant here since Win32 is flat model, not segmented. They do not affect the meaning of the "push" instruction. They indicate whether the limit value is treated as a lower limit or an upper limit.

    E.g., if you specify that selector 013F has a limit of 0x10000 and it is expand-up, then valid addresses are 013F:00000000 through 013F:00010000. If you mark it as expand-down instead, valid addresses are 013F:00010001 through 013F:FFFFFFFF.

    But the "push" instruction always decrements ESP regardless of whether SS is an expand-up or expand-down selector.
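The only thing the expand-down bit changes is which offsets pass the limit check, which can be sketched as a predicate. This is a minimal sketch that ignores the granularity (G) bit; the function name is hypothetical, and the upper bound follows the descriptor's B flag.

```python
def offset_is_valid(offset, limit, expand_down, big=True):
    """Segment limit check, ignoring the granularity (G) bit.

    Expand-up: valid offsets are 0 through limit.
    Expand-down: valid offsets are limit + 1 through 0xFFFFFFFF (B flag set,
    big=True) or through 0xFFFF (B flag clear).
    """
    top = 0xFFFFFFFF if big else 0xFFFF
    if expand_down:
        return limit < offset <= top
    return 0 <= offset <= limit


for off in (0x0, 0x10000, 0x10001, 0xFFFFFFFF):
    print(hex(off), offset_is_valid(off, 0x10000, False), offset_is_valid(off, 0x10000, True))
```

Nothing in this check says which way push moves ESP; it only decides which accesses fault.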

    Recompiling code on the fly doesn’t help. The actual stack layout changed. Consider the following code fragment:

        push eax
        mov eax, esp
        mov eax, [eax+4]
        mov eax, [eax+4]

    This would have to be translated into:

        add esp, 4 ! mov [esp], eax
        mov eax, esp
        mov eax, [eax-4]
        mov eax, [eax+4]

    Notice that the last two instructions – even though completely identical – had to be translated differently, because in the first case, eax points into the stack in an attempt to walk "up" it, but in the second case, eax is now a pointer to a structure (not a stack walk) so no inversion occurs.

    So any converter would have to figure out whether any particular memory reference was an attempt to walk up the stack or whether it is just a structure member access. This is semantic information that is not available in raw binaries.

    And all this to accomplish what?

  20. Anonymous says:

    OK, you’re right, I got this all wrong. Sorry about that. I had noticed this bit long ago when I was more involved in low-level programming, and the assumption that it would actually influence instructions like push and ret was most likely wishful thinking.

  21. Anonymous says:

    How does this calling convention treat variadic functions?

  22. Anonymous says:

    I didn’t mention that the Win32 calling convention for ia64 passes only the first eight parameters in registers; the rest go on the stack. If a function is variadic, you call it like a normal function, but the function itself spills the first eight input registers (r32 through r39) onto the stack next to any possible parameters 9 and upwards, and then it treats the parameters as one giant array. This spilling needs to be done carefully to avoid a problem that I will discuss on Monday.
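A toy Python model of that arrangement: the caller splits arguments as for any call, and the variadic callee lays its eight input registers down next to parameters 9 and up so it can index them as one array. The names are hypothetical, `None` stands in for the garbage in unused input registers, and the care the spill actually requires is not modeled.

```python
REGISTER_ARGS = 8  # the first eight parameters travel in r32-r39


def pass_arguments(args):
    """Caller side: a variadic call looks like any other call."""
    return list(args[:REGISTER_ARGS]), list(args[REGISTER_ARGS:])


def spill_inputs(input_regs, stack_args):
    """Callee side: spill r32-r39 right next to parameters 9 and up."""
    spill_area = input_regs + [None] * (REGISTER_ARGS - len(input_regs))
    return spill_area + stack_args


def sum_ints(input_regs, stack_args):
    params = spill_inputs(input_regs, stack_args)  # one giant array
    count = params[0]           # first parameter: how many values follow
    return sum(params[1:1 + count])


def call(func, *args):
    return func(*pass_arguments(args))


print(call(sum_ints, 10, *range(1, 11)))  # 55
print(call(sum_ints, 2, 5, 6))            # 11
```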

  23. Anonymous says:

    As for floating point parameters, you should go read the Itanium manuals. They explain a lot better than I can.

    Basically, when you have a floating-point parameter, the calling convention specifies that you should use registers f8, f9, … through f15, in order. At the same time, a "hole" is left in the integer registers.

    So, if you have a function with 4 parameters, in which the third is a floating-point one, parameters 1, 2 and 4 will be in registers out0, out1 and out3, whereas the third parameter will be in f8. out2 will be left with garbage.

    However, if the compiler cannot be completely sure that the called function expects a floating-point parameter (see the other article on misdeclaring symbols), it’s supposed to pass the parameter in BOTH registers.

    If, however, the 8-integer-register limit has been reached, then even floating-point parameters go on the stack, even if there are free floating-point registers.

    PS: there are worse things possible, like having an 80-bit extended floating point have to be passed in integer registers, because it would require 2 of them.
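The slot-assignment rules in the comment above can be sketched as a Python function. This is a simplified sketch under stated assumptions: eight FP argument registers f8 through f15, every parameter consuming one integer slot, and no handling of 80-bit extended values or aggregates; `assign_parameters` is a hypothetical name.

```python
INT_SLOTS = 8                                  # out0-out7
FP_ARG_REGS = [f"f{n}" for n in range(8, 16)]  # f8-f15


def assign_parameters(kinds, prototype_known=True):
    """Map each parameter ('int' or 'float') to where the caller puts it."""
    locations = []
    next_fp = 0
    for slot, kind in enumerate(kinds):
        if slot >= INT_SLOTS:
            locations.append(("stack",))        # past eight slots, even floats
        elif kind == "float":
            fp = FP_ARG_REGS[next_fp]           # at most eight floats get here
            next_fp += 1
            if prototype_known:
                locations.append((fp,))          # out<slot> is left as a hole
            else:
                locations.append((fp, f"out{slot}"))  # unsure: pass in both
        else:
            locations.append((f"out{slot}",))
    return locations


print(assign_parameters(["int", "int", "float", "int"]))
# [('out0',), ('out1',), ('f8',), ('out3',)]
```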

  24. Anonymous says:

    Making the stack go the other direction wouldn’t eliminate buffer overflow attacks anyway. When you call something like strcpy to fill a buffer on the stack, there’s a return address on either side of it.

  25. Anonymous says:

    Excellent point, Josh. It isn’t the fact that the return address stack grows in the opposite direction that protects the ia64 from stack overflow return address smashing attacks. It’s the fact that the return addresses aren’t kept on the stack in the first place.

  26. Anonymous says:

    Who says you need to stick to just one stack?

    return addresses could go on one stack, small parameters/local variables (bools, chars, ints, pointers) could go on another, and big parameters/local variables (buffers/structs) could go on a third or be heap allocated.

    (okay, this is sort of what the IA64 does with the hardware register stack – but you don’t need the IA64 to implement something like this)

  27. Anonymous says:

    Peter: Yes, you could use multiple stacks but there are some serious problems with this.

    1. Having to recompile ALL code to conform to this new stack scheme. You couldn’t mix an "oldstyle stack" caller with a "newstyle stack" callee. It would probably be unreasonable to ship a version of Windows that was 100% incompatible with the previous version of Windows… (See previous discussion with Henk.)

    2. The paucity of registers on the x86 makes it a hard sell to lose one of its precious few registers as an "alternate stack register", much less TWO of them!

  28. Anonymous says:

    The history of calling conventions

  29. Anonymous says:

    The Itanium has two stacks, so don’t assume that there’s only one.

Comments are closed.