The Itanium processor, part 4: The Windows calling convention, leaf functions


Last time, we looked at the general rules for parameter passing on the Itanium. But those rules are relaxed for leaf functions (functions which call no other functions).

Before we start, I need to correct some of the explanation I had given when introducing the calling convention. I used that explanation because it makes for an easier conceptual model, but the reality is slightly different.

First of all, I said that the alloc instruction shuffles the registers around and lays out the new local region and output registers. In reality, it is the br.call instruction that moves the registers, and the alloc that sets up the register frame. Since the first instruction of a function is normally alloc, it doesn't make much difference how the work is divided between the br.call and the alloc, because they execute back to back. The only time you notice the difference is if you happen to break into the debugger between those two instructions.

More precisely, here's what the br.call instruction does:

  • Copy the current register frame state (and some other stuff) to the pfs register.

  • Rotate the registers so that the first output register is now r32.

  • Create a new register frame as follows:
    • input registers = caller's output registers
    • no local registers
    • no output registers
    • no rotating registers

    No registers need to be spilled at this point. Any spilling needed to make room happens at the alloc.

  • Set the rp register to the return address.
  • Transfer control to the target.

In other words, the register stack changes like this:

Before f does a br.call (f's Input, then f's Local, then f's Output):
r32 r33 r34 r35 r36 r37 r38 r39 r40 r41 r42 r43

On entry to g (g's Input only):
r32 r33 r34

After g does an alloc (g's Input, then g's Local, then g's Output):
r32 r33 r34 r35 r36 r37 r38 r39 r40 r41 r42
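
The frame bookkeeping described above can be sketched as a small C model. This is only an illustration under stated assumptions: the type and function names (Frame, RegStack, br_call, alloc_frame, br_ret) are hypothetical, only frame sizes are tracked rather than register contents, and the saved frame carries its own base index, which is a simplification of what the processor actually stores in pfs.

```c
#include <assert.h>

/* Simplified, hypothetical model of the register stack frame.
   Only sizes are tracked; register contents are not modeled. */
typedef struct {
    int base;  /* physical index of the register currently named r32 */
    int sof;   /* size of frame: inputs + locals + outputs */
    int sol;   /* size of the input + local portion of the frame */
} Frame;

typedef struct {
    Frame cfm;  /* current frame */
    Frame pfs;  /* caller's frame, saved by br.call (simplified) */
} RegStack;

/* br.call: save the caller's frame, then rotate so that the caller's
   first output register becomes the callee's r32. */
static void br_call(RegStack *rs)
{
    int outputs = rs->cfm.sof - rs->cfm.sol;
    rs->pfs = rs->cfm;
    rs->cfm.base += rs->cfm.sol;
    rs->cfm.sof = outputs;  /* callee's inputs = caller's outputs */
    rs->cfm.sol = 0;        /* no locals or outputs declared yet */
}

/* alloc: resize the current frame in place; the base does not move. */
static void alloc_frame(RegStack *rs, int inputs, int locals, int outputs)
{
    assert(inputs >= 0 && locals >= 0 && outputs >= 0);
    assert(inputs + locals + outputs <= 96);  /* r32 through r127 */
    rs->cfm.sol = inputs + locals;
    rs->cfm.sof = inputs + locals + outputs;
}

/* br.ret: restore the caller's frame. */
static void br_ret(RegStack *rs)
{
    rs->cfm = rs->pfs;
}
```

Starting from a 12-register frame with 3 outputs (as in the diagram), br_call leaves a 3-register frame whose r32 is the caller's first output register. A later alloc_frame call grows it to 11 registers without moving the base. The split between locals and outputs in that second frame is not stated in the diagram, so any particular split passed to alloc_frame is an assumption.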

A consequence of this division of labor between br.call and alloc is that leaf functions can take advantage of this default register frame: If a leaf function can do all its work with just

  • its input registers
  • scratch registers
  • the red zone

then it doesn't need to perform an alloc at all! It can use the default register allocation of "Caller's output registers become my input registers, and I have no local registers or output registers." When finished, the function just does a br.ret rp to return to the caller.

Note that this optimization is available only to leaf functions. If the function calls another function, then the br.call will overwrite the pfs and rp registers, which will make it hard to return to your caller when you're done.
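
To see why only leaf functions get away with this, here is a minimal C sketch of the two registers that every br.call overwrites. The names (LinkRegs, leaf_return_target, careless_return_target, careful_return_target) and the numeric addresses are hypothetical, and pfs is reduced to a single integer for illustration.

```c
#include <assert.h>

/* Hypothetical model of the two registers that every br.call overwrites. */
typedef struct {
    int rp;   /* return address */
    int pfs;  /* caller's frame state, reduced to one integer here */
} LinkRegs;

/* br.call replaces both registers with values describing the new caller. */
static void br_call(LinkRegs *regs, int return_address, int caller_frame)
{
    regs->rp = return_address;
    regs->pfs = caller_frame;
}

/* Leaf function: makes no calls, so rp is still intact at the br.ret. */
static int leaf_return_target(const LinkRegs *regs)
{
    return regs->rp;
}

/* Non-leaf function that forgets to save rp and pfs before calling out. */
static int careless_return_target(LinkRegs *regs)
{
    br_call(regs, 2000, 2);  /* the nested call clobbers rp and pfs */
    return regs->rp;         /* now points back into this function */
}

/* Non-leaf function that saves and restores both registers. */
static int careful_return_target(LinkRegs *regs)
{
    LinkRegs saved = *regs;  /* stands in for local registers holding rp and pfs */
    br_call(regs, 2000, 2);
    *regs = saved;
    return regs->rp;
}
```

In this model, a function entered with rp = 1000 returns to 1000 if it is a leaf or if it saves and restores, but the careless version returns to 2000, which is an address inside itself.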

The red zone is officially known as the scratch area. The first 16 bytes on the stack are available for use by the currently executing function. If you want values to be preserved across a function call, you need to move them out of the scratch area, because that memory becomes the scratch area for the function being called! In other words, the scratch area is not preserved across function calls.

A more obscure consequence of this division of labor between br.call and alloc is that a function could in principle perform alloc more than once in order to change the size of its local region or the number of output registers. For example, a function might start by saying, "I have five local registers and two output registers," and then later realize, "Oh, wait, I need to call a function with six parameters. I will issue a new alloc instruction that requests five local registers and six output registers." Although technically legal, it doesn't often occur in practice because it's usually easier just to set up your register state once and stick with it for the duration of the function.

A more common case where the alloc is not the first instruction is when a function has an early exit that can be determined using only leaf-available resources.

extern HANDLE LogFile;

void Log(char *message, char *file, int line)
{
 if (!LogFile) return;
 ... complicated logging code goes here ...
}

If profiling feedback indicates that logging is rarely enabled, then the compiler can avoid setting up all the registers and stack for the complicated logging code, at least until it knows that logging is enabled.

.Log:
      addl    r30 = -205584, gp ;; // get address of LogFile variable
      ld8     r30 = [r30] ;;      // fetch the value
      cmp.eq  p6, p0 = r30, r0    // is it zero?
(p6)  br.ret  rp                  // return if so

  // Okay, we are really logging. Set up our stack.
      alloc   r35 = ar.pfs, 3, 5, 6, 0 // set up register frame
      sub     sp = sp, 0x240      // set up stack buffers
      mov     r36 = rp            // save return address

      ... do complicated logging ...

      mov     rp = r36            // return address
      mov.i   ar.pfs = r35        // restore pfs
      br.ret.sptk.many  rp ;;     // return to caller

The first instruction calculates the effective address of the LogFile variable. We'll learn more about the gp register later.

The second instruction loads an 8-byte value from that address, thereby retrieving the value of LogFile.

The third instruction compares the value against r0, which is a hard-coded zero register. It asks for an equality comparison, putting the answer in the predicate register p6 and the complement of the answer in p0. Since p0 is hard-wired to true and ignores writes, this effectively throws the complement away.

The fourth instruction conditionally returns from the function if the comparison was true. In the common case where logging is not enabled, the function returns at this point. Only if logging is enabled do the alloc and related instructions execute to set up the stack frame and then perform the complicated logging.
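
The compare-then-predicated-return pattern can be modeled in C roughly as follows. This is a sketch, not processor documentation: the names (Predicates, read_pred, write_pred, cmp_eq, predicated_return) are hypothetical, and the model covers only the behavior described above, namely that the compare writes a result and its complement, and that p0 always reads as true and discards writes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the predicate register file. */
typedef struct {
    bool p[64];
} Predicates;

static bool read_pred(const Predicates *ps, int index)
{
    assert(index >= 0 && index < 64);
    return index == 0 ? true : ps->p[index];  /* p0 always reads as true */
}

static void write_pred(Predicates *ps, int index, bool value)
{
    assert(index >= 0 && index < 64);
    if (index != 0) {
        ps->p[index] = value;  /* writes to p0 are discarded */
    }
}

/* Models "cmp.eq pt, pf = a, b": pt gets the answer, pf the complement. */
static void cmp_eq(Predicates *ps, int pt, int pf, uint64_t a, uint64_t b)
{
    write_pred(ps, pt, a == b);
    write_pred(ps, pf, a != b);
}

/* Models "(qp) br.ret rp": true means the function returns here. */
static bool predicated_return(const Predicates *ps, int qp)
{
    return read_pred(ps, qp);
}
```

With a LogFile value of zero, cmp_eq(&ps, 6, 0, 0, 0) sets p6 and the predicated return is taken. With a nonzero value, p6 is clear and execution falls through to the alloc.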

This is an example of an optimization known as shrink-wrapping. Shrink-wrapping occurs when a function does some work with a temporary stack frame, and then expands to a full stack frame only if it is needed. (Shrink-wrapping entails a few extra entries in the unwind exception table because the unwinding needs to take place differently depending on where in the function the exception occurred. I'll spare you the details.)
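
A source-level analogy may help, with the caveat that real shrink-wrapping is done by the compiler inside a single function, not by splitting the source. In the sketch below, the names log_file, log_slow, log_message, and slow_path_calls are hypothetical, and a FILE* stands in for the HANDLE so that the code compiles without Windows headers. The fast path needs nothing beyond its parameters and one global load; only the slow path pays for a 0x240-byte buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the article's LogFile and logging code. */
static FILE *log_file = NULL;
static int slow_path_calls = 0;  /* counts how often the full frame is needed */

/* Slow path: needs a large buffer, so it pays for a full stack frame. */
static void log_slow(const char *message, const char *file, int line)
{
    char buffer[0x240];
    assert(log_file != NULL);
    slow_path_calls++;
    snprintf(buffer, sizeof buffer, "%s(%d): %s\n", file, line, message);
    fputs(buffer, log_file);
}

/* Fast path: one load, one compare, and an early return. */
static void log_message(const char *message, const char *file, int line)
{
    if (!log_file) return;
    log_slow(message, file, line);
}
```

While log_file is null, calls to log_message never reach log_slow, so the buffer is never set up. Once log_file is pointed at a stream such as stdout, each call goes through the slow path.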

Okay, that's all for leaf functions and getting to the bottom of the whole br.call / alloc dance. Next time, we'll look at function pointers and the funky gp register. Here's something to whet your appetite: On ia64, a function pointer is not the address of the first instruction in the function's code. In fact, it's nowhere near the function's code.

Comments (10)
  1. Maybe I'm missing something here, but what prevents a program from writing to registers that aren't within its local region or output?  Couldn't a malicious function manipulate values in earlier registers and then unwind?

  2. anonymouscommenter says:

    Something about an airtight hatchway…

    A malicious function could do /anything it wants to/.

  3. anonymouscommenter says:

    I suppose in principle a function could not use alloc and still call a function. All it has to do is push pfs and rp. Good luck unwinding that though.

    [In fact, it's nowhere near the function's code.]

    That would make trampolines ugly.

  4. anonymouscommenter says:

    If you want hardware-enforced security across function calls, you need something like the iAPX 432. It was a huge failure – much more so than Itanium.

  5. anonymouscommenter says:

    Maybe I'm missing something, but I cannot see where br.call might have to spill anything ever.

    Did you perhaps fail to separate it cleanly from the alloc instruction?

    [Oops, you're right. The spilling occurs at the alloc. -Raymond]
  6. anonymouscommenter says:

    I have to imagine that in the event of a tail call optimization, a new alloc would need to take place. Or does the caller undo its alloc before doing the optimized tail call? I couldn't find any information on what it takes to eliminate a tail call on IA64.

  7. anonymouscommenter says:

    Joshua: Itanium function pointers point to a small object with two fields. One field is the actual code pointer, the other is the global pointer for that function. Calling a function through a pointer involves loading the global pointer in the gp register, and branching to the code pointer. So everything works pretty much the same, except with an extra level of indirection. I forget if it was MIPS or PowerPC that had a similar ABI

  8. anonymouscommenter says:

    Gabe: A tail call doesn't necessarily have to jump to the entry point of a function. I'd have thought a jump to the instruction just after the usual prolog would be correct – e.g. in Raymond's example, if logging is (for some reason) using a tail call, the jump would be to the first instruction of …do complicated logging…

  9. anonymouscommenter says:

    AMD64 also has a red zone, at least in AMD's original ABI. I don't recall if Microsoft's AMD64 ABI kept it.

    [Yup, still there. -Raymond]
  10. Harry Johnston says:

    It seems to me that in that particular example, it would be more efficient to inline the if command, so the function call only occurs if necessary.  But I guess that would be difficult to do across module boundaries.
