The Itanium processor, part 5: The GP register, calling functions, and function pointers


We saw a brief mention of the gp register last time, where we saw it used when we calculated the address of a global variable.

The only addressing mode supported by the Itanium processor is register indirect (possibly with post-increment). There is no absolute addressing mode. If you want to access a global variable, you need to calculate its address, and the convention for this is that the gp register points to the module's global variables. If you want to access a global variable stored at offset n in the global data segment, you do it in two steps:

        addl    r30 = n, gp ;;    // r30 -> global variable
        ld4     r30 = [r30]       // load 4 bytes from the global variable

The name gp stands for global pointer since it is the pointer used to access global variables. (Note that since immediates are signed, the range of values of n is −2MB to +2MB.)

Those of you familiar with the PowerPC will recognize this model, since it is very similar to the Table of Contents model, except that Itanium uses a single table of contents for the entire module, as opposed to the PowerPC which gives each function its own table of contents.

The Itanium addl instruction is limited to a 22-bit immediate, which provides a reach of 4MB. This means that the above pattern is viable only for 4MB of global variables. Since some modules have more than 4MB of global data, the compiler separates global data into two categories, large and small. Small data objects are stored directly in the global data segment, but large data objects are not. Instead, the large data object is placed outside the global data segment, and all that is placed in the global data segment is a pointer to the large object. This means that accessing a large object actually takes three instructions.

        addl    r30 = n, gp ;; // r30 -> global variable forwarder
        ld8     r30 = [r30] ;; // r30 -> global variable
        ld4     r30 = [r30]    // load 4 bytes from the global variable

We see that it is vitally important that the gp register be set properly. Otherwise, the code has no idea where its global variables are. The Itanium calling convention says that on entry to a function, the gp register must be set to that function's global pointer.

Okay, so if you're going to call a function, how do you know what global pointer it expects?

Since all functions in the same module share the same global variables, the answer is easy if you are calling a function within the same module: You don't need to do anything special with gp, since the caller's gp is the same as the callee's gp. You also don't need to perform an indirect call; you know where the target is and can use a direct br.call OtherFunction.¹

On the other hand, if you are calling a function through a function pointer, then the target of the call might belong to another module. How are you supposed to know what the target function wants gp to be?

The answer is that on Itanium, a function pointer is not the address of the first instruction. Rather, it is a pointer to a structure containing two pointers. The first pointer in the structure points to the first instruction of the target function. The second pointer is the target function's gp. Therefore, calling a function through a function pointer looks like this:

        // suppose the function pointer is in r30
        ld8     r31 = [r30], 8 ;;       // get the function address
                                        // then add 8 to r30
        ld8     gp = [r30]              // get the function's gp
        mov     b6 = r31                // move to branch register
        br.call.dptk.many rp = b6 ;;    // call function in b6
        or      gp = r41, r0            // gp = r41 OR 0 = r41

First, we load the address of the first instruction into the r31 register, using a post-increment addressing mode so that r30 after the instruction points to the callee's gp.

Next, we load the gp register with the callee's gp. Simultaneously, we move r31 to b6 so that we can use it as the target of the br.call. (Branch registers cannot be the target of a ld8 instruction, which is why we needed to use r31 as a middle-man.)

Now that gp is set up properly, we can call the function through the branch register.

After the call returns, the gp register is now whatever value is left over by the function we called. We need to set gp to the current function's global pointer, which for the sake of example we'll assume had been saved in the r41 register.

There's yet another wrinkle: The naïve imported function. In the case of an imported function not declared with the dllimport attribute, the compiler doesn't know that the function is imported. It acts as if the function is part of the current module. On x86, this is handled by making a stub function which jumps to the real (imported) function. On Itanium, the same thing is done, with a stub function that looks like this:

.ImportedFunction:
        addl    r30 = n, gp ;;      // r30 -> function descriptor
        ld8     r31 = [r30], 8 ;;   // get the function address
                                    // then add 8 to r30
        ld8     gp = [r30]          // get the function's gp
        mov     b6 = r31            // move to branch register
        br.cond.sptk.many b6 ;;     // jump there

The stub function loads the gp register with the value expected by the imported function then jumps to the imported function. Unconditional computed jumps are encoded as conditional jumps where the qualifying predicate is p0, which is always true.

The possibility that any function is really a stub function for an imported function creates a problem for the compiler: Since any function could be an imported function in disguise, the compiler must assume that any function call may result in the gp register being trashed. Therefore, the compiler needs to restore the gp register after any function call.

Now, the above pessimistic assumption can be relaxed if the compiler has other information available to it. For example, if the function being called is in the same translation unit, then the compiler can see by inspection that the target function is not a stub and therefore can elide the restoration of gp. Similarly, if link-time code generation is enabled, then the linker can see all the code in the module and see whether the target function is a stub or a real function.

Exercise: How does tail-call elimination affect this optimization?

Bonus reading: Programming for 64-bit Windows, which spends nearly all its time talking about the gp register.

¹ The direct call instruction has a reach of 16MB, so if the function you want to call is too far away, the linker redirects the br.call to a stub function which in turn jumps to the final destination.

    br.call.dptk.many stub_for_OtherFunction
    ...

stub_for_OtherFunction:
    ... jump to OtherFunction ...

You have a few options for jumping to the function.

  • If the stub is within 16MB of the target, it can use a br.cond direct jump:
stub_for_OtherFunction:
    br.cond.sptk.many OtherFunction
  • The stub can load the target address from the data segment and use an indirect jump:
stub_for_OtherFunction:
    addl r3 = n, gp ;;  // look up the function address
    ld8  r3 = [r3] ;;   // fetch it
    mov  b6 = r3 ;;     // prepare to jump there
    br.cond.sptk.many b6 ;; // and off we go
  • The stub can load a self-relative offset from data stored in the code segment, then add that offset to the address it was loaded from to determine the target:
stub_for_OtherFunction:
    mov  r3 = ip ;;     // get current location
    addl r3 = n, r3 ;;  // r3 -> where the offset is stored
    ld8  r2 = [r3] ;;   // load the offset
    add  r2 = r2, r3 ;; // add the offset to the slot's address
    mov  b6 = r2 ;;     // prepare to jump there
    br.cond.sptk.many b6 ;; // and off we go

This last case is tricky because the Itanium conventions forbid relocations in the code segment; all code is position-independent. Therefore, the data in the code segment must not be relocatable. We work around this by storing an offset rather than the absolute address and applying the offset at runtime.

Comments (22)
  1. Joshua says:

    I'm surprised the Windows team didn't reuse the old rule in C that allows sizeof (void (*)()) > sizeof (void *).

  2. anonymouscommenter says:

How do you not totally hate Itanium?

  3. Xv8 says:

    @Joshua

Because you would break so much existing code. (POSIX defines sizeof(void(*)()) <= sizeof(void *), and it's been an assumption on win16 and existing win32 code since forever).

  4. anonymouscommenter says:

    The lack of a reg+imm addressing mode is a strong contender for the single worst design decision in this architecture.  And the existence of the postincrement mode (together with the rotating register feature) tells you something about the kind of code the designers thought was worth caring about -- namely, the same kind of code a DSP is good at.

    In fact, there's a case to be made that the Itanium isn't a general purpose CPU at all, but rather a ridiculously feature-rich DSP.

  5. anonymouscommenter says:

    1, 2, 3, 3b, 5?

    [Actually, 1, 2, 3, 4, 3b, 5. -Raymond]
  6. Zan Lynx' says:

    Zack, how many bits would a reg+imm instruction take up? Now compare that to doing a register load, add a value to the register, and load indirect from that register.

    Remember that it is a bit like RISC. Instructions have to fit into slots. They cannot be just whatever length is convenient.

    So just believe in your mind that the three instructions make a single reg+imm address mode.

  7. anonymouscommenter says:

    @zack A load of the Itanium designers were poached by AMD in 2006, so hopefully they learnt their lesson about what kind of code a CPU needs to be good at. They may not have done, AMD had already made their 64 bit chips and Intel were nowhere close to matching them but since then Intel have really excelled and AMD are essentially back where they started trying to make money on cheap hardware.

    In the mean time Itanium has been like the Terminator, it absolutely will not stop, ever. Anyone who tries to stop supporting it gets sued. However Kittson is supposed to be out soon/now and there is no sign. Maybe Intel has figured that not enough people care anymore.

  8. anonymouscommenter says:

    > Zack, how many bits would a reg+imm instruction take up? ... Remember that it is a bit like RISC.

    Somehow reg+imm addressing modes are very common even on RISC architectures, even on those with 32-bit instructions. The Itanium needs two more bits to address each register and a typical RISC ISA has ld/str instructions with two register operands, which would take up only 4 of the 9 extra bits per instruction. ARM, for example, supports predicating all (or at least almost all) instructions and yet supports reg+imm.

    Now, you may well be right that Intel decided that they needed more than even five extra bits/instruction on other things, but that's not necessarily contradictory to believing that's a bad decision. :-)

  9. anonymouscommenter says:

    Normally I love all the arcane details you give us, but this is the most boring series of blog posts EVER.  I remember looking at disassembly for this processor, and decided my days as an asm coder were over.  Then the x64 arrived, and I got a contract to port old x86 code over - and I was back in business!  Still, x64 asm is hard to read, too bad they couldn't think up slightly more meaningful mnemonics than "r#".  

     "Anyone who tries to stop supporting it gets sued."

    Well, then, my prescience was fortuitous, because I can't get sued for stopping, because I never started!

     "In fact, there's a case to be made that the Itanium isn't a general purpose CPU at all, but rather a ridiculously feature-rich DSP"

    That is the best explanation I ever heard for it!  

  10. anonymouscommenter says:

    @boogaloo: Who is included in "anyone" other than Oracle?

  11. Zack says:

    As Evan points out, reg+imm is ubiquitous or nearly so on classic RISC, and for damn good reason: it's good for something like 33% code size reduction on SPECint-like code.  Yes, 33%.  That is how often it gets used.  Itanium doesn't need it for spilling registers to the stack, which account for a fair chunk of that 33%, but it does need it for globals, structure members, vtable slots, unrolled loops (which Itanium otherwise *loves*), aggregates on the stack, etc.

    I don't *know*, but I suspect it was left out not because of any lack of instruction encoding space, but because they didn't want complex address-generation circuitry and/or could send memory accesses out to the cache a pipeline stage earlier without it.

    Many of the Itanium design decisions I don't like, have in common that they make code less dense, particularly when there isn't much ILP to be had.  Code density translates directly to better I-cache utilization and thus to better performance across the board.  I *think* it was obvious even at the time that efficient cache utilization was becoming *the* dominant factor in overall performance, but I could be misremembering.

  12. anonymouscommenter says:

    Itanium doesn't have any addressing mode because it is supposed to be explicit parallel with absolutely no form of reordering inside the processor, making the load pipeline longer by adding address modes would either reduce clock or increase load latency to 2 cycles, in other words, the same latency as addl/ld8 sequence, Intel avoided multi-cycle latency instructions at all costs, even integer multiplication suffered as a result.

    About code density, at 128 bits per 3 instructions is outrageous, x86 lives with an average of 3 bytes per much more complex instruction and ARM is a mix of 16 and 32 bits instructions, lack of addressing modes is the least of code density problems.

  13. Muzer_ says:

    @ZZZzzz... snarf I disagree with your first sentence.

  14. anonymouscommenter says:

    @Raymond, sorry for the content-free comment, but I also disagree with "ZZZzzz... snarf": this has been one of your most interesting blog series so far. It's one thing to read dry descriptions about that esoteric beast of an architecture, it's another thing to read about it from somebody who worked with it before it was known to be a dead end.

    And for those who need some "brain bleach" after reading about IA-64, here's the ISA documentation for a very clean modern RISC variant: riscv.org/.../riscv-spec-v2.0.pdf (from the makers of the original RISC).

  15. Bill P. Godfrey says:

    I'm enjoying this and I'd be very interested in a similar series on the ARM processor.

  16. Christian says:

    You must have waited for this question: What happens if a module wants to have 524289 global objects?

    [That issue never came up. I assume you got a linker error. -Raymond]
  17. Bulletmagnet says:

    The "Programming for 64-bit Windows" link doesn't work for me: it redirects to "MSDN Magazine Issues and Downloads" (msdn.microsoft.com/.../ee310108.aspx)  :(

  18. acq says:

    Anybody knows when approximately the article "Programming for 64-bit Windows" appeared, it's hard finding it otherwise? Unfortunately old issues of MSDN mag are CHM files between 2003 and 2008 and online only from 2009 on, so even search doesn't work. Archive.org seems not to have it too under the given ee310108.aspx link.

  19. Mark says:

    Bulletmagnet: web.archive.org/.../bb985017.aspx

    (That was hard.)

  20. Mark says:

    acq: ee310108 is the one that MSDN redirects to now that the pre-2003 issues have been removed. You need to use the one Raymond linked (or my previous link).

  21. acq says:

    Oh, and I've also enjoyed reading these articles, I don't care that I've never worked with the computer with such a CPU and that they are probably never going to be used more.

  22. Andrei Warkentin says:

    While perhaps the NT LE calling convention was different, at least for the PowerPC64 ABI the TOC is per-module, so each function in a module is entered with the same TOC value. It would be strange to have a per-function TOC, given that the TOC describes a module's globals, and maintaining multiple TOC tables would be a strange burden. Now, every function has their own calculation for computing the TOC when called as global function, because the TOC = function address + offset.

    Btw, it would be great to see a line of articles about NT on the PPC PreP machines...


Comments are closed.
