The MIPS R4000, part 10: Trampolines and stubs

We saw earlier that the relative branch instructions have a reach of ±128KB, but what if the function you want to call is further away than that?

The linker detects that the branch target is too far away and creates a trampoline stub, the same way the Alpha AXP compiler did. The branch instruction is then rewritten to be a branch to the trampoline, and the trampoline performs another jump to the final destination.

    BAL     dest_trampoline ; was "BAL dest"
    J       dest
    NOP                     ; branch delay slot

The limited reach of the relative branch instructions means that a single function cannot be larger than 256KB.

On the other hand, the existence of the JAL instruction means that the compiler doesn't really need to use BAL at all. Trampolines come into play only with conditional calls, and those are relatively rare.

If a function is naïvely-imported, then the compiler will generate a normal branch-and-link instruction, and the import library will provide a stub that in turn jumps indirectly through the import table. Doing this requires a scratch register, and that's where the at register once again enters the picture:

    BAL     imported_function_stub

    LUI     at, XXXX
    LW      at, YYYY(at)    ; load from import address table
    JR      at              ; and jump there
    NOP                     ; branch delay slot

Where XXXX and YYYY are computed in the usual way, namely, so that (XXXX << 16) + (int16_t)YYYY is the address of the import address table entry.

If you are unlucky enough that you are calling an imported function naïvely, and the imported stub is beyond the reach of the BAL instruction, then you will have to bounce through two trampolines! The first was generated by the linker to help you reach the stub, and the second came from the import library to get you from the stub to the final destination.

This double-trampoline could also happen on the Alpha AXP, but it was less common because the Alpha AXP's relative branch instructions have a reach of ±4MB, as opposed to the MIPS R4000's relative branch instructions, which can reach only ±128KB.

Comments (8)
  1. Antonio Rodríguez says:

    Most (all?) 8-bit processors store the displacement of relative jumps in a single byte, so they are limited to a range of ±127 *bytes* (more like -126 to +129, taking into account the instruction’s two bytes). In my days of programming the Apple II in machine code and assembler, I didn’t found that limited range to be a big problem (I rarely had to use a trampoline). And I’d say that it’s good, because it forces to think about your code: if your functions are so big that you have to ask what the relative addressing limit is, then you are doing something wrong.

    1. Joshua says:

      I encountered the x86 byte limit too frequently. Mid-function trampolines. Joy.

      On an unrelated note a size-speed optimization is available: all the nops between the trampolines of the second form can be omitted for a minor speed penalty if the trampoline was correctly predicted. The nops after the last one has to remain or something other than the at register gets corrupted.

      1. On the other hand, packing the trampolines too tightly means that you have jump instructions so close to each other that they occupy the same slot in the branch predictor, so one of them will always mispredict. Hopefully your trampolines are not in high-performance code paths.

    2. Neil says:

      Aggressive inlining perhaps?

  2. Erik F says:

    Would the old 8086 “anti-test” pattern be used by the compiler anymore, or does that mess up the branch predictor (e.g. instead of “JZ”, you use “JNZ $+xx” followed by “JMP yyyy”)? (BTW, is that basically a trampoline? I really never heard of that term until the late 1990s.)

    1. If the branch target is another function, the distance isn’t known until link time, at which point it’s too late to re-run the compiler to say, “Hey, could you use the anti-test pattern for this specific BGEZAL instruction?” (Has no effect on the branch predictor since you merely exchanged one conditional branch for another with the exact opposite taken/not-taken pattern. All branch predictors I’ve seen are symmetric with respect to taken and not-taken. And the unconditional branch doesn’t need to be predicted at all.)

      1. Erik F says:

        That’s fair: I was only thinking of intra-function (and possibly intra-module without LTO) jumps.

      2. smf says:

        It’s not too late, that is what Link Time Optimisation is for. I guess nobody thought it was worth the investment back in the R4000 days.

Comments are closed.

Skip to main content