The MIPS R4000, part 9: Stupid branch delay slot tricks

Last time, we learned about the MIPS branch delay slot. Today, we'll look at some tricks you can play with the branch delay slot.

First trick: It is legal to jump into a branch delay slot. Of course, it's not a branch delay slot when you do that. This lets you write some wacky-looking code:

    B       somewhere           ; unconditional branch
    OR      v0, zero, zero      ; v0 = 0

When the unconditional branch is taken, the v0 register is set to zero before execution continues at the branch destination.

Meanwhile, if somebody jumps to label, then execution continues at label, which sets v0 to zero, and then continues with other stuff.

The instruction at label acts both as the branch delay slot for the unconditional branch that precedes it, but it's also the first instruction in the basic block if somebody jumps directly into it.

I've seen the opportunity arise for this sort of "squeeze out a single instruction" optimization, but the Microsoft compiler doesn't take advantage of it. Which is probably a good thing. (For one thing, it makes it much harder for binary transformation tools to decompose a program into basic blocks and recombine them in different ways.)

Another stupid branch delay slot trick is editing the return address as part of the jump.

    BAL     somewhere
    ADDIU   ra, ra, 4


    NOP ; the routine returns here!

The BAL instruction sets the ra register to point to the instruction after the branch delay slot, which in our case is the first NOP. But in the branch delay slot, we modify the ra register, so that when execution reaches the start of the called procedure, it gets an artificial return address.

I'm told this sort of trick is used by some compilers to combine a call and an unconditional jump into a call with fake return address. For example, in this code fragment

    if (...) {
    } else {
    // resume
    x = 0;

the call to function1 is probably followed by an unconditional jump to skip over the else branch.

    BAL     function1
    NOP                     ; garbage in the branch delay slot
    B       resume
    OR      v0, zero, zero  ; set x = 0

    ... else-branch code goes here ...

    OR      v0, zero, zero  ; set x = 0

A sneaky compiler could generate the following code:

    BAL     function1
    ADDIU   ra, ra, resume - nominal_return ; tweak return address

    ... else-branch code goes here ...

    OR      v0, zero, zero  ; set x = 0

In the branch delay slot, we edit the return address so that when function1 returns, it resumes execution at resume rather than nominal_return, thereby avoiding having to execute another branch instruction. (We also were able to remove the duplicate OR v0, zero, zero instruction that had been hoisted into the branch delay slot of the unconditional branch.) Note that you get this savings only because you had a garbage NOP in the branch delay slot. If there were a useful instruction there, then the transformation would go like this:

    // original code
    BAL     function1
    MOVE    a0, r0      ; set parameter for function
    B       resume
    OR      v0, zero, zero  ; set x = 0

    // sneaky code
    MOVE    a0, r0      ; set parameter for function
    BAL     function1
    ADDU    ra, ra, ... ; tweak return address

    OR      v0, zero, zero  ; set x = 0

The instruction in the BAL instruction's branch delay slot would have to go somewhere else, so you didn't save any time (though you still saved one instruction of space by avoiding duplication of the OR v0, zero, zero).

But as we saw earlier, this trick defeats the return address predictor,¹ so it's probably a bad idea.

Okay, next time, we're going to look at the calling convention a bit more closely.

Bonus chatter: Another extra sneaky trick is reusing the return address. Suppose your interpreter loop goes like this:

void interpreter_loop(interpreter_state* state)
 for (;;) {
  uint32_t opcode = *state->pc;
  jump_table[opcode](state, opcode, state->pc);

The interpreter loop just dispatches to the next opcode forever. Presumably you would break out of this loop with a longjmp or some other nonlocal transfer.

The handler function is given the current interpreter state (so it can update it), and as a courtesy, it also gets the current opcode and a pointer to the next unparsed byte as a convenience.

    MOVE    s0, a0       ; s0 points to the interpreter state
    LA      s1, jump_table
    LA      ra, next_opcode ; Footnote ²
    LW      v1, 80(s0)  ; get address of next opcode byte
    ADDU    a2, v1, 1   ; move to next opcode byte (also argument for handler)
    LBU     a1, 0(v1)   ; load current opcode byte (also argument for handler)
    SW      a2, 80(s0)  ; save pointer to next opcode byte
    SLL     t0, a1, 2   ; multiple by 4 to index jump table
    ADDU    t0, t0, s1  ; calculate entry in jump table
    LW      v0, 0(t0)   ; load the jump target
    JR      v0          ; jump to handler - will return to next_opcode
    MOVE    a0, s0      ; argument for handler

When we call the first handler, ra is set equal to next_opcode. That handler will do its work and then return to the caller by restoring the return address to the ra register and performing a JR ra.

This means that when control returns to next_opcode, you know that ra is equal to next_opcode! Since that's the value you wanted to be in that register anyway, you can just leave it there when you jump to the next handler, saving you the trouble of having to branch back up to next_opcode explicitly.

This seems to be a really clever trick, but it is probably not that useful in practice because of that return address predictor thing.

¹ On the other hand, the MIPS R4000 does not have separate opcodes for "jump indirect to register" and "jump indirect to register for the purpose of returning"; it uses the JR instruction for both cases.

The inability to distinguish whether a jump instruction was semantically a return instruction was a non-issue in the original implementation of the MIPS architecture. It had only a two-stage pipeline, so the single branch delay slot was sufficient to avoid ever needing to predict any branches at all.

The MIPS R4000 had a four-stage pipeline, and a branch misprediction would consequently suffer a 2-cycle stall. The MIPS designers codified existing practice and retroactively declared that if the register operand in the JR instruction is ra, then it predicts as a subroutine return; otherwise it predicts as a computed jump.

² For extra sneakiness (and to save an instruction),³ the loop preparation code could have been written as

    LA      s1, jump_table
    BAL     next_opcode
    MOVE    s0, a0       ; s0 points to the interpreter state

This version lets the processor calculate the address of next_opcode by performing a BAL. This sets the return address to the instruction after the branch delay slot, which is next_opcode, and then jumps to… next_opcode, which is where the instruction would have gone anyway.

³ Mind you, this size savings costs you a pipeline stall. See footnote 1.

Comments (13)

  1. AberAber says:

    Another “trick”, is you can branch to yourself. I don’t know how useful, except for hanging your program, but you can do it (branch offset of 0). Perhaps with the display slot you can do something simple recursively.

  2. Cesar says:

    Speaking of the return address predictor, one of the tricks used to workaround parts of the Spectre vulnerability is to confuse the return address predictor so that the speculative execution goes to somewhere innocuous (like an infinite loop and/or a speculation barrier), while the real execution goes somewhere else. That is, it works by deliberately doing the opposite of the recommended way to deal with a return address predictor.

  3. mZ says:

    If you jump into delay slot and interrupt occurs, the interrupt handler will modify return address to point to the branch instruction and program will misbehave. Am I right?

    1. If you jump into someone else’s delay slot, then it’s not a delay slot.

      1. Matteo Italia says:

        That’s a thing I was wondering – disregarding interrupts, if you jumped in the delay slot did it get executed twice?

      2. mZ says:

        How does interrupt handler know that it’s not a delay slot?

        1. smf says:

          A delay slot is created by a branch executing directly before it, what is in the delay slot never knows that it’s a delay slot. If you jump into what is a branch delay slot, then the preceding branch has it’s delay slot elsewhere.

          With MIPS it helps to think of the pipeline, so a load or a branch doesn’t update the registers directly. The register you load into or the program counter doesn’t get updated until the next instruction has been executed.

    2. Someone says:

      To quote Raymond: “the BD bit is set if the exception occurred in a branch delay slot.”

      1. mZ says:

        Thank you. I overlooked this.

  4. Joshua says:

    It has the value of ra already; I don’t get why the prediction is 100% accurate unless ra changed in the previous 2 instructions.

  5. Stéphan Leclercq says:

    What if the instruction in the delay slot is itself a jump instruction?

    1. As noted in the previous article, that is not allowed.

  6. kme says:

    Would it also work to branch into the delay slot of the same branch? eg:

    BEQ r1, zero, label
    ADDU r2, r3

    Here the straight-line code performs the add once, but the branch performs it twice.

Skip to main content