The MIPS R4000, part 8: Control transfer


Let's just get this out of the way.

The MIPS R4000 has branch delay slots.

Ugh.

When you perform a branch instruction, the instruction after the branch instruction is executed, even if the branch is taken. The branch itself is delayed by one instruction.

This takes a lot of getting used to.

And to make things even more confusing, there are situations where the instruction in the branch delay slot is ignored, But lets not get into that yet.

Here's a basic example of a branch delay slot:

    BEQ     r1, zero, dest      ; branch if r1 is equal to zero
    OR      r1, zero, zero      ; set r1 = 0; this line executes regardless
    ...
dest:
    ADDI    r2, zero, 1         ; set r2 = 1

The OR instruction sits in the branch delay slot, and it will execute regardless of whether the branch is taken. It still executes after the branch instruction, so don't think of the entire branch instruction as executing after its delay slot. Only the control transfer part executes late. (In the above example, the BEQ tests the previous value of r1, not the value set by the OR instruction that sits in the delay slot.)

In the above example, the sequence of events is as follows, with time proceeding to the right.

Fetch and decode BEQ instruction. The condition is true, so fetch the next instruction from dest.
Fetch and decode OR instruction. Set r1 to zero.
Fetch and decode ADDI instruction. Set r2 to 1.

When the branch instruction executes, the fetch and decode of the instruction in the branch delay slot is already under way. Instead of throwing away that work, the processor says, "Well, I may as well finish what I've started, seeing as I've already paid for it," and it polishes off its drink before leaving the table to fetch and decode the instruction at the branch destination. The result is that the control transfer doesn't happen until one straggler instruction has already executed.

Basically, this is a trick to avoid a pipeline bubble during branching without needing complicated speculation circuitry. On a system without branch delay slots, the processor has a lot of decisions to make the moment it decodes a branch instruction.

  • A branch predictor decides whether instruction fetching and decoding continues from the branch-taken or the branch-not-taken code path.
  • Speculation circuitry executes the operations from the predicted code path, but doesn't commit the results. Any side effects (such as register updates, memory updates, and exceptions) must be suppressed until the processor determines whether the branch is taken. If speculation was correct, than all side effects are realized. If speculation was incorrect, than all side effects are discarded.

Branch delay slots remove all this complexity. The processor always fetches and executes the straight-line code. And the processor has determined whether the branch is taken by the time it needs to fetch an instruction beyond the the branch delay slot. This means that you don't need a branch predictor, return address predictor, or speculation, thereby reducing chip complexity.

Unfortunately, branch delay slots also expose the internal pipelining. If a future version of the processor has a different pipeline depth, it still needs to emulate the old pipeline timing.¹

Okay, before we get even more bogged down in the intricacies of branch delay slots, let's look at the control transfer instructions. First, the conditional transfers:

    ; all comparisons are signed
    BEQ     rs, rt, dest    ; branch if rs = rt
    BNE     rs, rt, dest    ; branch if rs ≠ rt
    BGEZ    rs, dest        ; branch if rs ≥ 0
    BGTZ    rs, dest        ; branch if rs > 0
    BLEZ    rs, dest        ; branch if rs ≤ 0
    BLTZ    rs, dest        ; branch if rs < 0

The branch instructions have a reach of ±128KB.

To help prepare for one of the above branch instructions, you can use one of these instructions:

    ; "set if less than"
    SLT     rd, rs, rt      ; rd = (( int32_t)rs < ( int32_t)rt)    ? 1 : 0
    SLTI    rd, rs, imm16   ; rd = (( int32_t)rs < ( int16_t)imm16) ? 1 : 0
    SLTU    rd, rs, rt      ; rd = ((uint32_t)rs < (uint32_t)rt)    ? 1 : 0
    SLTIU   rd, rs, imm16   ; rd = ((uint32_t)rs < (uint32_t)(int16_t)imm16) ? 1 : 0

The SLT family of instructions compare two values and set the destination register to 1 if the first comparand is less than the second; otherwise it sets the destination register to 0. The I versions compare against a signed 16-bit immediate value rather than a register, and the U versions use an unsigned comparison instead of signed. Note that SLTIU sign-extends the immediate from a 16-bit value to a 32-bit value, but the comparison is performed as an unsigned value. No arithmetic exceptions are raised by these instructions.

The assembler provides several pseudo-instructions for other types of relative branches, often using one of the SLT instructions to build the branch condition in the at register. Here are a few of them:

    BEQ  zero, zero, dest   ; B dest
                            ; branch unconditional

    BEQ  rs, zero, dest     ; BEQZ rs, dest
                            ; branch if rs = 0

    BNE  rs, zero, dest     ; BNEZ rs, dest
                            ; branch if rs ≠ 0

    LI   at, imm32          ; BEQ rs, imm32, dest
    BEQ  rs, at, dest       ; branch if rs = imm32

    LI   at, imm32          ; BNE rs, imm32, dest
    BNE  rs, at, dest       ; branch if rs ≠ imm32

    SLT  at, rs, rt         ; BLT rs, rt, dest
    BNEZ at, dest           ; branch if rs < rt

    SLTI at, rs, imm16      ; BLT rs, imm16, dest
    BNEZ at, dest           ; branch if rs < imm16

    LI   at, imm32          ; BLT rs, imm32, dest
    SLT  at, rs, at         ;
    BNEZ at, dest           ; branch if rs < imm32

    SLT  at, rs, rt         ; BGE rs, rt, dest
    BEQZ at, dest           ; branch if rs ≥ rt

    SLTU at, rs, rt         ; BGEU rs, rt, dest
    BEQZ at, dest           ; branch if rs ≥ rt (unsigned)

And so on. You get the idea.

The next batch of control transfer instructions are those which perform relative branches and store the return address in the ra register. The return address is the instruction after the branch delay slot.

    ; all comparisons are signed
    BGEZAL  rs, dest        ; branch if rs ≥ 0 and link
    BLTZAL  rs, dest        ; branch if rs < 0 and link

    ; pseudo-instruction
    BGEZAL  zero, dest      ; BAL dest
                            ; branch unconditional and link

The BAL pseudo-instruction is an unconditional branch and link, implemented by encoding a conditional branch and link where the condition is always true (namely 0 ≥ 0).

The next batch is the "likely" branch instructions. These not only include a hint that the branch should be predicted taken, but they also have the extra weirdness that the instruction in the branch delay slot is ignored if the branch is not taken!

Note carefully: The instruction in the branch delay slot is ignored if the branch is not taken. If the branch is taken, then the instruction in the branch delay slot executes normally.²

    ; all comparisons are signed
    BEQL    rs, dest        ; branch if rs = 0, likely
    BNEL    rs, dest        ; branch if rs ≠ 0, likely
    BGEZL   rs, dest        ; branch if rs ≥ 0, likely
    BGTZL   rs, dest        ; branch if rs > 0, likely
    BLEZL   rs, dest        ; branch if rs ≤ 0, likely
    BLTZL   rs, dest        ; branch if rs < 0, likely

    BGEZALL rs, dest        ; branch if rs ≥ 0 and link, likely
    BLTZALL rs, dest        ; branch if rs < 0 and link, likely

The MIPS people presumably reconsidered these instructions, because later versions of the architecture mark them as deprecated.

The last group are the "jump" instructions.

    J       dest            ; jump
    JAL     dest            ; jump and link
                            ; return address stored in ra

    JR      rs              ; jump register
    JALR    rd, rs          ; jump and link register
                            ; return address stored in rd

The first two instructions encode an absolute jump, sort of. The J and JAL instructions have room to express the lower 28 bits of the absolute jump target. The upper four bits of the jump target are copied from the program counter of the branch delay slot. (This is almost always the same as the program counter of the jump instruction itself; it makes a difference only if the jump instruction and the branch delay slot are on opposite sides of a 256MB boundary.)

For example, suppose the program counter is at 0x12345678. This means that the jump instruction can jump to any address in the range 0x10000000 through 0x1FFFFFFC.

This partitioning of the jump target space into 256MB regions means that in practice, a DLL cannot exceed 256MB, and a DLL cannot be relocated so that it straddles a 256MB boundary, because it would be impossible to fix up the jumps that cross the boundary.

The jump register instructions use a register to specify a jump target. Unlike the absolute jump instructions, the jump register instructions can jump to any 32-bit address.

The JALR instruction is the only control transfer instruction that lets you pick the register to receive the return address. In practice, you always pick ra, but the possibility is nevertheless available to pick something else, in case you're doing something wacky.

Okay, now back to branch delay slots.

One rule about branch delay slots is that you cannot put another branch instruction in a branch delay slot. Because that would be Inception-level crazy.

Another rule about branch delay slots is that if an exception occurs while executing the instruction in the branch delay slot, and the kernel decides to resume execution after fixing the problem, execution will resume at the preceding branch instruction.

This is obvious in retrospect, because if execution resumed at the branch delay slot, well, there's no branch instruction active when execution resumes, so execution will fall through, which is bad if the original exception had occurred when executing the instruction in the branch delay slot for a taken branch. Resuming from the instruction that raised the exception would cause the taken branch to become not-taken!

Therefore, the kernel backs up to the branch instruction and resumes execution there. Branch instructions cannot fault, and they modify at most ra; in particular, the register being tested by a conditional branch did not change, so the resumed execution will take or not-take the branch in the same way as the original execution, and the instruction in the branch delay slot will get another chance to execute.

Well, I sort of lied when I said that "the register being tested by a conditional branch did not change": If the register being tested is ra itself, then the branch instruction will indeed modify the register that controls the conditional branch! (Similarly if you write JALR r, r.)

So let's just say that it's not allowed by convention. A branch instruction cannot be conditional on ra if it also modifies ra, and JALR r, r is also not allowed by convention.³

These forbidden instructions are merely software conventions. The processor will gladly let you do these disallowed things, but if you try it, and you take an exception on the instruction in the branch delay slot, then your program will act all wacky when execution is resumed, and you got what you deserved.

Oh yeah, you also shouldn't put a multi-instruction pseudo-instruction in the branch delay slot, because only the first instruction will execute as part of the branch delay slot. I never tried it, but I hope the assembler warns you when this happens.

Branch delay slots are a place you are likely to see NOP instructions. If the compiler can't find anything to put in the branch delay slot, it will just dump a boring NOP in there.

Next time, we'll look at some of the crazy things you can do with branch delay slots. I don't know if any compilers take advantage of them, but they are technically legal.

¹ I'm led to believe that this problem actually occurred. The original MIPS processor was single-issue with a two-stage pipeline. Later versions deepened the pipeline and added multi-issue, which means that a single branch delay slot is not sufficient to avoid a pipeline bubble. So they had to add branch prediction circuitry anyway.

² The story as I heared it is that the MIPS folks noticed that lots of people were putting NOP instructions in the branch delay slot because they couldn't find anything useful to go in there. So the MIPS people added the "likely" version of the branch instructions which allows you to front-load the first instruction of the jump target in the branch delay slot. If the branch is not taken, the instruction in the branch delay slot is ignored, which means that you don't have to worry about the front-loaded instruction interfering with the fall-through code path. Though they ended up deprecating the instructions, so who knows what really happened. Maybe an intrepid reader can dig up a design document.

³ Furthermore, the Windows NT calling convention requires that the return address be passed in ra, so you can cancel your dreams of using JALR on Windows NT with a nonstandard return address register.

Bonus chatter: Intrepid reader laonianren dug up some usenet articles that fill in the background.

The branch-likely instructions were added "only to make it easier to populate the branch delay slots in loops": You can put the first instruction of the loop body in the delay slot, and branch to the second instruction of the loop body. If the branch is not taken, then the instruction in the branch delay slot is converted to a NOP. (It still consumes a cycle, but nothing happens during that cycle.)

The branch-likely instructions were deprecated because "conditionally squashing the result of the delay-slot instruction is a pain in the neck." Processors are required to implement the instructions, but they are not required to implement them efficiently.

Comments (16)

  1. Falcon says:

    I guess the use of delay slots allowed them to… DELAY the introduction of branch prediction logic!

  2. VinDuv says:

    What happens if an interrupt occurs during the execution of a branch delay slot? Does the kernel also has to rewind execution in that case?

    1. Yes. Anything that requires resumption must back up to the branch instruction in order to resume correctly.

  3. laonianren says:

    [1] claims that the “likely” branches offered marginal gains and became harder to implement with high-performance (presumably longer) pipelines. They were deprecated so CPU implementations could choose correct but slow.

    This is not an official document but Tim Olson does seem to be an actual CPU designer.

    [1]: https://compilers.iecc.com/comparch/article/03-09-094

  4. Louis says:

    I don’t understand how the “likely” instructions work. Isn’t it just as complex to “not execute the branch delay slot if the branch is not taken” as it is to “not execute the branch delay slot if the branch is taken” (i.e. regular branching with no branch delay slot)?

    As I typed this I’ve come up with a guess: it’s sort of like a branch predictor where the prediction is always “taken” no matter what, and the instruction in the BDS is backed out if the branch is not taken. The only other possibility I can come up with is that the processor stalls until it knows whether or not to take the branch, but that doesn’t buy anything over a regular branch followed by a NOP.

    1. Joshua says:

      The instruction is bubbled if the branch wasn’t taken. In an in-order processor this will never cause a further stall so it is faster than a NOP in the branch-taken case and no slower in the branch-not-taken case.

    2. Fabian Giesen says:

      The two cases indeed need pretty much the same mechanisms (conditionally nullifying in-flight instructions). But the branch-likely thing saves you a cycle (or, more specifically, lets you put an instruction into a cycle that would otherwise be wasted).

      The whole thing about branch delay slots in general is that by the time the branch actually executes, more instructions have entered the pipeline and might have to get cancelled. If you do cancel them on a taken branch, that means the machine is going to be idle for the next few cycles after a branch. The idea with branch delay slots and branch-likely both is to get some value out of the instructions you’ve already fetched, but the actual use cases are somewhat complementary.

      Into a regular branch delay slot, you can only schedule instructions that you _always_ want to execute. What this usually boils down to is trying to find one instruction from the basic block preceding the branch that the branch itself doesn’t depend on, and moving it into the branch delay slot. That’s great if you have such an instruction, but often you don’t.

      The branch-likely thing allows you to grab an instruction from the basic block you’re branching to. Unless that instruction happens to be a branch itself, you can just grab the first instruction of that block (and then branch straight to the second instruction).

      So having both branch delay slots and the branch-likely thing increases your pool of instructions you can pick from, and reduce the number of wasted cycles on a taken branch.

      All this is nice if your pipeline is short enough that a taken branch only induces one extra cycle of latency (as was the case for the earliest MIPS chips), but by the time of the R4000 the pipeline had gotten longer, there were more cycles to fill, and it had become painfully obvious that clever hacks assuming a particular pipeline and latency were going to be liabilities in the long run. (Both the delay slot and the branch-likely hack are trouble if your pipeline doesn’t work exactly the same way as the pipeline of the machine they were originally designed for.)

      1. Joshua says:

        I found the best instruction to put there was usually the loop incr instruction because you rarely care after the loop what the value of the counter is.

  5. Joshua says:

    If you don’t want to deal with branch delay slots, always put a NOP after any kind of branch. Maybe the compiler will do that for you with -O0 (or whatever the equivalent was in Visual C++ at the time).

  6. Fabian Giesen says:

    The earliest MIPS implementations (MIPS I architecture, the original “Microprocessor without Interlocked Pipeline Stages”) had a five-stage pipeline (IF, ID, EX, MEM, WB) and, if I remember correctly, had some trickery to evaluate the branch conditions early specifically to make sure that only one cycle is “burned” after branches, and then they made that the branch delay slot. (They also had load delay slots, because the result of loads was known by the end of MEM, which finishes at the same time at the succeeding instruction’s EX cycle, i.e. too late.)

    By the R4000 (MIPS III architecture), there were (some) interlocks, load delay slots were gone, load latencies were another cycle longer, and branch delay slots still existed in the ISA but the actual delay was up to 3 cycles by then. (1 cycle covered by the branch delay slot, the next two were inserted speculatively and had to get nullified if the branch was found to be taken.) [Source: https://people.eecs.berkeley.edu/~kubitron/courses/cs252-S07/handouts/papers/R4000.pdf%5D

  7. Simon Clarkstone says:

    I have considered the possibility of a processor with a variable-length branch-delay slot to let programs be written with branch delay slots that could take advantage of future processors with longer pipelines: a jump would be split into an LJR instruction that loaded the jump-destination register (a specific special-purpose jump-register not a general-purpose register, and it must be in its “empty” state before loading), then another XJ instruction that executed the jump and invalidated the contents of the register (effectively the marker for the end of the branch-delay slot). When the instruction-fetcher got to an XJ instruction it would stall if necessary until the instruction-executor loaded the jump register.

    A subroutine that did no branches itself could load its return address into the jump register as its first action before doing its main work.

    I later noticed that the jump register can be thought of like a queue that could have 0 or 1 elements in it, which suggests the generalisation of a longer queue in later processors of the same ISA, for routines that can tell you what their next few jumps will be, e.g. code like
    “if i++ then p else q” will become something like:
    “””
    LOAD i
    LJR_NZ p // LJR conditional on value loaded being non-zero
    LJR_NZ r // LJR conditional on value loaded being non-zero
    LJR_Z q // LJR conditional on value loaded being zero
    INC // in the branch delay slot
    STORE i // in the branch delay slot
    XJ // to p or q
    p:
    … // p stuff
    XJ // always to r
    q:
    … // q stuff
    r:
    “””
    and the instruction fetching could be all the way to “r:” before the “LOAD i” had finished.
    Code like “{if b then p else q} ; r; {if b then s else t}; u” where p/q/r/s do no jumps can happily push 4 addresses onto a jump queue the moment that b is known.

    It’s not clear how this would interact with exceptions or interrupts, and it’s a bit crippled when you have subroutine call in either branch. For that matter, there’s probably not much point in mechanisms to let the instruction fetching get so far ahead of the instruction execution when delay slots are *already* hard to fill. Still a fun thing to think about.

    1. Erkin Alp Güney says:

      What makes you believe XJ finalise-jump would be accomplished in single cycle?

  8. Neil says:

    So whenever there’s an unexpected kernel transition, the kernel has to check whether the previous instruction is a branch so that it can resume it correctly? How does the kernel even know that the previous instruction exists? (I’m thinking you might have just jumped to the first instruction on a page and the previous page doesn’t exist.)

    1. Giedrius says:

      I’d guess processor never updates IP to branch delay slot. So if kernel transition happens, kernel still sees the original branch IP.

    2. MIPS is well-documented on the Internet, probably because many schools use it in their assembly language / computer architecture class. Short answer: The processor provides enough information for the kernel to figure out what happened. For example, the BD bit is set if the exception occurred in a branch delay slot.

  9. Pietro Gagliardi (andlabs) says:

    Could be worse: could be SuperH, where some instructions have branch delay and some don’t! :D (And in one case, you can even opt in to branch delays with the /s suffix.)

Skip to main content