The MIPS R4000 can perform multiplication and division in hardware,
but it does so in an unusual way, and this is where the temperamental
`HI` and `LO` registers enter the picture.

The `HI` and `LO` registers are 32-bit registers
which hold or accumulate the results of a multiplication or division.
You cannot operate on them directly.
They are set by a suitable arithmetic operation,
and by special instructions for moving values in and out.

The multiplication instructions treat
`HI` and `LO`
as a logical 64-bit register,
where
the high-order 32 bits are in the `HI` register
and the low-order 32 bits are in the `LO` register.

    MUL     rd, rs, rt      ; rd = rs * rt, corrupts HI and LO
    MULT    rs, rt          ; HI:LO = rs * rt (signed)
    MULTU   rs, rt          ; HI:LO = rs * rt (unsigned)

The simplest version is `MUL`, which multiplies two
32-bit registers and stores a 32-bit result into a general-purpose register.
As a side effect, it corrupts the `HI` and `LO` registers.
(This is the only multiplication or division operation that puts the result
in a general-purpose register instead of into
`HI` and `LO`.)

The `MULT` instruction multiplies two signed 32-bit values
to form a 64-bit result,
which it stores in `HI` and `LO`.

The `MULTU` instruction does the same thing,
but treats the factors as unsigned.
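To make the difference between the two concrete, here is a minimal Python model of `MULT` and `MULTU`. It is a sketch, not a description of the hardware: the helper names (`to_signed32`, `split_hi_lo`, `mult`, `multu`) are made up for illustration, and register values are represented as unsigned 32-bit integers.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def split_hi_lo(product):
    """Split a 64-bit product into (HI, LO), each an unsigned 32-bit value."""
    product &= MASK64
    return product >> 32, product & MASK32

def mult(rs, rt):
    """Model of MULT: signed 32x32 multiplication, 64-bit result in HI:LO."""
    return split_hi_lo(to_signed32(rs) * to_signed32(rt))

def multu(rs, rt):
    """Model of MULTU: unsigned 32x32 multiplication, 64-bit result in HI:LO."""
    return split_hi_lo((rs & MASK32) * (rt & MASK32))

# The same bit patterns, multiplied as signed and as unsigned values.
hi_s, lo_s = mult(0xFFFFFFFF, 2)    # (-1) * 2
hi_u, lo_u = multu(0xFFFFFFFF, 2)   # 4294967295 * 2
print(f"MULT : HI={hi_s:08X} LO={lo_s:08X}")
print(f"MULTU: HI={hi_u:08X} LO={lo_u:08X}")
```

In this model the two instructions agree on `LO` and differ only in `HI`, which is why a signed/unsigned distinction is needed only when you keep the full 64-bit product.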

The next group of multiplication instructions performs accumulation.

    MADD    rs, rt          ; HI:LO += rs * rt (signed)
    MADDU   rs, rt          ; HI:LO += rs * rt (unsigned)
    MSUB    rs, rt          ; HI:LO -= rs * rt (signed)
    MSUBU   rs, rt          ; HI:LO -= rs * rt (unsigned)

After performing the appropriate multiplication operation,
the 64-bit result is added to or subtracted from the value currently
in the `HI` and `LO` registers.

Note that the `U` suffix applies to the signed-ness
of the multiplication, not to whether the operation traps on
signed overflow during addition or subtraction.
None of the multiplication instructions trap.
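The accumulating instructions can be sketched the same way. The `HiLo` class and `dot_product` function below are hypothetical names for illustration, and the model assumes that the 64-bit accumulator simply wraps around modulo 2⁶⁴ on overflow, which is consistent with the instructions never trapping.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

class HiLo:
    """Model of the HI:LO pair as one logical 64-bit accumulator."""

    def __init__(self, hi=0, lo=0):
        self.hi = hi & MASK32
        self.lo = lo & MASK32

    def _accumulate(self, delta):
        # Assumption: overflow wraps modulo 2**64 and never traps.
        total = (((self.hi << 32) | self.lo) + delta) & MASK64
        self.hi, self.lo = total >> 32, total & MASK32

    def madd(self, rs, rt):
        self._accumulate(to_signed32(rs) * to_signed32(rt))

    def maddu(self, rs, rt):
        self._accumulate((rs & MASK32) * (rt & MASK32))

    def msub(self, rs, rt):
        self._accumulate(-(to_signed32(rs) * to_signed32(rt)))

    def msubu(self, rs, rt):
        self._accumulate(-((rs & MASK32) * (rt & MASK32)))

def dot_product(xs, ys):
    """Accumulate a signed dot product in HI:LO, starting from zero."""
    acc = HiLo()                    # stands in for zeroing HI and LO first
    for x, y in zip(xs, ys):
        acc.madd(x, y)
    return acc.hi, acc.lo

hi, lo = dot_product([1, -2, 3], [4, 5, -6])   # 4 - 10 - 18 = -24
print(f"HI={hi:08X} LO={lo:08X}")
```

A multiply-accumulate loop like `dot_product` is the typical use: the running total stays in `HI` and `LO` and is read out only once at the end.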

The operation runs faster if you put the smaller factor in `rt`,
so if you know (or suspect) that one of the values is smaller than the
other, you can try to arrange for the smaller number to be in `rt`.

You might think that the division operations take a 64-bit value
in `HI` and `LO` and divide it by a 32-bit register.
But you'd be wrong.
They divide a 32-bit value by another 32-bit value and store the
quotient and remainder in `HI` and `LO`.

    DIV     rs, rt          ; LO = rs / rt, HI = rs % rt (signed)
    DIVU    rs, rt          ; LO = rs / rt, HI = rs % rt (unsigned)

None of the division operations trap,
not even for overflow or divide-by-zero.
If you divide by zero or incur division overflow, the results in
`HI` and `LO` are garbage.
If you care about overflow or division by zero,
you need to check for it explicitly.
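Here is a rough Python sketch of what "checking explicitly" looks like for the signed case. The function name `checked_div` is made up, and the model assumes the quotient truncates toward zero so that the remainder takes the sign of the dividend, as in C. Python's own `//` rounds toward negative infinity, so the sketch divides magnitudes and fixes up the sign afterward.

```python
MASK32 = 0xFFFFFFFF
INT32_MIN = -(1 << 31)

def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def checked_div(rs, rt):
    """Signed 32-bit division with the checks the hardware does not make.

    Returns (LO, HI) = (quotient, remainder) as unsigned 32-bit values.
    """
    dividend, divisor = to_signed32(rs), to_signed32(rt)
    if divisor == 0:
        raise ZeroDivisionError("HI and LO would be garbage")
    if dividend == INT32_MIN and divisor == -1:
        raise OverflowError("quotient does not fit in 32 bits")
    # Assumption: truncate toward zero, so divide the magnitudes and
    # restore the sign instead of using Python's floor division directly.
    quotient = abs(dividend) // abs(divisor)
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient
    remainder = dividend - quotient * divisor
    return quotient & MASK32, remainder & MASK32

lo, hi = checked_div(-7, 2)
print(f"LO={lo:08X} HI={hi:08X}")   # quotient -3, remainder -1
```

The unsigned case needs only the zero check, since an unsigned quotient always fits in 32 bits.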

Okay, that's great.
We've done some calculations and put the results into
`HI` and `LO`.
But how do we get the answer out?
(And how do you put the initial values in, if you are using
`MADD` or `MSUB`?)

    MFHI    rd              ; rd = HI "move from HI"
    MFLO    rd              ; rd = LO "move from LO"
    MTHI    rs              ; HI = rs "move to HI"
    MTLO    rs              ; LO = rs "move to LO"

The multiplication and division operations take some time to execute,¹ and if you try to read the results too soon, you will stall until the results are available. Therefore, it's best to distract yourself with some other operations while waiting for the multiplication or division operation to do its thing. (For example, you might check if you need to raise a runtime exception because you just asked the processor to divide by zero.)

The temperamental part of the
`HI` and `LO` registers is in how you read
the values out.

Tricky rule number one:
Once you perform a `MTHI` or `MTLO` instruction,
*both* of the previous values in
`HI` and `LO` are lost.
That means you can't do this:

    MULT    r1, r2          ; HI:LO = r1 * r2 (signed)
    ... stuff that doesn't involve HI or LO ...
    MTHI    r3              ; HI = r3
    ... stuff that doesn't involve HI or LO ...
    MFLO    r4              ; r4 = GARBAGE

You might naïvely think that the `MTHI` replaces
the value in the `HI` register and leaves the
`LO` register alone,
but since this is the first write to either of the
`HI` or `LO` registers since the last
multiplication or division operation,
*both* registers are lost, and your attempt to fetch
the value of `LO` will return garbage.

Note that this applies only to the first write to `HI`
or `LO`.
The second write behaves as you would expect.
For example,
if you perform `MTHI` followed by `MTLO`,
the `MTHI` will set `HI` and corrupt `LO`,
but the `MTLO` will set `LO` and leave
`HI` alone.
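One way to keep tricky rule number one straight is to think of it as a tiny state machine. The Python sketch below is only a mental model with made-up names (`HiLoModel`, `arithmetic_result`), not a description of the actual circuitry. It uses `None` to stand for "garbage".

```python
MASK32 = 0xFFFFFFFF

class HiLoModel:
    """Tracks which of HI and LO currently hold a defined value."""

    def __init__(self):
        self.hi = None              # None stands for "garbage"
        self.lo = None
        self.fresh_result = False   # no MTHI/MTLO yet since the last mul/div

    def arithmetic_result(self, hi, lo):
        """A multiplication or division has deposited its result."""
        self.hi, self.lo = hi & MASK32, lo & MASK32
        self.fresh_result = True

    def mthi(self, value):
        if self.fresh_result:
            self.lo = None          # first write since the operation: LO is lost
            self.fresh_result = False
        self.hi = value & MASK32

    def mtlo(self, value):
        if self.fresh_result:
            self.hi = None          # first write since the operation: HI is lost
            self.fresh_result = False
        self.lo = value & MASK32

regs = HiLoModel()
regs.arithmetic_result(hi=0, lo=6)  # e.g. MULT of 2 and 3
regs.mthi(5)
print(regs.hi, regs.lo)             # HI is 5, LO is garbage (None)
regs.mtlo(7)
print(regs.hi, regs.lo)             # second write leaves HI alone
```

In this model, the safe way to load both halves is simply to write both of them, in either order, before reading anything back.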

Tricky rule number two:
If you try to read a value from `HI` or `LO`,
you must wait two instructions before performing any operation
that writes to
`HI` or `LO`.
Otherwise, the reads will produce garbage.
The instruction that writes to
`HI` or `LO`
could be a multiplication or division operation, or it could be
`MTHI` or `MTLO`.

Tricky rule number two means that the following sequence is invalid:

    DIV     r1, r2          ; LO = r1 / r2, HI = r1 % r2 (signed)
    ... stuff that doesn't involve HI or LO ...
    MFHI    r3              ; r3 = GARBAGE (not r1 % r2)
    MULT    r4, r5          ; HI:LO = r4 * r5 (signed)

Since the `MULT` comes too soon after the
`MFHI`, the `MFHI` will put garbage
into `r3`.
You need to stick two instructions between the
`MFHI` and the `MULT` in order to avoid this.
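Tricky rule number two is mechanical enough that you could check for it with a small script. The sketch below is a hypothetical lint pass, not an existing tool: it takes a list of mnemonics for straight-line code and reports every `MFHI` or `MFLO` that is followed too closely by a write to `HI` or `LO`. It ignores branches and delay slots entirely, and it treats `MUL` as a write because `MUL` corrupts both registers.

```python
HILO_READS = {"MFHI", "MFLO"}
HILO_WRITES = {
    "MUL", "MULT", "MULTU",
    "MADD", "MADDU", "MSUB", "MSUBU",
    "DIV", "DIVU",
    "MTHI", "MTLO",
}
REQUIRED_GAP = 2   # instructions needed between the read and the next write

def find_read_write_hazards(mnemonics):
    """Return (read_index, write_index) pairs that violate the two-instruction rule.

    Only straight-line code is modeled; control flow is not followed.
    """
    ops = [op.upper() for op in mnemonics]
    hazards = []
    for i, op in enumerate(ops):
        if op not in HILO_READS:
            continue
        # A write in either of the next two slots comes too soon.
        for j in range(i + 1, min(i + 1 + REQUIRED_GAP, len(ops))):
            if ops[j] in HILO_WRITES:
                hazards.append((i, j))
    return hazards

bad = ["DIV", "ADDU", "MFHI", "MULT"]
good = ["DIV", "ADDU", "MFHI", "NOP", "NOP", "MULT"]
print(find_read_write_hazards(bad))    # reports the MFHI/MULT pair
print(find_read_write_hazards(good))   # nothing to report
```

In practice you would fill the two slots with useful work rather than `NOP`s, but the `NOP` version shows the minimum spacing.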

(Tricky rule number two was removed in the R8000.
On the R8000, if you perform a multiplication or division or
`MTxx` too soon after a `MFxx`,
the processor will stall until the danger window has passed.)

Okay, next time we'll look at constants.

¹ Wikipedia says that latency of 32-bit multiplication was 10 cycles, and latency of 32-bit division was a whopping 69 cycles.

In MIPS 4400 compilers, after MULT or MULTU operations, compilers would place a NOOP always, I guess because it was so slow and maybe needed two cycles. Not sure if this was required on MIPS 4000 here.

MIPS originally meant “Microprocessor without interlocked pipeline stages” though it stopped meaning that at some point when later MIPS chips had interlocks.

In the original MIPS there were no interlocks. If you had an instruction that depended on the results of a previous arithmetic instruction then forwarding made it work. For loads that couldn’t work so the assembler inserted a NOP. It sounds like it was the same on multiplies.

I was surprised by the separate signed and unsigned multiplication instructions because they are the same operation in 2’s complement arithmetic. I later realised that that only applies when the output of your multiplication is the same size as both of its inputs.

This shows up in other architectures too. For example, x86 effectively has three forms of register-times-register multiplication (which, somewhat confusingly, are mapped onto only two different mnemonics, even though they use different opcodes): MUL (no operand) and IMUL (no operand) are unsigned/signed long multiplication respectively (result twice the size of either operand), while “IMUL dst, src” is “dst *= src” with the result the same size as the operands. There is no “MUL dst, src”, and there doesn’t need to be, because the low half of the full (double-width) product is identical for signed and unsigned two’s complement multiplies.

“If you try to read a value from HI or LO, you must wait two instructions before performing any operation that writes to HI or LO. Otherwise, the reads will produce garbage.”

That’s the craziest thing I have ever heard. I suppose it isn’t obscure to people who wrote code for this architecture, though. Glad the “return garbage” thing was removed for the R8000.

“That’s the craziest thing I have ever heard.”

MIPS was originally designed to make the pipelines completely visible to the programmer and make it their problem. By the R4000 they had relaxed that a little bit, but not entirely.

coprocessors and the multiply/divide aren’t coupled to the cpu pipeline & when an exception occurs it can roll back the cpu but any upcoming coprocessor or multiply/divide will still be executed and isn’t rolled back. This becomes a real pain if the instructions aren’t re-runnable (multiply and divide are, but the PS1 GTE is not as many instructions use the same source and destination registers).

The LR33300 used as a basis for the PS1 CPU is slightly different to the R3000, it’s both crazy and beautiful.

Yeah, my thoughts exactly. WHY?! Just… WHY???

These kinds of internal constraints exist in almost all chips – in this case, likely because the HI/LO registers live off the main integer datapath and don’t have full bypassing, unlike regular integer registers. The other option would be for the hardware to stall there, but this doesn’t happen automatically! You need to add logic to determine whether you should stall, you need to tell upstream parts of the pipeline not to deliver new instructions for the next 2 cycles (a long wire which is often a timing bottleneck!), you need to delay any instructions that might have already been delivered for those 2 cycles (“skid buffers”, etc.) The R4000 went for the much simpler (from the HW point of view) solution of requiring the compiler (and any assembly-language programmers) to work around it in SW. This keeps the HW straightforward, but long-term, detritus from random implementation details in several chip generations starts accruing in your ISA, and all future HW you build (even if it works quite differently) is stuck with it. MIPS had a few such artifacts, such as the load delay slot (eliminated in the MIPS II architecture revision) and branch delay slots (still present in the R4000, and probably going to be mentioned in one of the forthcoming posts in this series.)

Newer processors tend to avoid exposing such implementation details (outside of more specialized HW like DSPs, where everything goes). HW vendors by now have plenty of experience with the pains that shortcuts in the initial implementations cause. :)

It’s interesting what choices can be made. Keep the hardware very simple and fast (which was pretty much the RISC philosophy), vs. make the hardware do a lot of stuff so the programmer doesn’t have to think about it.

I believe the RISC philosophy was that if you could make the hardware simple enough, it would be so much faster that the same end goals would be accomplished in less time. Since those years, with advances in all areas of chip design and also in software and compiler design, the same tradeoffs might not be considered good tradeoffs these days.

If and when programmers keep making the same mistakes due to obscure things in the hardware, where do you put the fix, in hardware or software? Could go either way…

The issues with HI and LO are pipeline hazards, so the behavior depends on the microarchitecture of the specific model you’re using, and also on what else is going on. So most of the “will return garbage” should be read as “may return garbage”… in other words, will work perfectly well most of the time and occasionally result in nonrepeatable weird behavior.

In the case of the two-cycle delay, the issue arises because the multiply/divide unit state isn’t reset on exception; so if you MFLO and then start another multiply and then also get an interrupt, the new multiply isn’t aborted and restarting the MFLO after the exception is handled will get the result from the new multiply. (Likely the exception handler will save and restore HI and LO rather than leaving them untouched, but that produces the same net result.)

Also, on DIV/DIVU: the raw hardware instruction does not check for zero, but the user-visible DIV and DIVU are assembler-level synthetic instructions that insert a zero check (and for DIV, an overflow check) which uses the BREAK instruction to trap. You can issue the raw instruction from the assembler, but you have to specifically write “div $0, rs, rt” to get it; otherwise you get the checks.

“you might check if you need to raise a runtime exception because you just asked the processor to divide by zero”

soo… first divide, then check for zero? that’s different..

Well, if the alternative is stalling while waiting for the division to complete, you might as well use (some of) that time productively.

If you think about it, that’s the same behavior as when you’re relying on the CPU to raise a divide by zero exception.