The MIPS R4000, part 1: Introduction

Continuing in the "Raymond introduces you to a CPU architecture that Windows once supported but no longer does" sort-of series, here we go with the MIPS R4000.

The MIPS R4000 implements the MIPS III architecture. It is a 64-bit processor, but Windows NT used it in 32-bit mode. I'll be focusing on the aspects of the processor relevant to debugging user-mode programs on Windows NT. This means that I may skip over various technical details on the assumption that the compiler knows what the rules are and won't (intentionally) generate code that violates them.

Throughout, I will say "MIPS" instead of "MIPS III architecture". Some of the issues do not apply to later versions of the architecture family, but I am focusing on MIPS III since that's what Windows NT used.

The MIPS is a RISC-style load-store processor: The only operations you can perform with memory are load and store. There is no "add value to memory" instruction, for example. Each instruction is 32 bits wide, and the program counter must be on an exact multiple of 4.
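To make the load-store idea concrete, here is a toy sketch in Python (not MIPS code). The `memory` and `regs` dictionaries and all the function names are hypothetical, invented for illustration. The point is only that "add a value to memory" has to be spelled as three separate steps, and that each instruction advances the program counter by 4.

```python
# Toy model of a load-store machine. All names here are hypothetical.
memory = {0x1000: 40}            # one 32-bit word of "memory"
regs = {"t0": 0, "t1": 2}        # two registers
INSTRUCTION_SIZE = 4             # every instruction is 32 bits wide

def is_valid_pc(pc):
    """The program counter must be an exact multiple of 4."""
    return pc % INSTRUCTION_SIZE == 0

def load_word(dst, addr):
    """Load: copy a word from memory into a register."""
    regs[dst] = memory[addr]

def store_word(src, addr):
    """Store: copy a word from a register into memory."""
    memory[addr] = regs[src]

def add(dst, a, b):
    """Arithmetic touches registers only, never memory."""
    regs[dst] = (regs[a] + regs[b]) & 0xFFFFFFFF

# "memory[0x1000] += t1" takes three instructions, so the pc moves by 12.
pc = 0x00400000
for step in (lambda: load_word("t0", 0x1000),
             lambda: add("t0", "t0", "t1"),
             lambda: store_word("t0", 0x1000)):
    step()
    pc += INSTRUCTION_SIZE
```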

The processor can operate in either little-endian or big-endian mode; Windows NT uses little-endian mode, and even though some instructions change behavior depending on whether the processor is in big-endian or little-endian mode, I will discuss only the little-endian case.

The architectural terminology for a 32-bit value is a word (w), and a 16-bit value is a halfword (h). There's also doubleword (d) for 64-bit values, but we won't see it here because we are focusing on the 32-bit mode of the processor.
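If the byte order and the size names feel abstract, this small Python sketch shows how a word and a halfword are laid out in memory in each mode. It uses only the standard `struct` module; the sample values are arbitrary.

```python
import struct

word = 0x12345678       # a 32-bit word (w)
halfword = 0xABCD       # a 16-bit halfword (h)

# "<" means little-endian (what Windows NT uses), ">" means big-endian.
word_le = struct.pack("<I", word)
word_be = struct.pack(">I", word)
half_le = struct.pack("<H", halfword)

print(word_le.hex())    # 78563412: least significant byte comes first
print(word_be.hex())    # 12345678: most significant byte comes first
print(half_le.hex())    # cdab
```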

The MIPS has 32 general-purpose integer registers, formally known as registers $0 through $31, but which conventionally go by these names:

Register  Mnemonic  Meaning              Preserved?  Notes
$0        zero      reads as zero        Immutable   Writes are ignored
$1        at        assembler temporary  Volatile    Helper for synthesized instructions
$2        v0        value                No          On function exit, contains the return value
$3        v1        value                No          High 32 bits of return value (for 64-bit values)
$4…$7     a0…a3     argument             No          On function entry, contains function parameters
$8…$15    t0…t7     temporary            No
$16…$23   s0…s7     saved                Yes
$24…$25   t8…t9     temporary            No
$26…$27   k0…k1     kernel               No access   Reserved for kernel use
$28       gp        global pointer       Yes         Not used by 32-bit code
$29       sp        stack pointer        Yes
$30       s8        frame pointer        Yes         For functions with variable-sized stacks
$31       ra        return address       Maybe

The zero register reads as zero, and writes to it are ignored.
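Here is one way to picture that behavior, as a toy register file in Python. The `RegisterFile` class is hypothetical and models only the rule just described; it is not how any real emulator is necessarily written.

```python
class RegisterFile:
    """Toy model of 32 integer registers, viewed in 32-bit mode."""

    def __init__(self):
        self._regs = [0] * 32

    def read(self, number):
        # Register 0 is never written, so it always reads back as zero.
        return self._regs[number]

    def write(self, number, value):
        if number == 0:
            return                      # writes to $0 are silently ignored
        self._regs[number] = value & 0xFFFFFFFF

rf = RegisterFile()
rf.write(0, 123)    # goes nowhere
rf.write(8, 123)    # t0 = 123
```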

The k0 and k1 registers are reserved for kernel use, and no well-written user-mode program will use them.¹

Win32 requires that the sp and s8 registers be used for their stated purpose throughout the entire function. If a function does not have a variable-sized stack frame, then it can use s8 for any purpose (which is why the disassembler calls it s8 instead of fp, I guess). And since 32-bit code doesn't ascribe special meaning to gp, it too can be used for any purpose, provided its value is preserved across the call. In practice, the Microsoft compiler simply avoids the gp register completely, and it uses the s8 register only as a frame pointer.

The stack is always aligned on an 8-byte boundary, and there is no red zone.
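As a quick illustration of what 8-byte alignment means for a function's frame, here is a small Python sketch. The helper names are hypothetical, and it assumes the usual convention that the stack grows toward lower addresses, which the text above does not spell out.

```python
STACK_ALIGNMENT = 8

def frame_size(local_bytes):
    """Round a frame size up to the next multiple of 8."""
    return (local_bytes + STACK_ALIGNMENT - 1) & ~(STACK_ALIGNMENT - 1)

def push_frame(sp, local_bytes):
    """Return the new sp after allocating a frame (assumes a downward-growing stack)."""
    if sp % STACK_ALIGNMENT != 0:
        raise ValueError("incoming sp is not 8-byte aligned")
    return sp - frame_size(local_bytes)

# A function that needs 20 bytes of locals ends up with a 24-byte frame.
new_sp = push_frame(0x7FFF0000, 20)
```

Since there is no red zone, nothing below the new sp can be relied upon to keep its value.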

Some registers have stated purposes only at entry to a function or exit from a function. When not at the function boundary, those registers may be used for any purpose.

Registers marked "Yes" in the "Preserved?" column must be preserved across the call; those marked "No" need not be.

The ra register is marked "Maybe" because you don't normally need to preserve it. However, if you are a leaf function that does not modify any preserved registers (not even sp), then you can skip the generation of unwind codes for the leaf function, but you must keep the return address in ra for the duration of your function so that the operating system can unwind out of the function should an exception occur. (Special rules for lightweight leaf functions also exist for Itanium, Alpha AXP, and x64.)

The at register is volatile because the assembler can use it for various invisible purposes, primarily for synthesizing missing instructions. We'll see examples of this as we go.

There are also two special-purpose integer registers, called HI and LO. These are used by multiplication and division instructions, and we'll cover them when we get to multiplication and division.

There are 32 single-precision (32-bit) floating point registers, which can be paired up to form 16 double-precision (64-bit) floating point registers. When a pair is used to operate on a single-precision value, the lower-numbered register holds the value, and the higher-numbered register is not used. (Indeed, the value in the higher-numbered register will be garbage.) So I guess you really have just 16 single-precision floating point registers, since the odd-numbered ones are basically useless.

Register(s)            Meaning              Preserved?  Notes
$f0/$f1                return value         No
$f2/$f3                second return value  No          For imaginary component of complex number.
$f4/$f5 … $f10/$f11    temporary            No
$f12/$f13 … $f14/$f15  arguments            No
$f16/$f17 … $f18/$f19  temporary            No
$f20/$f21 … $f30/$f31  saved                Yes
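To picture how a register pair holds a double-precision value, this Python sketch splits the 64-bit pattern of a double into two 32-bit halves and puts it back together. The function names are invented for the example. Which half lands in the even-numbered register is an assumption here (low half in the even register), since the text above doesn't say.

```python
import struct

def split_double(value):
    """Split a double into (low, high) 32-bit halves of its bit pattern."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    low = bits & 0xFFFFFFFF      # assumed to live in the even register, e.g. $f0
    high = bits >> 32            # assumed to live in the odd register, e.g. $f1
    return low, high

def join_double(low, high):
    """Reassemble a double from its two 32-bit halves."""
    bits = (high << 32) | low
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

low, high = split_double(1.0)
print(hex(low), hex(high))       # 0x0 0x3ff00000
```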

Floating point support is optional. If not supported, floating point instructions will trap into the kernel, and the kernel is expected to emulate the instruction.

There is not a lot of floating point in typical systems programming, so I won't cover it except when discussing the calling convention later.

There is no flags register. Hopefully you don't find this weird any more, seeing as we already encountered this with the Alpha AXP.

The 32-bit address space is split down the middle between user-mode and kernel-mode. The kernel-mode space is further split: Half of the kernel-mode address space is dedicated to mapping physical addresses (the lowest 512MB² gets mapped twice, once cached and once uncached), leaving only 1GB for the operating system. This partitioning is architectural; you don't get a choice in the matter.
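Here is a rough sketch of that partitioning as a Python classifier. The boundary values are the standard MIPS 32-bit address map as I understand it, not something spelled out above, and the function names and labels are made up for the example.

```python
def classify_address(addr):
    """Say which region of the 32-bit address space an address falls in."""
    if not 0 <= addr <= 0xFFFFFFFF:
        raise ValueError("not a 32-bit address")
    if addr < 0x80000000:
        return "user"                        # lower 2GB
    if addr < 0xA0000000:
        return "kernel: physical, cached"    # 512MB window
    if addr < 0xC0000000:
        return "kernel: physical, uncached"  # same 512MB, mapped again
    return "kernel: mapped"                  # the remaining 1GB

def physical_address(addr):
    """For the two fixed windows, recover the physical address they map."""
    if not 0x80000000 <= addr < 0xC0000000:
        raise ValueError("address is not in a fixed physical window")
    return addr & 0x1FFFFFFF                 # keep the low 29 bits (512MB)
```

Note that `physical_address(0x80001000)` and `physical_address(0xA0001000)` both come out as `0x1000`: that is the "mapped twice" part.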

Okay, we'll begin next time by looking at 32-bit integer calculations.

¹ I know you're wondering what happens if poorly-written user-mode code tries to use them. The answer is that user-mode code can modify the registers all it wants, but the value read back may not be equal to the value last written. As far as user mode is concerned, each one is basically a black hole register that reads as garbage. This makes it even more useless than the zero register, which is a black hole register that at least reads as zero. (Internally, the registers are used by kernel mode as scratch variables during interrupt and exception handling.)

² I guess they figured that if you had more than 512MB of RAM, you'd have switched to a 64-bit operating system.

Comments (30)
  1. I look forward to your explaining the branch delay slot (and the branch if likely instructions if Windows used them). That will be “fun”!

    1. Joshua says:

      Maybe Raymond will explain what happens if you get interrupt in the branch delay slot. I got a dumb/wrong answer in grad school.

      1. If you get a trap (interrupt, TLB miss, whatever) in a delay slot, the processor rolls back to the beginning of the branch instruction, so afterwards the branch gets restarted. There’s a flag in one of the supervisor status registers that tells the kernel what happened if it needs to know. (Which it might, e.g. for emulation of floating point instructions.)

        1. Joshua says:

          Which nicely explains why a conditional branch in a delay slot is undefined despite appearing to work (my Google-fu has limits).

  2. kantos says:

    I presume the calculus for not doing 64bit support at the time was that the ram wasn’t available and the extra memory needed for the 64bit structures wasn’t worth it?

    1. Also, you probably don’t want to make developers (both Microsoft and ISV) do two really hard things to a code base at the same time: Port to a radically different architecture + port to a new pointer size.

      1. Fabian Giesen says:

        Rationale for why MIPS added 64-bit support when they did straight from the horse’s mouth:

      2. kantos says:

        The snarky programmer in me says “But if they had been writing standards compliant code in the first place!” but the realist knows that’s never the case in reality. Moreover I know the windows ABI changed between x86 and x64 as well.

        1. The x86 and x64 share a lot of architectural properties, like “atomic read-modify-write operations”, “extremely forgiving of misalignment” and “has an architectural stack pointer!” This means that nearly all of your porting issues are due to the pointer size, not the architecture behaving in fundamentally different ways from what you are used to.

          1. kantos says:

            In all honesty I don’t think intptr_t and uintptr_t even existed at the time this port would have been made anyway. I think the assumption was that long would manage that. I suspect the reason both exist in the first place is because of MS’ struggles with LONG when porting windows.

  3. Matteo Italia says:

    “Internally, the registers are used by kernel mode as scratch variables during interrupt and exception handling” I suppose that this means that the kernel had to be careful not to leak reserved information through these registers? Or they were just used for plumbing about the current process, so no reserved stuff is handled there at all?

    1. Aged .Net Guy says:

      I’m betting that concerns for co-executing malware and information leakage were not nearly so prevalent then as now.

      1. Fabian Giesen says:

        The thing k0/k1 were intended for was the TLB miss handler. (MIPS R4000 has software-managed TLBs; not sure if this is still true for current MIPS designs.)

        Since this is a TLB miss and thus more likely/frequent than a “true” page fault, performance was a concern, and burning 2 out of 32 registers on it was considered a worthwhile trade-off. Other interrupts and exceptions took a different route that was considered less critical, so saving a few registers, switching to a kernel stack, etc. was acceptable overhead there. Some notes and history here: – search for “UTLBMISS”.

        I’m not sure whether OS kernels at the time bothered with clearing k0/k1 before returning; they potentially contain information about physical addresses and contents of user-mode page table entries, which is now (a couple decades of ingenious exploits later…) unambiguously considered a security risk but might not have been then. Either way, clearing the two registers would add 2 instructions at the end of the handler, which could probably be smuggled into the already-present branch/coprocessor delay slots. (See the two NOPs at the end of the code fragment in the post I linked to.)

        1. Matteo Italia says:

          Really interesting, thank you for the explanation.

  4. Yukkuri says:

    Ah I love this series. It is very interesting how these details differ for different chips

  5. skSdnW says:

    I always enjoy these CPU series but at the end I’m always glad that x86 won.

  6. nathan_works says:

    Oh, this brings back memories of intro-to-arch classes and xspim (the unix mips emulator)..

  7. quiret says:

    A big personality at MSFT really did not want to optimize for the x86 architecture so he kept NT compatible with MIPS as a big reminder: NT gotta run on everything that computes :)

  8. GL1zdA says:

    I’ve read in “Inside Windows NT” that the MIPS was the reason why the address space is split this way on NT. But was NT compatibility the reason why Windows 95’s address space is split the same way (so that effectively Windows 95 memory partitioning was made MIPS compatible)?

  9. Euro Micelli says:

    So, when you said this week would be “boring topics”, that was obviously an April Fool’s joke.

  10. Johan Thelin says:

    I just have to say that I really love these ISA series. I actually spent a semester at university implementing something very similar to the MIPS architecture back in the early 2000s – the JAM CPU.

  11. DWalker07 says:

    It’s kind of cool to have a zero register that you can always read from.

    1. Erik F says:

      That’s also the place where you can return bits to the processor once you’ve used them. It’s really quite a useful register!

    2. James says:

      It’s also really useful as a prefetch hint but I can’t remember if MIPS supported that.

  12. Marvy says:

    I have no idea what you mean by a separate cached and uncached mapping

    1. ChrisR says:

      A mapping in this context is used by the processor to translate from physical addresses to virtual addresses. It also has some flags, one of which specifies whether accesses to the memory are cached or not. Having an uncached block of memory is useful if you want to give it to a device to perform DMA to/from. A cached block is more useful if you want to use the CPU to read/write the block.

      1. Marvy says:

        So “cached” is in the sense of “should I use the L2 cache”?
        Makes sense I guess. Thanks!

        1. ChrisR says:

          Correct, but the cached flag means any CPU cache, not just the L2 cache. Writing to non-cached memory with the CPU is much slower since every byte you read/write has to go out to the physical memory. That’s why you normally use cached mappings, and in a typical program you never really think about this because all your memory is always cached and set up by the OS. However, using cached memory makes using DMA more complex since you have to invalidate and/or flush the cache whenever you use the CPU to read/write the memory or you will get inconsistent results.

          I should have also specified that the mapping is for translating from virtual to physical addresses as well, which from a typical programming perspective is probably the more useful way to think about it.

          1. Marvy says:

            Thanks. I’m so used to cached memory it never occurred to me until I read your comment that non-cached is even an option!
            But it makes sense: if the CPU is not the only one using the memory, then having the CPU cache things is just asking for trouble.

