The Alpha AXP, part 14: On the strange behavior of writes to the zero register

I noted early on that a special rule in the Alpha AXP is that an instruction that specifies the zero register as a destination is permitted to be completely optimized out by the processor (with two exceptions, as noted).

You might wonder what the point of this rule is. I mean, if you want the processor to not execute an instruction, then just don't write it!

Well, you might need to encode a nop instruction for alignment purposes, say, when falling through to the top of a hot loop, so that the loop starts at the beginning of a cache line. By giving the processor wide latitude in deciding whether or not to ignore instructions which target the zero register, the architecture allows implementations to detect these instructions and simply remove them from the pipeline. Since it remains unspecified how many, if any, of the side effects of the instruction actually occur, the processor could pull it out of the pipeline at any stage of its execution, or before any of it executes at all.
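
To make the alignment case concrete, here is a minimal sketch of the arithmetic an assembler or compiler would do to decide how many nops to emit before a loop. The function name `nops_to_align` and the 32-byte default cache line are assumptions chosen for illustration; the only architectural fact it relies on is that Alpha AXP instructions are a fixed four bytes each.

```python
INSN_BYTES = 4  # Alpha AXP instructions are a fixed 4 bytes each


def nops_to_align(addr, line_bytes=32):
    """Return how many 4-byte nops to emit at addr so that the
    instruction after them starts on a cache-line boundary.

    line_bytes is an assumed cache line size, not a value taken
    from the article.
    """
    if addr % INSN_BYTES != 0:
        raise ValueError("instruction addresses must be 4-byte aligned")
    # Bytes remaining until the next multiple of line_bytes
    # (zero if addr is already on a boundary).
    pad_bytes = -addr % line_bytes
    return pad_bytes // INSN_BYTES
```

For example, with the assumed 32-byte line, falling through to a loop at address 0x1014 would need three nops so that the loop starts at 0x1020. Each of those nops is an instruction the processor is free to drop.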

The absence of a flags register means that for the vast majority of instructions, there is no way to detect whether the arithmetic operation actually executed. The only one I can think of offhand is that you could use the /V version of an arithmetic operation to request a trap on overflow, perform an operation that intentionally overflows, and then see if a trap gets raised when you perform the TRAPB. But even then, that only tells you that the processor got as far as the "raise an exception if an overflow was detected" step; you don't know whether it actually performed the calculation. (Though since the result is thrown away, it really doesn't matter either way.)

In practice, there's no need to detect this. You just write BIS zero, zero, zero for your nop instruction. If the processor optimizes it out, then great!
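
Here is a toy model of why you can't tell the difference. It is only a sketch: the `ToyRegisterFile` class, the `bis` helper, and the `elide_zero_dest` switch are hypothetical names invented for this illustration, and the model covers nothing beyond integer register state. It shows that BIS zero, zero, zero leaves the registers identical whether the modeled processor executes it or drops it.

```python
ZERO = 31               # R31 is the integer zero register
MASK64 = (1 << 64) - 1  # registers are 64 bits wide


class ToyRegisterFile:
    """Hypothetical model: 32 integer registers, R31 hardwired to zero."""

    def __init__(self):
        self._regs = [0] * 32

    def read(self, n):
        # Reading the zero register always produces zero.
        return 0 if n == ZERO else self._regs[n]

    def write(self, n, value):
        if n == ZERO:
            return  # writes to the zero register are discarded
        self._regs[n] = value & MASK64

    def snapshot(self):
        return [self.read(n) for n in range(32)]


def bis(regs, ra, rb, rc, elide_zero_dest=False):
    """Model BIS Ra, Rb, Rc (Rc = Ra | Rb).

    Returns True if the model executed the instruction, or False if
    it dropped it because the destination is the zero register.
    """
    if rc == ZERO and elide_zero_dest:
        return False
    regs.write(rc, regs.read(ra) | regs.read(rb))
    return True


def run_nop(elide):
    regs = ToyRegisterFile()
    regs.write(1, 0x1234)
    executed = bis(regs, ZERO, ZERO, ZERO, elide_zero_dest=elide)
    return executed, regs.snapshot()
```

Calling `run_nop(True)` and `run_nop(False)` yields different `executed` values but identical snapshots, which is the point: inside this model, nothing observable distinguishes the two behaviors.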

Exercise: What if you explicitly don't want your instruction to be optimized out? How could you express that?

Comments (20)
  1. Vilx- says:

    Well, as per the linked article, since the only instruction that is NEVER optimized out is the branching instruction, then I guess that the only way to create a non-optimized-out nop is to branch directly to the following instruction, using the zero-register as the return address receptacle. However I can also think of other nop-like constructs without side-effects that don’t involve the zero register (like copying a register to itself).

    1. Antonio Rodríguez says:

      x86’s “NOP” comes to mind. The x86 ISA doesn’t have a specific NOP opcode, so it’s tradition to use XCHG AX, AX (a valid instruction with no effects) as a NOP. The first processors in the line “executed” the instruction. Later ones treated this 1-byte opcode as a true NOP and ignored it.

      1. Peter Lund says:

        It certainly is treated as a real NOP in 64-bit mode, otherwise “XCHG EAX, EAX” would clear the upper 32 bits of RAX.

        There are semi-official encodings for multi-byte NOP instructions, as well. A long sequence of prefixes followed by 0x90 (NOP) seems to work well but it’s not the only option.

        1. Myria says:

          The x86 multi-byte nop instructions are official; they are listed in public Intel manuals as 0F 1F /0. The Intel manual’s description of “nop” also shows their preferred encodings of nop for 2 to 9 bytes. I assume that these are “recommended” because they are the byte patterns specifically recognized by the instruction decoder.

        2. Antonio Rodríguez says:

          IIRC, the first processor that treated it as a true NOP was the Pentium (or maybe the 80486?). Back then, x86 was a 16/32-bit hybrid architecture, where you had 32-bit registers and ALU, and the opcodes would affect the full 32 bits or only the lower 16 bits depending on the processor mode and the instruction prefix (if any). But in either case, “self-swapping” was an effective NOP, no matter whether you juggled with the lower 16 bits or the whole 32-bit register.

          I can’t find out when 0x90 became the official NOP. But when searching for it, I remembered Raymond already wrote about all of this (talking about NOPs in Windows 95, of course): .

  2. Joshua says:

    I think the JMP instruction is in a later part. I’d solve the exercise by JMP zero, +0.

  3. ZLB says:

    Well, I’d say just use another register. There are quite a few to play with!

    Other than that, a branch to next instruction would work, but I do wonder if there are side effects of a branch that could cause a stall even though it could never be mispredicted.

    1. Joshua says:

      My recall from college is that an unconditional jump doesn’t stall unless there is a cache miss.

  4. Peter Lund says:

    “Load into the zero register from memory” was my first thought but a quick look in the Alpha Architecture Reference Manual says that it can be optimized away as well.

    So, what about a dummy move of a normal register to itself? MOV rx, ry, which is really BIS rx, rx, ry?

    The Alpha had three standard NOP instructions: NOP/bis R31, R31, R31, UNOP/LDQ_U R31, 0(Rx), FNOP/CPYS F31, F31, F31. The AARM 4th ed doesn’t mention a NOP that won’t be optimized away but I dimly remember reading about such a NOP variant in the Alpha manuals twenty years ago.

    CPYS = Copy Sign for floating-point registers, btw.

  5. Someone says:

    “What if you explicitly don’t want your instruction to be optimized out?”

    Why would I? In the absence of flags, what side-effects such an instruction can have at all?
    Even if this would change the timing (of in-order execution) a little bit, you should not write anything that depends on this because it would be very very fragile.

    “XCHG EAX, EAX” would clear the upper 32 bits of RAX.”

    I thought ” AL, ” leaves all other bits of AX / EAX / RAX unchanged.

    1. Someone says:

      The blog software ate some parts: ” AL, ” should be “opcode AL, anything”

      1. Azarien says:

        Yes, but it’s different for EAX vs RAX. Most operations on EAX are zero-extended to RAX, modifying the upper 32 bits.

    2. smf says:

      “Why would I? In the absence of flags, what side-effects such an instruction can have at all?”


      1. Someone says:

        What time? The exercise was “What if you explicitly don’t want your instruction to be optimized out?” You want to be able to implement something where this additional time is important?

        1. smf says:

          The question was “In the absence of flags, what side-effects such an instruction can have at all?”
          The answer is the time taken to execute the instruction. For example if you need a delay between multiple io accesses.

          Discarding nops at run time is normally a pointless optimisation, you’re spending gates on removing something that would only be put there for a reason. Even a CPU with branch delays should always have an instruction that can fill the delay slot, although MIPS ultimately added branches that didn’t always execute the following instruction instead.

          1. Someone says:

            “For example if you need a delay between multiple io accesses.” If code depends on such timings, it will break as soon as the CPU or the delay of the mounted RAM modules or some other parameter in the system is different. Or you would create the same mess as the old DOS games which timed gameplay or sound or IO on CPU timing. I repeat: Why would someone want to consider the question of the exercise even a sensible one? You should just not care about a machine cycle or two at all.

          2. ChrisR says:

            When smf is talking about IO delays, they are probably talking about delays communicating with memory mapped IO devices, or a device on an SPI bus or something, where timing is important and getting it wrong (or having an instruction that is optimized out) will cause unreliable behavior. Of course you won’t usually need to worry about this when doing user space Win32 programming, but given that this series is about lower level concepts, it shouldn’t be a surprise that people will discuss things that are out of scope for regular Win32 programming.

          3. Fabian Giesen says:

            NOPs are commonly used to align the start address of hot loops, among other things. For fall-through, a few nops are generally preferable to an unconditional branch to the start of the loop. That makes them fairly common instructions, and many modern CPUs special-case them for that reason.

            At the other extreme, most in-order processors explicitly don’t discard NOPs. That’s because strategically inserted NOPs are a great way to work around CPU pipeline bugs :). Google “Cortex A53 Erratum 835769” or “SPARC LEON erratum NOP” for some fun, relatively recent examples.

  6. DWalker07 says:

    Will we get Raymond’s answer to the exercise?

    1. I would copy a nonzero register to itself.

Comments are closed.
