The Alpha AXP, part 4: Bit 15. Ugh. Bit 15.


Let's load each of the powers of two up to 2³¹, in canonical form.

    LDA  Ra, 0x0001(zero)   ;   Ra = 0x00000000`00000001 ; bit  0
    LDA  Ra, 0x0002(zero)   ;   Ra = 0x00000000`00000002 ; bit  1
    LDA  Ra, 0x0004(zero)   ;   Ra = 0x00000000`00000004 ; bit  2
    LDA  Ra, 0x0008(zero)   ;   Ra = 0x00000000`00000008 ; bit  3
    LDA  Ra, 0x0010(zero)   ;   Ra = 0x00000000`00000010 ; bit  4
    LDA  Ra, 0x0020(zero)   ;   Ra = 0x00000000`00000020 ; bit  5
    LDA  Ra, 0x0040(zero)   ;   Ra = 0x00000000`00000040 ; bit  6
    LDA  Ra, 0x0080(zero)   ;   Ra = 0x00000000`00000080 ; bit  7
    LDA  Ra, 0x0100(zero)   ;   Ra = 0x00000000`00000100 ; bit  8
    LDA  Ra, 0x0200(zero)   ;   Ra = 0x00000000`00000200 ; bit  9
    LDA  Ra, 0x0400(zero)   ;   Ra = 0x00000000`00000400 ; bit 10
    LDA  Ra, 0x0800(zero)   ;   Ra = 0x00000000`00000800 ; bit 11
    LDA  Ra, 0x1000(zero)   ;   Ra = 0x00000000`00001000 ; bit 12
    LDA  Ra, 0x2000(zero)   ;   Ra = 0x00000000`00002000 ; bit 13
    LDA  Ra, 0x4000(zero)   ;   Ra = 0x00000000`00004000 ; bit 14
    LDAH Ra, 0x0001(zero)   ; \
    LDA  Ra, 0x8000(Ra)     ; / Ra = 0x00000000`00008000 ; bit 15
    LDAH Ra, 0x0001(zero)   ;   Ra = 0x00000000`00010000 ; bit 16
    LDAH Ra, 0x0002(zero)   ;   Ra = 0x00000000`00020000 ; bit 17
    LDAH Ra, 0x0004(zero)   ;   Ra = 0x00000000`00040000 ; bit 18
    LDAH Ra, 0x0008(zero)   ;   Ra = 0x00000000`00080000 ; bit 19
    LDAH Ra, 0x0010(zero)   ;   Ra = 0x00000000`00100000 ; bit 20
    LDAH Ra, 0x0020(zero)   ;   Ra = 0x00000000`00200000 ; bit 21
    LDAH Ra, 0x0040(zero)   ;   Ra = 0x00000000`00400000 ; bit 22
    LDAH Ra, 0x0080(zero)   ;   Ra = 0x00000000`00800000 ; bit 23
    LDAH Ra, 0x0100(zero)   ;   Ra = 0x00000000`01000000 ; bit 24
    LDAH Ra, 0x0200(zero)   ;   Ra = 0x00000000`02000000 ; bit 25
    LDAH Ra, 0x0400(zero)   ;   Ra = 0x00000000`04000000 ; bit 26
    LDAH Ra, 0x0800(zero)   ;   Ra = 0x00000000`08000000 ; bit 27
    LDAH Ra, 0x1000(zero)   ;   Ra = 0x00000000`10000000 ; bit 28
    LDAH Ra, 0x2000(zero)   ;   Ra = 0x00000000`20000000 ; bit 29
    LDAH Ra, 0x4000(zero)   ;   Ra = 0x00000000`40000000 ; bit 30
    LDAH Ra, 0x8000(zero)   ;   Ra = 0xFFFFFFFF`80000000 ; bit 31

Bit 15. Ugh. Bit 15.

It's the only one that requires two instructions to calculate the 32-bit mask in canonical form.

This means that if you have a bitfield, bit 15 is slightly more expensive than the others.

Bonus chatter: In an internal discussion of byte manipulation on the Alpha AXP, I happened to have written "Bit 15. Ugh. Bit 15." Gov Maharaj (co-host of The Defrag Show and application compatibility master extraordinaire) told me, "That needs to be the title of a blog post." So here it is.

Comments (25)

  1. Brian_EE says:

    If the processor had a “one” register, these could have all been shift operations.

  2. pc says:

    Clearly it is really designed to have constants loaded from a constant pool or the like rather than being constructed from code. I pretty much never have to deal with this level of abstraction, though I find it interesting, so this is intended as a completely serious question: Is going through all these ways of loading constants from code just a fun exercise because it’s neat, or is there a reason to do it this way rather than just loading memory from a constant pool? It seems that by the time you’re using multiple instructions to get a value in a register that it has to be easier/faster/etc. to just use one instruction and one constant-pool memory location. But I’m assuming I must be missing something, and I certainly haven’t done any performance testing or the like.

    1. laonianren says:

      Reading constants from memory doesn’t make the problem go away. You need to load the address of the constant before you can load the constant, and that’s the same problem. Plus now you’ve got extra code in your compiler to manage the constant pool.

      The Alpha was superscalar meaning it could execute more than one instruction in a single clock cycle. A canonical two instruction sequence for loading a 32-bit constant is something that could easily be recognised by the processor and run in a single cycle. I don’t know if the Alpha had this particular optimisation, but it’s the kind of improvement that could have happened if the design had been continued.

      In contrast, memory access is spectacularly slow. Even if the value is available in L1 cache it will still take several cycles to read it. If not, it could take hundreds. Or millions if the memory needs to be paged in.

      1. pc says:

        Thanks; that’s starting to make more sense. I guess I assumed that the “load memory” instructions could do something more clever (especially if there’s something like the x86’s “segment registers” so that one could specify loads from a particular place in memory more easily). I can see how at the end of the day, specifying addresses of places to load is roughly the same as specifying the constants in the first place. Does feel like it should only need a couple 32-bit-immediate-load instructions, though (maybe a dedicated instruction to a dedicated register so you don’t have the problem of both needing bits to specify the register as well as the value to load?). I’m sure there were good engineering reasons to do things the way they did, though.

        I was thinking of “reading another couple instructions to get the value” to be roughly equivalent in terms of cache/memory/etc. access as “reading a value from some other place in memory”, particularly if the processor has a big enough pipeline to start requesting the memory before it really needs it. I suppose it wouldn’t be that simple in practice, either, though.

        1. Evan says:

          Traditionally, loading from constant pools like that would be done by putting the constants near the instruction and using a PC-relative address. Loading the address isn’t really a problem. (I don’t know that’s true on Alpha, but I would be very surprised if it’s not.)

          The bigger problem is it doesn’t really save you anything. You save an instruction, but you’ve still got the memory access and reduced cache density compared to if it fit into one instruction. But now the constant is separate from the instruction which increases the chance of a cache miss and all that other jazz mentioned above.

          1. We’ll learn next week that the Alpha does not support PC-relative addressing.

          2. Joshua says:

            Raymond: I think I finally figured it out. Try this sequence:

            BSR at, +1
            32 bit constant
            LDL (at), rx

            BSR at, +2
            64 bit constant
            LDQ (at), rx

            Where this code depends on the Alpha processor not predicting returns based on a call graph (I don’t see how it can as there is no ret instruction). The linker trampoline depends on BSR is relative so …

          3. The Alpha AXP does have a return address predictor, as we’ll see tomorrow. It also has a split I-D cache, so loading data from the instruction stream is as slow as any other data access. Though I’m impressed that you optimized two instructions down to three.

          4. Evan says:

            I am very surprised! I look forward to seeing the post about that missing PC-relative addressing mode!

          5. Les Ismoore says:

            –” I’m impressed that you optimized two instructions down to three”
            In the true nature of RISC.

      2. Matteo Italia says:

        On ARM constant pools are generally placed near the code that uses them (and I’m pretty sure I’ve seen similar stuff on compiled x86_64 code with double FP constants); given that you are currently executing code from that region of memory, surely it’s in some cache (at least, x86 does have separated L1 instruction and data cache, but from L2 down they are shared).

    2. smf says:

      >Clearly it is really designed to have constants loaded from a constant pool or the like rather than being constructed from code.

      No, it’s clearly designed to have constants loaded from code. However like all cpus with the RISC designation, there are compromises that only make sense when considered as a whole.

      Making all instructions the same length means you can’t load every constant in a single instruction, but programmes don’t spend all of their time loading constants. On the other hand it simplified the rest of the chip and saved space on the die so that all instructions will be quick.

      The original designers wouldn’t understand why 1 out of 32 single bit loads needing two instructions rather than one, was such a big deal. I would agree with them.

      The Alpha was based on an earlier RISC processor called PRISM which was led by Dave Cutler, he left and joined Microsoft when the project was cancelled. I’ve not looked to see whether PRISM suffered from the same thing, but it would be ironic if Raymonds disgust was caused by someone senior to him at Microsoft :-)

  3. Rick C says:

    Just to be clear, that circumlocution is to avoid an automatic sign-extension, right?

    1. Pierre B. says:

      The displacement is a *signed* 16-bits value. Indeed, to add a signed value, it has to be sign-extended. It’s a bit confusing here because hexadecimal constants are used, which hides that 0x8000 is really negative.

  4. Pierre B. says:

    Honestly, I find that bit 31 is unsettling. All other set a single bit. I understand that with 32-bits colored glasses it is correct, but I would not be surprised if it sometimes caused subtle issues?

    1. skSdnW says:

      SRL might be problematic.

      1. Pierre B. says:

        You’re right, but as other pointed out, the canonical form is always sign-extended, so the compiler always has to assume the high part of the register might contain set bit, so unsigned right shifts have to be synthetised with more than one instruction to mask the unwanted bits shifting in.

        I guess it makes sense to always sign-extend for sanity reason and the compiler knows how to compensate every corner cases. They must have had a test harness with all possible pair of operations and check to verify the expected 32-bits results and that canonical form was always resotred at the end.

    2. pc says:

      Raymond calls this “canonical form” in Part 1, though I’m not yet clear on what problems would ensue if one used 32-bit operations on a value that wasn’t in canonical form.

  5. Matteo Italia says:

    On other RISC assemblers I know that there are pseudo instructions that load 32 bit constants in a “smart” way – for example, on ARM there’s the LDR reg, imm that gets exploded into a mov/mvn if it can be expressed as an 8 bit constant + rotate + (possibly) a negation, or puts the value into a literal pool and becomes the corresponding LDR reg, mem.

    Did you have anything like this in your Alpha assembler or you had to deal with these oddities by hand?

    1. I never wrote Alpha assembly. I only had to debug it, and the disassembler shows the raw instruction. It doesn’t try to “prettify” them. You can read the assembler documentation to see what synthetic instructions are available.

      1. Joshua says:

        Not to worry.

        On a semi-related note; I confused people when I referred to an instruction as JMP +4. It makes a lot of sense when disassembling but most assemblers don’t get it (it means 4 relative to next instruction rather than 4 absolute).

      2. Matteo Italia says:

        Uh, right, of course. That’s actually kinda what happened to me with x86 – having learned it by debugging, I don’t know most of the assemblers’ tricks, my knowledge is limited to the actual ISA.

        By the way, having a quick glance at the instructions I didn’t find anything like a “magic load” (although it’s possible I didn’t look hard enough), but I managed to find a “div” that has the caveat that destroys some “t” registers – I’d be curious to see if it generates the “integer multiply by inverse” algorithm.

        1. laonianren says:

          This is some kind of magic load: https://msdn.microsoft.com/en-us/library/aa226676(v=vs.60).aspx

          Sadly, the documentation doesn’t describe exactly what it does, though it’s probably an LDA/LDAH pair. I’d give it a go, but the computer I could have tried it on is about fifteen years in the past!

  6. ErikF says:

    Couldn’t the compiler use 64-bit loads for unsigned integers, to avoid the canonicalization issues?

    1. smf says:

      You can’t do a 64bit load in a single instruction.

Skip to main content