The Alpha AXP, part 7: Memory access, loading unaligned data

Last time, we looked at loading aligned memory. Now we're going to look at unaligned data.

Let's load an unaligned quad. The unaligned quad will span two aligned quads, so we will need to load two quads, extract the pieces, and merge them together.

    LDQ_U   t1, (t0)    ; load lower container ; t1 = FEDC BA??
    LDQ_U   t2, 7(t0)   ; load upper quad      ; t2 = ???? ??HG
    EXTQL   t1, t0, t1  ; align lower portion  ; t1 = 00FE DCBA
    EXTQH   t2, t0, t2  ; align upper portion  ; t2 = HG00 0000
    BIS     t1, t2, t1  ; combine              ; t1 = HGFE DCBA
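
If C is easier to follow than Alpha assembly, the sequence above can be modeled roughly as follows. This is only a sketch: the function names (`ldq_u`, `extql`, `extqh`, `load_unaligned_quad`) and the byte array standing in for memory are invented for illustration, and it assumes the processor is running in little-endian mode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of LDQ_U: ignore the low 3 address bits and load the enclosing
   aligned quad. The quad is assembled byte by byte (least significant byte
   at the lowest address) so the model is little-endian on any host. */
uint64_t ldq_u(const uint8_t *mem, size_t size, uint64_t addr)
{
    uint64_t base = addr & ~(uint64_t)7;
    assert(base + 8 <= size);           /* stay inside the model memory */
    uint64_t value = 0;
    for (int i = 7; i >= 0; i--) {
        value = (value << 8) | mem[base + i];
    }
    return value;
}

/* Model of EXTQL: shift right by 8 * (low 3 bits of the address). */
uint64_t extql(uint64_t value, uint64_t addr)
{
    return value >> (8 * (addr & 7));
}

/* Model of EXTQH: shift left by (64 - 8 * (low 3 bits)), keeping only the
   low 6 bits of the shift count. */
uint64_t extqh(uint64_t value, uint64_t addr)
{
    return value << ((64 - 8 * (addr & 7)) & 63);
}

uint64_t load_unaligned_quad(const uint8_t *mem, size_t size, uint64_t addr)
{
    uint64_t t1 = ldq_u(mem, size, addr);       /* LDQ_U t1, (t0)   */
    uint64_t t2 = ldq_u(mem, size, addr + 7);   /* LDQ_U t2, 7(t0)  */
    t1 = extql(t1, addr);                       /* EXTQL t1, t0, t1 */
    t2 = extqh(t2, addr);                       /* EXTQH t2, t0, t2 */
    return t1 | t2;                             /* BIS   t1, t2, t1 */
}
```

The second load uses `addr + 7` because that is the address of the last byte of the quad, so it always lands in whichever container holds the upper portion.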

In the case where the value happens to have been aligned by sheer luck, the sequence still works as intended. The instructions do a bunch of redundant work (because they are dealing with a misalignment that never happened), but you still get the correct result.

    LDQ_U   t1, (t0)    ; load lower container ; t1 = HGFE DCBA
    LDQ_U   t2, 7(t0)   ; load upper quad      ; t2 = HGFE DCBA
    EXTQL   t1, t0, t1  ; align lower portion  ; t1 = HGFE DCBA
    EXTQH   t2, t0, t2  ; align upper portion  ; t2 = HGFE DCBA
    BIS     t1, t2, t1  ; combine              ; t1 = HGFE DCBA
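
The aligned case works because the high-part extraction uses only the low six bits of its shift count, so a nominal shift of 64 becomes a shift of zero. Here is a small C sketch of that detail. The function names are made up for illustration, and since both loads return the same quad when the address is aligned, the model operates on a single register value.

```c
#include <assert.h>
#include <stdint.h>

/* Model of EXTQL: shift right by 8 * (low 3 bits of the address). */
uint64_t extql(uint64_t value, uint64_t addr)
{
    return value >> (8 * (addr & 7));
}

/* Model of EXTQH: the nominal shift is 64 - 8 * (low 3 bits), but only the
   low 6 bits of that count are used. For an aligned address the count is
   64, which reduces to 0. (In C the "& 63" is mandatory: shifting a 64-bit
   value by 64 is undefined behavior.) */
uint64_t extqh(uint64_t value, uint64_t addr)
{
    return value << ((64 - 8 * (addr & 7)) & 63);
}

/* The extract-and-merge steps when the address is already aligned. */
uint64_t merge_aligned(uint64_t quad, uint64_t aligned_addr)
{
    assert((aligned_addr & 7) == 0);
    uint64_t t1 = extql(quad, aligned_addr);    /* unchanged copy   */
    uint64_t t2 = extqh(quad, aligned_addr);    /* unchanged copy   */
    return t1 | t2;                             /* x | x is still x */
}
```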

A similar pattern exists for unaligned longs. Longs require an extra step to ensure the result is in canonical form, meaning sign-extended to 64 bits.

    LDQ_U   t1, (t0)    ; load lower container ; t1 = BA?? ????
    LDQ_U   t2, 3(t0)   ; load upper quad      ; t2 = ???? ??DC
    EXTLL   t1, t0, t1  ; align lower portion  ; t1 = 0000 00BA
    EXTLH   t2, t0, t2  ; align upper portion  ; t2 = 0000 DC00
    BIS     t1, t2, t1  ; combine              ; t1 = 0000 DCBA
    ADDL    t1, zero, t1; put in canonical form; t1 = ssss DCBA
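
The long version can be modeled in C along the same lines. Again, this is a sketch under assumptions: the names are invented for illustration, memory is a plain byte array, and the processor is assumed to be in little-endian mode. The sign extension is written with arithmetic rather than a cast so that it does not rely on implementation-defined behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of LDQ_U: load the aligned quad that contains addr. */
uint64_t ldq_u(const uint8_t *mem, size_t size, uint64_t addr)
{
    uint64_t base = addr & ~(uint64_t)7;
    assert(base + 8 <= size);
    uint64_t value = 0;
    for (int i = 7; i >= 0; i--) {
        value = (value << 8) | mem[base + i];
    }
    return value;
}

/* Model of EXTLL: shift right, then keep the low 4 bytes. */
uint64_t extll(uint64_t value, uint64_t addr)
{
    return (value >> (8 * (addr & 7))) & 0xFFFFFFFFu;
}

/* Model of EXTLH: shift left (count taken mod 64), then keep the low 4 bytes. */
uint64_t extlh(uint64_t value, uint64_t addr)
{
    return (value << ((64 - 8 * (addr & 7)) & 63)) & 0xFFFFFFFFu;
}

/* Model of ADDL t1, zero, t1: take the low 32 bits and sign-extend them. */
int64_t addl_zero(uint64_t value)
{
    uint32_t low = (uint32_t)value;
    return ((int64_t)low ^ 0x80000000LL) - 0x80000000LL;
}

int64_t load_unaligned_long(const uint8_t *mem, size_t size, uint64_t addr)
{
    uint64_t t1 = ldq_u(mem, size, addr);       /* LDQ_U t1, (t0)   */
    uint64_t t2 = ldq_u(mem, size, addr + 3);   /* LDQ_U t2, 3(t0)  */
    t1 = extll(t1, addr);                       /* EXTLL t1, t0, t1 */
    t2 = extlh(t2, addr);                       /* EXTLH t2, t0, t2 */
    return addl_zero(t1 | t2);                  /* BIS, then ADDL   */
}
```

When the long does not cross a quad boundary, both loads fetch the same container and the high extraction contributes either zero or a second copy of the same bytes, so the merge is harmless.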

And you can probably guess the pattern for unaligned words:

    LDQ_U   t1, (t0)    ; load lower container ; t1 = A??? ????
    LDQ_U   t2, 1(t0)   ; load upper quad      ; t2 = ???? ???B
    EXTWL   t1, t0, t1  ; align lower portion  ; t1 = 0000 000A
    EXTWH   t2, t0, t2  ; align upper portion  ; t2 = 0000 00B0
    BIS     t1, t2, t1  ; combine              ; t1 = 0000 00BA
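
Seen side by side, the quad, long, and word sequences differ only in the offset of the second load (width minus one) and in how many bytes the extraction keeps. A C sketch that parameterizes the pattern by width might look like this. The function names and the `width` parameter are hypothetical, little-endian mode is assumed, and the result is zero-extended, so the extra canonicalization step for longs is not included.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of LDQ_U: load the aligned quad that contains addr. */
uint64_t ldq_u(const uint8_t *mem, size_t size, uint64_t addr)
{
    uint64_t base = addr & ~(uint64_t)7;
    assert(base + 8 <= size);
    uint64_t value = 0;
    for (int i = 7; i >= 0; i--) {
        value = (value << 8) | mem[base + i];
    }
    return value;
}

/* Zero-extended unaligned load of a 2-, 4-, or 8-byte value. */
uint64_t load_unaligned(const uint8_t *mem, size_t size,
                        uint64_t addr, unsigned width)
{
    assert(width == 2 || width == 4 || width == 8);

    /* Byte mask applied by EXTWx (2 bytes), EXTLx (4), or EXTQx (8). */
    uint64_t mask = (width == 8) ? ~(uint64_t)0
                                 : (((uint64_t)1 << (8 * width)) - 1);
    unsigned n = (unsigned)(addr & 7);          /* byte offset in the quad */

    uint64_t t1 = ldq_u(mem, size, addr);               /* lower container */
    uint64_t t2 = ldq_u(mem, size, addr + width - 1);   /* upper container */

    t1 = (t1 >> (8 * n)) & mask;                /* EXTxL: low portion  */
    t2 = (t2 << ((64 - 8 * n) & 63)) & mask;    /* EXTxH: high portion */
    return t1 | t2;                             /* BIS                 */
}
```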

If you need sign extension for the unaligned word, then you can use the trick we saw last time: extract the value into the top of the register, then bring it back down with an arithmetic shift.

    LDQ_U   t1, (t0)    ; load lower container     ; t1 = A??? ????
    LDQ_U   t2, 1(t0)   ; load upper quad          ; t2 = ???? ???B
    LDA     t3, 2(t0)   ; sneaky trick: extract into bytes 6 and 7
    EXTQL   t1, t3, t1  ; align lower portion high ; t1 = 0A?? ????
    EXTQH   t2, t3, t2  ; align upper portion high ; t2 = B000 0000
    BIS     t1, t2, t1  ; combine                  ; t1 = BA?? ????
    SRA     t1, #48, t1 ; shift right with sign    ; t1 = ssss ssBA
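
Here is a C sketch of the sign-extending sequence, under the same assumptions as before (invented names, a byte array for memory, little-endian mode). Adding 2 to the address before extracting moves the word into bytes 6 and 7 of the register, leaving junk in the lower bytes that the final shift throws away. Because right-shifting a negative value is implementation-defined in C, the arithmetic shift is modeled with a logical shift followed by an explicit sign extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of LDQ_U: load the aligned quad that contains addr. */
uint64_t ldq_u(const uint8_t *mem, size_t size, uint64_t addr)
{
    uint64_t base = addr & ~(uint64_t)7;
    assert(base + 8 <= size);
    uint64_t value = 0;
    for (int i = 7; i >= 0; i--) {
        value = (value << 8) | mem[base + i];
    }
    return value;
}

int64_t load_unaligned_word_signed(const uint8_t *mem, size_t size,
                                   uint64_t addr)
{
    uint64_t t1 = ldq_u(mem, size, addr);       /* LDQ_U t1, (t0)   */
    uint64_t t2 = ldq_u(mem, size, addr + 1);   /* LDQ_U t2, 1(t0)  */
    uint64_t t3 = addr + 2;                     /* LDA   t3, 2(t0)  */

    t1 = t1 >> (8 * (t3 & 7));                  /* EXTQL t1, t3, t1 */
    t2 = t2 << ((64 - 8 * (t3 & 7)) & 63);      /* EXTQH t2, t3, t2 */
    t1 |= t2;                                   /* word now in bytes 6-7 */

    /* SRA t1, #48, t1: shift down, then sign-extend from bit 15. */
    uint64_t word = t1 >> 48;
    return (int64_t)(word ^ 0x8000u) - 0x8000;
}
```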

Exercise: There's an obvious continuation of this pattern for unaligned bytes, so why doesn't anybody use it?

That's it for loading bytes, words, and unaligned data from memory. Next time, we'll start looking at writing them, which is a lot more complicated.

Bonus chatter: Later versions of the Alpha AXP processor added support for byte reads and writes, as well as aligned word reads and writes. This makes code easier to write, but probably makes the store-to-load forwarder logic much harder.

Comments (11)
  1. Damien says:

    Your comments to the right of the code, starting from the long example seem to be too small in terms of the good components vs ?s. Indeed, by the word example, we do seem to be reading an unaligned byte.

    1. If there are eight characters representing a 64-bit value, then each character must represent a byte.

      1. pc says:

        Yes. Obvious once one looks at it, but programmers are used to treating beginning-of-alphabet values as Hex nibbles, so it took me a second glance at least to realize that they were just placeholders for byte values.

  2. pc says:

    Exercise answer: Single bytes rarely span more than one word.

  3. camhusmj38 says:

    I imagine that the store to load forwarding issue is less severe on ALPHA compared to x86 – as there are no memory operands.

    1. camhusmj38 says:

      Also, I think it is because ALPHA has a weak memory model, so it can reorder memory operations more freely.

      1. Fabian Giesen says:

        Not even the Alpha allows a single core to observe its own stores out of order. The Alpha’s weak memory model affects visibility of memory operations to external agents (other cores, devices on the bus etc.), not to the core itself.

        Pretty much every architecture short of explicitly exposed-pipeline VLIWs (very uncommon outside of DSPs and special-purpose devices) requires any earlier store in program order on the same core to be visible to later load instructions, for the simple reason that not doing this effectively makes a particular store-to-load latency part of the architecture spec, greatly complicating all later attempts to change the microarchitecture of load/store units.

        Memory operands or not doesn’t really have an impact here; whether you allow memory operands to non-load/store instructions affects the complexity of your instruction decoding and internal scheduling (and also makes a difference for code density), but it does not substantially change the design of the load/store units themselves.

        One final remark: current x86s reorder memory operations very freely. The x86 memory model doesn’t say that CPUs are not allowed to reorder loads or stores quite freely; it says they’re not allowed to get caught doing it by an outside observer, which is a substantial and very important difference. :)

        What x86s end up doing is execute loads and stores whenever it suits them, but a) internally keep track of their dependencies on the current core (this is just a necessary ingredient for any core that does out-of-order execution and not specific to x86), b) keep track of any external events (loads and stores to addresses that there are pending operations for) that might result in the internal reordering becoming visible to the outside. x86s are only on their best behavior when they know someone is watching. :) CPUs for other architectures also do this, but only need to enforce weaker rules, and therefore need to say “nuh-uh, that sequence of events violates the rules, cancel and try that again” less often.

        x86’s more stringent rules make the internal dependency tracking harder to design and validate, but no matter what the memory model is, the vast majority of memory operations executed don’t have any external agents loading from or storing to the same cache line while they’re in-flight, and hence can be freely reordered without anyone being the wiser.

        1. camhusmj38 says:

          Thank you very much for your excellent explanation.

  4. Klimax says:

    Frankly, so far all articles on Alpha made me glad that I don’t have to deal with it at all on any level.

    So much extra code…

    1. To be fair, it’s all stuff that the compiler would’ve abstracted away anyway. Manually assembling Windows code is not really a recommended practice for developing software.

      1. smf says:

        But still stuff everyone involved in bringing up a system would need to know about.

