The Alpha AXP, part 10: Atomic updates to byte and word memory units


Today we're going to do a little exercise based on what we've learned so far. We learned how to perform byte and word loads and stores to memory. And we also learned how to perform atomic memory operations on longs and quads. But how about atomic memory operations on bytes and words?

We will have to put together what we've learned: Combine the byte and word access patterns with the atomic memory update pattern.

To recap: The sequence for reading an aligned word in memory goes like this:

    LDQ_U  t1, (t0)
    EXTWL  t1, t0, t1

The sequence for writing an aligned word in memory goes like this:

    LDQ_U   t5, (t0)                  ; t5 = yyBA xxxx
    INSWL   t1, t0, t3                ; t3 = 00ba 0000
    MSKWL   t5, t0, t5                ; t5 = yy00 xxxx
    BIS     t5, t3, t5                ; t5 = yyba xxxx
    STQ_U   t5, (t0)

    ; Byte sequence is the same, except you use INSBL and MSKBL

And the sequence for an atomic quad update goes like this:

retry:
    LDQ_L   t1, (t0)        ; load locked
    ... calculate new value of t1 based on old value ...
    STQ_C   t1, (t0)        ; store conditional
                            ; t1 = 1 if store was successful
    BEQ     t1, failed      ; jump if store failed
    ... continue execution ...

failed:
    BR      zero, retry     ; try again

What we need to do is insert the byte or word extraction, calculation, and insertion code where it says "calculate new value of t1 based on old value". The trick is that there is no LDQ_LU instruction. You can read for unaligned or you can read locked, but you can't read for unaligned locked.

Fortunately, this is easy to work around: We emulate the behavior of LDQ_U in software. Recall that LDQ_U is the same as LDQ except that it ignores the bottom 3 bits of the address. So let's mask out the bottom 3 bits of the address.

    ; atomically increment the word at the aligned address t0
    BIC     t3, #3, t0      ; force-align t0 to t3
retry:
    LDQ_L   t1, (t3)        ; load locked
    ... calculate new value of t1 based on old value ...
    STQ_C   t1, (t3)        ; store conditional
                            ; t1 = 1 if store was successful
    BEQ     t1, failed      ; jump if store failed
    ... continue execution ...

failed:
    BR      zero, retry     ; try again

Okay, we've successfully emulated the LDQ_LU and STQ_LU instructions. Now to do the extraction, calculation, and insertion:

    ; atomically increment the word at the aligned address t0
    BIC     t3, #3, t0      ; force-align t0 to t3
retry:
    LDQ_L   t1, (t3)        ; load locked
                            ; t1 = yyBA xxxx

    ; Extract
    EXTWL   t1, t3, t2      ; t2 = 0000 00BA (the word value)

    ; Calculate
    ADDL    t2, #1, t2      ; increment t2

    ; Insert
    INSWL   t2, t0, t2      ; t2 = 00ba 0000
    MSKWL   t1, t0, t1      ; t1 = yy00 xxxx
    BIS     t1, t2, t1      ; t1 = yyba xxxx

    STQ_C   t1, (t3)        ; store conditional
                            ; t1 = 1 if store was successful
    BEQ     t1, failed      ; jump if store failed
    ... continue execution ...

failed:
    BR      zero, retry     ; try again

Fortunately, our extraction, calculation, and insertion could be performed in under 20 instructions with no additional memory access, and no use of potentially-emulated instructions, so it all fits between the LDQ_L and STQ_C.

Exercise: What could we do if our calculation required additional memory access or required more than 20 instructions?

Comments (4)
  1. Torsten says:

    For the exercise: Write a tiny loop with LDQ_L and STQ_C that only does a compare-and-swap for aligned quads. Then build an outer loop that loads the old value, does whatever calculations it needs, and does the compare-and-swap as its final step.

    1. Joshua says:

      I’m pretty sure the tiny loop isn’t a loop but this seems to be the general idea.

      1. Ben Voigt says:

        It’s more efficient if it is. If the store conditional failed spuriously and the comparison is still satisfied, there’s no need to repeat the (expensive) calculation step.

        1. Martin Bonner says:

          It’s only more efficient if spurious failures are relatively common. I don’t know if that’s true or not.

Comments are closed.

Skip to main content