Today we're going to do a little exercise based on what we've learned so far. We learned how to perform byte and word loads and stores to memory. And we also learned how to perform atomic memory operations on longs and quads. But how about atomic memory operations on bytes and words?
We will have to put together what we've learned: Combine the byte and word access patterns with the atomic memory update pattern.
To recap: The sequence for reading an aligned word in memory goes like this:
LDQ_U t1, (t0)
EXTWL t1, t0, t1
The sequence for writing an aligned word in memory goes like this:
LDQ_U t5, (t0) ; t5 = yyBA xxxx
INSWL t1, t0, t3 ; t3 = 00ba 0000
MSKWL t5, t0, t5 ; t5 = yy00 xxxx
BIS t5, t3, t5 ; t5 = yyba xxxx
STQ_U t5, (t0)
; Byte sequence is the same, except you use INSBL and MSKBL
And the sequence for an atomic quad update goes like this:
retry:
LDQ_L t1, (t0) ; load locked
... calculate new value of t1 based on old value ...
STQ_C t1, (t0) ; store conditional
; t1 = 1 if store was successful
BEQ t1, failed ; jump if store failed
... continue execution ...
failed:
BR zero, retry ; try again
What we need to do is insert the byte or word extraction,
calculation, and insertion code
where it says "calculate new value of t1 based on old value".
The trick is that there is no LDQ_LU instruction.
You can read for unaligned or you can read locked, but you can't read
for unaligned locked.
Fortunately, this is easy to work around: We emulate the behavior
of LDQ_U in software.
Recall that LDQ_U is the same as LDQ
except that it ignores the bottom 3 bits of the address.
So let's mask out the bottom 3 bits of the address.
; atomically increment the word at the aligned address t0
BIC t3, #3, t0 ; force-align t0 to t3
retry:
LDQ_L t1, (t3) ; load locked
... calculate new value of t1 based on old value ...
STQ_C t1, (t3) ; store conditional
; t1 = 1 if store was successful
BEQ t1, failed ; jump if store failed
... continue execution ...
failed:
BR zero, retry ; try again
Okay, we've successfully emulated the LDQ_LU and
STQ_LU instructions.
Now to do the extraction, calculation, and insertion:
; atomically increment the word at the aligned address t0
BIC t3, #3, t0 ; force-align t0 to t3
retry:
LDQ_L t1, (t3) ; load locked
; t1 = yyBA xxxx
; Extract
EXTWL t1, t3, t2 ; t2 = 0000 00BA (the word value)
; Calculate
ADDL t2, #1, t2 ; increment t2
; Insert
INSWL t2, t0, t2 ; t2 = 00ba 0000
MSKWL t1, t0, t1 ; t1 = yy00 xxxx
BIS t1, t2, t1 ; t1 = yyba xxxx
STQ_C t1, (t3) ; store conditional
; t1 = 1 if store was successful
BEQ t1, failed ; jump if store failed
... continue execution ...
failed:
BR zero, retry ; try again
Fortunately, our extraction, calculation, and insertion
could be performed in under 20 instructions
with no additional memory access,
and no use of potentially-emulated instructions,
so it all fits between the
LDQ_L and STQ_C.
Exercise: What could we do if our calculation required additional memory access or required more than 20 instructions?
For the exercise: Write a tiny loop with LDQ_L and STQ_C that only does a compare-and-swap for aligned quads. Then build an outer loop that loads the old value, does whatever calculations it needs, and does the compare-and-swap as its final step.
I’m pretty sure the tiny loop isn’t a loop but this seems to be the general idea.
It’s more efficient if it is. If the store conditional failed spuriously and the comparison is still satisfied, there’s no need to repeat the (expensive) calculation step.
It’s only more efficient if spurious failures are relatively common. I don’t know if that’s true or not.