The Alpha AXP, part 8: Memory access, storing bytes and words and unaligned data

Storing a byte or a word requires a series of three operations: Read the original data, modify it to incorporate the byte or word, then write the modified data back to memory.

Two groups of instructions, known as insertion and masking, assist with the modification step.

```
INSBL   Ra, Rb/#b, Rc  ; Rc =  (uint8_t)Ra << (Rb/#b * 8 % 64)
INSWL   Ra, Rb/#b, Rc  ; Rc = (uint16_t)Ra << (Rb/#b * 8 % 64)
INSLL   Ra, Rb/#b, Rc  ; Rc = (uint32_t)Ra << (Rb/#b * 8 % 64)
INSQL   Ra, Rb/#b, Rc  ; Rc = (uint64_t)Ra << (Rb/#b * 8 % 64)

INSWH   Ra, Rb/#b, Rc  ; Rc = (uint16_t)Ra >> ((64 - Rb/#b * 8) % 64)
INSLH   Ra, Rb/#b, Rc  ; Rc = (uint32_t)Ra >> ((64 - Rb/#b * 8) % 64)
INSQH   Ra, Rb/#b, Rc  ; Rc = (uint64_t)Ra >> ((64 - Rb/#b * 8) % 64)
```

These are the inverse of the extraction instructions. Instead of extracting data from a 128-bit value, they move the data into position within a 128-bit value. For example, here's a diagram of inserting the long `FGHI` into a 128-bit value:

```
high part  low part
--------- ---------
0000 0FGH           -- INSLH
          I000 0000 -- INSLL
```
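To make the diagram concrete, here is a rough C model of the insertion instructions. It is a sketch, not a reference implementation: the function names (`field_mask`, `ins_low`, `ins_high`) and the `width` parameter are invented for illustration, and it assumes the high-part instructions produce zero whenever none of the inserted bytes spill into the high quad, including when the byte offset is zero.

```c
#include <stdint.h>

/* 64-bit mask covering the bytes of a width-byte field at byte offset
   (rb & 7) that land in the low quad (high == 0) or the high quad
   (high != 0) of the 128-bit value. Hypothetical helper. */
uint64_t field_mask(uint64_t rb, unsigned width, int high)
{
    unsigned sel = ((1u << width) - 1u) << (rb & 7);  /* 16 byte-select bits */
    sel = high ? (sel >> 8) : (sel & 0xFF);
    uint64_t mask = 0;
    for (int i = 0; i < 8; i++) {
        if (sel & (1u << i)) {
            mask |= (uint64_t)0xFF << (i * 8);
        }
    }
    return mask;
}

/* Models INSBL/INSWL/INSLL/INSQL for width = 1, 2, 4, 8. */
uint64_t ins_low(uint64_t ra, uint64_t rb, unsigned width)
{
    return (ra << ((rb & 7) * 8)) & field_mask(rb, width, 0);
}

/* Models INSWH/INSLH/INSQH for width = 2, 4, 8. */
uint64_t ins_high(uint64_t ra, uint64_t rb, unsigned width)
{
    return (ra >> ((64 - (rb & 7) * 8) & 63)) & field_mask(rb, width, 1);
}
```

With the long `FGHI` taken as the ASCII value `0x46474849` and a byte offset of 7, `ins_low` leaves `I` in the top byte of the low part and `ins_high` leaves `FGH` in the bottom three bytes of the high part, matching the diagram.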

The last piece of the puzzle is the masking instructions.

```
MSKBL   Ra, Rb/#b, Rc  ; Rc = Ra & ~( (uint8_t)~0 << (Rb/#b * 8 % 64))
MSKWL   Ra, Rb/#b, Rc  ; Rc = Ra & ~((uint16_t)~0 << (Rb/#b * 8 % 64))
MSKLL   Ra, Rb/#b, Rc  ; Rc = Ra & ~((uint32_t)~0 << (Rb/#b * 8 % 64))
MSKQL   Ra, Rb/#b, Rc  ; Rc = Ra & ~((uint64_t)~0 << (Rb/#b * 8 % 64))

MSKWH   Ra, Rb/#b, Rc  ; Rc = Ra & ~((uint16_t)~0 >> ((64 - Rb/#b * 8) % 64))
MSKLH   Ra, Rb/#b, Rc  ; Rc = Ra & ~((uint32_t)~0 >> ((64 - Rb/#b * 8) % 64))
MSKQH   Ra, Rb/#b, Rc  ; Rc = Ra & ~((uint64_t)~0 >> ((64 - Rb/#b * 8) % 64))
```

These instructions zero out the bytes of a 128-bit value that are about to be replaced by an insertion.

For example, here's how the masking of a long would work:

```
high part  low part
--------- ---------
ABCD EFGH IJKL MNOP -- 16-byte value
      ^^^ ^         -- 4 bytes to be inserted here
ABCD E000           -- MSKLH
          0JKL MNOP -- MSKLL
```
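The same diagram can be checked with a small C model of the masking instructions. This is only a sketch: `field_mask`, `msk_low`, and `msk_high` are made-up names, and it assumes the high-part instructions leave the register untouched when none of the affected bytes fall in the high quad.

```c
#include <stdint.h>

/* 64-bit mask covering the bytes of a width-byte field at byte offset
   (rb & 7) that land in the low quad (high == 0) or the high quad
   (high != 0) of the 128-bit value. Hypothetical helper. */
uint64_t field_mask(uint64_t rb, unsigned width, int high)
{
    unsigned sel = ((1u << width) - 1u) << (rb & 7);  /* 16 byte-select bits */
    sel = high ? (sel >> 8) : (sel & 0xFF);
    uint64_t mask = 0;
    for (int i = 0; i < 8; i++) {
        if (sel & (1u << i)) {
            mask |= (uint64_t)0xFF << (i * 8);
        }
    }
    return mask;
}

/* Models MSKBL/MSKWL/MSKLL/MSKQL for width = 1, 2, 4, 8. */
uint64_t msk_low(uint64_t ra, uint64_t rb, unsigned width)
{
    return ra & ~field_mask(rb, width, 0);
}

/* Models MSKWH/MSKLH/MSKQH for width = 2, 4, 8. */
uint64_t msk_high(uint64_t ra, uint64_t rb, unsigned width)
{
    return ra & ~field_mask(rb, width, 1);
}
```

Feeding in the ASCII bytes `ABCDEFGH` as the high quad and `IJKLMNOP` as the low quad with a byte offset of 7 zeroes out `FGH` in the high part and `I` in the low part, as in the diagram.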

Putting the pieces together, we see that in order to replace a long in the middle of a 128-bit value, you would use the insertion instructions to place the new value in the correct position, the masking instructions to zero out the bits that used to be there, and then "or" the pieces together.

```
; store an unaligned long in t1 to (t0)
; first read the 128-bit value currently in memory
LDQ_U   t2,3(t0)                    ; t2 = yyyy yyyD
LDQ_U   t5,(t0)                     ; t5 =           CBAx xxxx

; build the values to insert
INSLH   t1,t0,t4                    ; t4 = 0000 000d
INSLL   t1,t0,t3                    ; t3 =           cba0 0000

; mask out the values to be replaced
MSKLH   t2,t0,t2                    ; t2 = yyyy yyy0
MSKLL   t5,t0,t5                    ; t5 =           000x xxxx

; "or" the new values into place
BIS     t2,t4,t2                    ; t2 = yyyy yyyd
BIS     t5,t3,t5                    ; t5 =           cbax xxxx

; and write the results back out
STQ_U   t2,3(t0)                    ; must store high then low
STQ_U   t5,(t0)                     ; in case there was no straddling
```
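For readers who would rather trace this in C, the sequence above can be simulated against a byte array. Everything here is an illustrative assumption: the memory is a plain little-endian `uint8_t` buffer, the helpers (`ldq_u`, `stq_u`, `field_mask`, and friends) are invented names, and the instruction models assume the high-part operations do nothing when no bytes spill into the high quad.

```c
#include <stddef.h>
#include <stdint.h>

/* Mask of the field's bytes that land in the low (high == 0) or high quad. */
uint64_t field_mask(uint64_t rb, unsigned width, int high)
{
    unsigned sel = ((1u << width) - 1u) << (rb & 7);
    sel = high ? (sel >> 8) : (sel & 0xFF);
    uint64_t mask = 0;
    for (int i = 0; i < 8; i++)
        if (sel & (1u << i))
            mask |= (uint64_t)0xFF << (i * 8);
    return mask;
}

uint64_t ins_low(uint64_t ra, uint64_t rb, unsigned w)
{ return (ra << ((rb & 7) * 8)) & field_mask(rb, w, 0); }
uint64_t ins_high(uint64_t ra, uint64_t rb, unsigned w)
{ return (ra >> ((64 - (rb & 7) * 8) & 63)) & field_mask(rb, w, 1); }
uint64_t msk_low(uint64_t ra, uint64_t rb, unsigned w)
{ return ra & ~field_mask(rb, w, 0); }
uint64_t msk_high(uint64_t ra, uint64_t rb, unsigned w)
{ return ra & ~field_mask(rb, w, 1); }

/* LDQ_U: load the aligned quad containing addr (little-endian). */
uint64_t ldq_u(const uint8_t *mem, size_t addr)
{
    size_t base = addr & ~(size_t)7;
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | mem[base + i];
    return v;
}

/* STQ_U: store to the aligned quad containing addr (little-endian). */
void stq_u(uint8_t *mem, size_t addr, uint64_t v)
{
    size_t base = addr & ~(size_t)7;
    for (int i = 0; i < 8; i++)
        mem[base + i] = (uint8_t)(v >> (i * 8));
}

/* Same steps as the assembly: load, insert, mask, "or", store. */
void store_unaligned_long(uint8_t *mem, size_t t0, uint32_t t1)
{
    uint64_t t2 = ldq_u(mem, t0 + 3);
    uint64_t t5 = ldq_u(mem, t0);
    uint64_t t4 = ins_high(t1, t0, 4);
    uint64_t t3 = ins_low(t1, t0, 4);
    t2 = msk_high(t2, t0, 4) | t4;
    t5 = msk_low(t5, t0, 4) | t3;
    stq_u(mem, t0 + 3, t2);   /* high first */
    stq_u(mem, t0, t5);       /* then low */
}
```

Calling `store_unaligned_long(mem, 5, 0x11223344)` on a 16-byte buffer should change only bytes 5 through 8. Swapping the order of the two `stq_u` calls breaks the non-straddling case, which is why the assembly stores high and then low.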

Extending this pattern to quads and words is left as an exercise.

Notice that in the case where t0 does not straddle two quads, we perform two reads from the same location, and two writes to the same location. Let's walk through what happens:

```
; first read the 128-bit value currently in memory
; (which is really the same 64-bit value twice)
LDQ_U   t2,3(t0)                    ; t2 = yyDC BAxx
LDQ_U   t5,(t0)                     ; t5 = yyDC BAxx

; build the values to insert
INSLH   t1,t0,t4                    ; t4 = 0000 0000
INSLL   t1,t0,t3                    ; t3 = 00dc ba00

; mask out the values to be replaced
MSKLH   t2,t0,t2                    ; t2 = yyDC BAxx
MSKLL   t5,t0,t5                    ; t5 = yy00 00xx

; "or" the new values into place
BIS     t2,t4,t2                    ; t2 = yyDC BAxx
BIS     t5,t3,t5                    ; t5 = yydc baxx

; and write the results back out
STQ_U   t2,3(t0)                    ; write same value back
STQ_U   t5,(t0)                     ; write updated value
```

This highlights some of the weird memory effects of the Alpha AXP. If another thread snuck in between our loads and our stores and modified the memory at t0 & ~7, those changes would be reverted at the first `STQ_U`, and then the updated value gets written next. This means that the value changes from `yyDCBAxx` to `zzDCBAww`, and then back to `yyDCBAxx`, and then finally to `yydcbaxx`. The value changes, and then appears to change back to the old value, before finally being updated to a new (sort-of) value.
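The interleaving described above can be replayed in a single-threaded C sketch. Nothing here is real concurrency: the "other thread" is just an assignment placed between the loads and the stores, the quad is a plain `uint64_t`, and the function name and `trace` array are invented for illustration. It covers only the non-straddling case, so `off` is assumed to be at most 4.

```c
#include <stdint.h>

/* Replays the non-straddling unaligned long store with a competing write.
   trace[0] = initial quad
   trace[1] = quad after the other thread's write
   trace[2] = quad after the first STQ_U (the stale copy)
   trace[3] = quad after the second STQ_U (the updated copy) */
void interleaved_store(uint64_t quad, uint64_t other_write,
                       uint32_t value, unsigned off, uint64_t trace[4])
{
    uint64_t field = (uint64_t)0xFFFFFFFFu << (off * 8);

    trace[0] = quad;
    uint64_t t2 = quad;                 /* LDQ_U t2,3(t0) */
    uint64_t t5 = quad;                 /* LDQ_U t5,(t0)  */

    quad = other_write;                 /* other thread sneaks in */
    trace[1] = quad;

    /* INSLH yields zero and MSKLH changes nothing here, so t2 is stale. */
    t5 = (t5 & ~field) | ((uint64_t)value << (off * 8));

    quad = t2;                          /* STQ_U t2,3(t0) */
    trace[2] = quad;
    quad = t5;                          /* STQ_U t5,(t0)  */
    trace[3] = quad;
}
```

Using ASCII bytes for the letters in the diagram, the trace reads `yyDCBAxx`, `zzDCBAww`, `yyDCBAxx`, `yydcbaxx`: the other thread's `zz` and `ww` are gone by the end.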

We'll learn more about the Alpha AXP memory model later.

In the case where you are writing a word and you know that it is aligned, then you can avoid having to deal with the 128-bit value and operate within a 64-bit value (because an aligned word will never straddle two quads).

```
; store an aligned word in t1 to (t0)
; first read the 64-bit value currently in memory
LDQ_U   t5,(t0)                     ; t5 = yyBA xxxx

; build the value to insert
INSWL   t1,t0,t3                    ; t3 = 00ba 0000

; mask out the values to be replaced
MSKWL   t5,t0,t5                    ; t5 = yy00 xxxx

; "or" the new values into place
BIS     t5,t3,t5                    ; t5 = yyba xxxx

; and write the results back out
STQ_U   t5,(t0)
```

Okay, but what about bytes? Well, bytes can never be misaligned, so we always go through the "known aligned" shortcut.

```
; store a byte in t1 to (t0)
; first read the 64-bit value currently in memory
LDQ_U   t5,(t0)                     ; t5 = yyyA xxxx

; build the value to insert
INSBL   t1,t0,t3                    ; t3 = 000a 0000

; mask out the values to be replaced
MSKBL   t5,t0,t5                    ; t5 = yyy0 xxxx

; "or" the new values into place
BIS     t5,t3,t5                    ; t5 = yyya xxxx

; and write the results back out
STQ_U   t5,(t0)
```
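The single-quad shortcut for bytes and aligned words is short enough to model in a few lines of C. As before, this is a sketch under assumptions: memory is a little-endian byte array, and `ldq_u`, `stq_u`, and `store_small` are hypothetical names. `width` is 1 for a byte or 2 for an aligned word.

```c
#include <stddef.h>
#include <stdint.h>

/* LDQ_U: load the aligned quad containing addr (little-endian). */
uint64_t ldq_u(const uint8_t *mem, size_t addr)
{
    size_t base = addr & ~(size_t)7;
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | mem[base + i];
    return v;
}

/* STQ_U: store to the aligned quad containing addr (little-endian). */
void stq_u(uint8_t *mem, size_t addr, uint64_t v)
{
    size_t base = addr & ~(size_t)7;
    for (int i = 0; i < 8; i++)
        mem[base + i] = (uint8_t)(v >> (i * 8));
}

/* Store a byte (width 1) or an aligned word (width 2) held in t1 to t0. */
void store_small(uint8_t *mem, size_t t0, uint64_t t1, unsigned width)
{
    unsigned shift = (unsigned)(t0 & 7) * 8;
    uint64_t field = (width == 1 ? (uint64_t)0xFF : (uint64_t)0xFFFF) << shift;

    uint64_t t5 = ldq_u(mem, t0);          /* LDQ_U t5,(t0)        */
    uint64_t t3 = (t1 << shift) & field;   /* INSBL or INSWL       */
    t5 &= ~field;                          /* MSKBL or MSKWL       */
    t5 |= t3;                              /* BIS                  */
    stq_u(mem, t0, t5);                    /* STQ_U t5,(t0)        */
}
```

Note that `store_small` rewrites all eight bytes of the quad even though only one or two of them change.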

Dealing with unaligned memory on the Alpha AXP is very annoying. Notice that updates to words and bytes, even aligned words, are not atomic. We read the entire quad from memory, perform some register calculations, and then write the entire quad back out. If somebody made a change to another byte within the quad, we will wipe out that change when we complete our word or byte update.

Next time, we'll look at atomic memory operations.

Bonus chatter: There is one more pair of instructions which operate on the bytes within a register: `ZAP` and `ZAPNOT`.

```
ZAP     Ra, Rb/#b, Rc  ; Rc = Ra after zeroing the bytes selected by Rb/#b
ZAPNOT  Ra, Rb/#b, Rc  ; Rc = Ra after zeroing the bytes selected by ~Rb/#b
```

The `ZAP` and `ZAPNOT` instructions treat the low-order 8 bits of the second parameter as references to the corresponding bytes of the Ra register: Bit n of Rb/#b corresponds to bits n × 8 through n × 8 + 7 of Ra. The `ZAP` instruction sets the byte to zero if the corresponding bit is set; the `ZAPNOT` instruction sets the byte to zero if the corresponding bit is clear. The other 56 bits of the second parameter are ignored.

For example, `ZAP v0, #128, v0` clears the top byte of v0, and `ZAPNOT v0, #128, v0` clears all but the top byte of v0. (For some reason, I had trouble remembering which way is which. My trick was to pretend that the `ZAPNOT` instruction is called `KEEP`.)
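If the bit-to-byte mapping is easier to read as code, here is a minimal C model of the two instructions. The function names are invented for this sketch; only the behavior described above is assumed.

```c
#include <stdint.h>

/* Expand an 8-bit byte selector into a 64-bit mask: bit n selects
   bits n*8 through n*8+7. Hypothetical helper. */
uint64_t expand_bytes(unsigned sel)
{
    uint64_t mask = 0;
    for (int i = 0; i < 8; i++) {
        if (sel & (1u << i)) {
            mask |= (uint64_t)0xFF << (i * 8);
        }
    }
    return mask;
}

/* ZAP: zero the bytes whose selector bit is set. */
uint64_t zap(uint64_t ra, uint64_t rb)
{
    return ra & ~expand_bytes((unsigned)(rb & 0xFF));
}

/* ZAPNOT: zero the bytes whose selector bit is clear ("KEEP"). */
uint64_t zapnot(uint64_t ra, uint64_t rb)
{
    return ra & expand_bytes((unsigned)(rb & 0xFF));
}
```

With `0x1122334455667788` as input, `zap(x, 128)` gives `0x0022334455667788` and `zapnot(x, 128)` gives `0x1100000000000000`.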

As a special case, these instructions provide a handy way to zero-extend a register.

```
ZAPNOT  Ra, #1, Rc  ; zero-extend byte from Ra to Rc
ZAPNOT  Ra, #3, Rc  ; zero-extend word from Ra to Rc
ZAPNOT  Ra, #15, Rc ; zero-extend long from Ra to Rc
```

Note that in the last case, zero-extending a negative long will result in a 32-bit value in non-canonical form. But you hopefully were expecting that; if you want to sign-extend the value (in order to ensure a value in canonical form), you would have done `ADDL Ra, #0, Rc`.
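The difference between the two extensions is easy to see in C. This sketch models only the register results; the function names are made up, and "canonical form" is taken to mean that the 64-bit register holds the sign extension of the 32-bit value.

```c
#include <stdint.h>

/* Models ZAPNOT Ra, #15, Rc: keep the low four bytes, zero the rest. */
uint64_t zero_extend_long(uint64_t ra)
{
    return ra & 0xFFFFFFFFull;
}

/* Models ADDL Ra, #0, Rc: the 32-bit result is sign-extended to 64 bits.
   Done in unsigned arithmetic to avoid implementation-defined casts. */
uint64_t sign_extend_long(uint64_t ra)
{
    uint64_t low = ra & 0xFFFFFFFFull;
    return (low ^ 0x80000000ull) - 0x80000000ull;
}
```

For the 32-bit value -2, `zero_extend_long` yields `0x00000000FFFFFFFE` (non-canonical), while `sign_extend_long` yields `0xFFFFFFFFFFFFFFFE` (canonical). For non-negative longs the two agree.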


1. Antonio Rodríguez says:

Thus, in many common operations (updating a byte or a word in an array – or a string), pure RISC machines take up to five instructions where a CISC machine would do it in just one. Of course, if you are processing a string (or an array), you can batch the updates and save some instructions (i.e., if you write all of a quad’s eight bytes, you only have to execute one store, and if you don’t mind the original contents, you save the load). I wonder what the performance hit is.

1. Matthew Vincent says:

The whole point of a RISC architecture is to reduce the cost of an instruction so that each one takes fewer clock cycles, and hence in a given number of clock cycles you can do more. CISC instruction sets tend to be converted to micro ops internally, which would more closely mimic a RISC architecture.

1. Someone says:

But the high number of code bytes a RISC cpu needs to perform even very simple operations occupy valuable memory bandwidth (and cache). The CPU and the memory and the caches should mainly process or transfer application data, not micro-operations.

2. Antonio Rodríguez says:

Yes, I know what the difference between “C” and “R” means :-) . But as Someone said, storing a single byte or aligned word can take up to six operations (address calculation, loading, masking the original value, rotating the byte/word to match its destination, logical OR and storing) and three temporary registers (address, temporary value, and rotated byte/word). Of course, unaligned accesses will be more expensive.

Classic RISC machines typically had about twice as many execution units as CISC machines of the same era (nowadays, with the Intel Core series sporting six integer execution units, I doubt it would be true, but we are talking about late 90s machines, right?). And yes, I talk about micro-instruction units, but such units in CISC processors are able to handle unaligned byte/word access, so you can execute a complex load/store in two micro-instructions (address calculation and actual load/store), instead of the three-to-six it would take in a RISC machine.

So the question remains: are byte and word accesses so scarce that this performance hit in RISC processors isn’t critical? My guess is that it depends on the software mix. But if an architecture’s performance heavily depends on the kind of software executed, maybe it can not be called general-purpose…

Anyway, we live in a post-RISC world, where classical CISC and RISC machines do not exist anymore, and they are very alike, apart from the instruction encoding and the (non)existence of a micro-instruction translating unit (also, Windows and Unix are more similar than many people would like you to believe, but that’s a different matter). So today, all of this is moot.

1. asdf says:

This is all Alpha-specific. Outside of some special-purpose DSP instruction sets I can’t off the top of my head think of any other instruction set that does not support byte or half-word memory accesses. And as Raymond mentioned in one of the earlier posts, even the later Alpha models added them (starting with the 21164A, according to Wikipedia).

Still, although the Alpha ISA seems kind of ugly and inelegant compared to eg. MIPS64, it was very fast at the kinds of code it was designed to run. Unaligned accesses being slow is not a problem if you don’t have unaligned data!

2. Remember, for comparison, that the DEC Alpha 21064 (the first AXP chip) was in volume release in September 1992 at 150 MHz. For comparison, the Pentium 66 MHz wasn’t released until March 1993; by this point the Alpha had been released in a 200 MHz version – so the Alpha had a clock speed advantage as well as being dual-issue superscalar. The two put together meant that the Alpha was executing three to six instructions for every instruction a Pentium issued, depending on whether your code could use both U and V pipelines on the Pentium, or just U.

That, in turn, meant that for pure memory access (no compute), the Alpha was no slower than the Pentium (while it took 6 instructions, it executed those 6 in the same time frame it took the Pentium to execute 1), but when compute was required, the Alpha was much faster.

1. Someone says:

Outside of pure mathematical applications (and SPEC runs), this does not buy the RISC CPUs that much. Higher nominal clock, but the same (or worse or much worse) throughput (exhibit A: Itanium).
No web server, no word processor, no compiler, JIT compiler or interpreter, not even graphical computation is well handled by the AXP instruction set as represented so far, because of its glorious inability to perform simple memory operations. No (countable) constants as operands, no full addresses (or large enough offsets) as operands, only one single addressing mode, nothing to make string operations (that is, double-byte or single-byte arrays) performant? That's dumb, but for sure, not very advanced.

For example, if they really wanted to use 32-bit op codes ONLY, they could at least have designed an operation to load constants/addresses like this: Its (a) an unconditional jump over the next two 32-bit words, and (b) also a PC-relative load operation which loaded this two 32-bit words in a register. This way, they could have stayed “pure RISC”, but would allow much saner handling of constants with more than a few bits.

Silicon is cheap. There is no reason to not use it to make the CPU more efficient.

2. asdf says:

I think you’ll find the cost of silicon was somewhat higher in 1992. Web servers also weren’t that much of a consideration back then.

3. Someone says:

According to Wikipedia, the original EV4 had 1.68 million transistors in 1992, whereas the original Pentium had 3.1 million in 1993. DEC could have spent over a million transistors more to make the design somewhat less restricted, and therefore much more suitable for general-purpose applications.

4. Itanium is much more recent than Alpha, and suffered because it had neither an IPC advantage nor a clock speed advantage over its contemporary x86 chips. When the Itanium shipped in 2001, it clocked at 800 MHz, to the 1.13 GHz that Pentium III had at the time.

In contrast, the first 150 MHz Alpha chips were competing against 50 MHz 486-DX2 chips, and could execute two instructions per clock to the 486’s single IPC limit, for 6 times the instruction throughput of the 486. Thus, it didn’t matter that for memory access, the Alpha needed 6 times the instructions – it was executing 6 times as many instructions per second as the 486 could. By the time the Pentium arrived, the Alpha was enough faster that it could maintain the same ratio of instructions per second.

Context is everything, and at the time the Alpha was released, it was considerably faster than contemporary CISC CPUs; even if you hit its glass jaw (sub-word accesses), it was as fast as its contemporaries, while if you avoided that, it was faster.

5. Antonio Rodríguez says:

One thing is clear: the soundest defenses of the Alpha in this thread say that in context, the original Alpha was no slower than the original Pentium. To me, this means that, even when running at a far greater clock speed, the Alpha was comparable to the Pentium (back then, the best performant CISC processor – the 68060 would arrive later and with severe clock speed limits). Of course, contesting the best CISC processor of the time (and being a year ahead of it) is no small feat!

I’m in no way attacking the Alpha or the RISC architecture (after all, every x86 processor after the Pentium Pro is basically a RISC one with an instruction translation unit in front of it!). I’m just trying to look at the RISC vs. CISC claims of the 90s from the perspective that years give us.

6. Stronger than just “no slower than”; if you constructed an artificial benchmark to slow the Alpha down, you could force it down to Pentium speeds. If you ran real code, it could be as much as 6x faster.

Indeed, the Alpha was so much faster than contemporary x86 (Pentium, Pentium Pro) that the Alpha running DEC FX!32 could execute real x86 applications at the equivalent of an x86 at 40% of the Alpha’s clock speed. Because the Alpha was clocking much higher than its contemporary x86 competitors, this meant that it was a faster system overall (when FX!32 first arrived, Alphas were typically clocking 6x a Pentium, making FX!32 and an Alpha 1.6 times the speed of the x86 when running x86 code – it wasn’t until the Pentium II in 1997 that a native x86 processor was faster at running x86 code than an Alpha with FX!32).

It’s fun to speculate on what the world would look like if DEC had done one of two things differently:

1. Released a version of FX!32 that lets you mix x86 and Alpha code in an “Alpha” application.
2. Instead of using ARM’s ISA to demonstrate that the Alpha design techniques scaled to low power as well as high performance (StrongARM), they’d used Alpha’s ISA in a low power, decent performance chip.

7. Klimax says:

I’d like to see those benchmarks. Got any references to back up those performance claims?

8. @Klimax See, for example, http://www.realworldtech.com/x86-translation/3/ which claims a higher performance delta than I did; they found 50% to 70% of Alpha clock speed when running x86 code, whereas the figures I’ve got in my notes from 1998 say about 40% of Alpha clockspeed (probably because the P-II was a faster chip than the PPro).

9. “(a) an unconditional jump over the next two 32-bit words, and (b) also a PC-relative load operation which loaded this two 32-bit words in a register.” I still don’t see what the advantage is. The above sequence requires 3 longs to encode. (The unconditional jump, the PC-relative load, and the 32-bit value.) The same operation in the existing instruction set requires 3 longs to encode in the worst case and 2 instructions in the common case.

10. Klimax says:

@Simon Farnsworth
Thanks.

11. smf says:

“It’s fun to speculate on what the world would look like if DEC had done one of two things differently:”

The Intel/DEC settlement is probably their biggest downfall. Intel violates DEC patents, Intel settles by agreeing to buy StrongARM from DEC. Just how poisonous was StrongARM to DEC?

2. Cesar says:

Do not generalize from Alpha to all of RISC! Other RISC ISAs (including AFAIK later versions of Alpha) do have instructions to read or write a single byte or word. What’s less common is unaligned read/write, but even then some RISC ISAs do have instructions for unaligned read/write.

2. Matthew Vincent says:

Raymond, keep up the good work. I loved your series on the Itanium.

3. Joshua says:

The first thing that comes to mind is atomic_t for getting the size integer that can be written atomically.

The second thing that comes to mind is processing UTF-16 strings on Alpha sucks.

1. Yukkuri says:

Wow I sure should have edited that link before posting it…

2. Antonio Rodríguez says:

https://www.cs.princeton.edu/courses/archive/fall10/cos375/Byte-case.pdf

Interesting reading. And, if you remove the file from the URL, you get to the course’s main page, where you can find more interesting texts about computer architecture under the “Enrichment links” heading, at the bottom of the page.

4. Makes me glad that for the only RISC architecture I’ve worked with (MIPS) storing unaligned data is easy, since it has instructions that handle it all for you.

1. Hmm… now that I think on it, you still had to be 16-bit aligned I believe, but it had opcodes for writing 8-bit, 16-bit, 32-bit, or 64-bit values, so it was still simpler to perform the writes.

1. asdf says:

MIPS requires data to be naturally aligned, but has some special instructions (Load/Store Word Left/Right) to assist with handling unaligned data. These instructions were somewhat infamous as they were covered by patents which MIPS used to block unlicensed instruction set implementations.

1. asdf says:

I guess that should read “required”, since MIPSr6 supports misaligned memory accesses for normal loads and stores, and removes the unaligned memory access instructions.

5. Neil says:

Why would you want to zero-extend a negative long? I always associate zero-extending with unsigned integers. (Mind you, I’m still unclear as to this canonical form concept; is it just the case that there are a parallel set of instructions that include an additional sign-extension step as opposed to having a sign-extension instruction for when you need it, or is there something more to it?)

1. Neil says:

So, the answer to my parenthetical is that the Alpha AXP only has 64-bit comparison instructions; performing signed 64-bit compares on sign-extended 32-bit values obviously gives the same result as signed 32-bit compares but conveniently performing unsigned 64-bit compares on sign-extended 32-bit values also gives the same result as unsigned 32-bit compares.