The PowerPC 600 series has two addressing modes.
We will demonstrate them with the lwz instruction,
which loads a word from memory.
lwz rd, disp16(ra/0) ; load word from memory at ra/0 + (int16_t)disp16
lwzx rd, ra/0, rb ; load word from memory at ra/0 + rb
The regular load instruction fetches
a word from a memory location specified by a register
and a signed 16-bit displacement.
By convention, if no displacement is given,
it is assumed to be zero.
The Windows disassembler displays the displacement in hex
without a 0x prefix, but I'm going to put the prefix in
to minimize confusion.
The indexed load instruction adds two registers to determine the address from which to load the word. Note that you cannot use r0 as the base register for the load; if you try to use it, it comes out as zero.¹
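If the ra/0 notation is confusing, here is a rough Python model of how the two effective addresses are computed. It is only a sketch of the rules described above, not a description of the hardware; the function names and the register list are made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed 16-bit integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def base_value(regs, ra):
    """Value of the ra/0 operand: register number 0 reads as the constant zero."""
    return 0 if ra == 0 else regs[ra]

def ea_displacement(regs, ra, disp16):
    """Effective address for lwz rd, disp16(ra/0)."""
    return (base_value(regs, ra) + sign_extend16(disp16)) & MASK32

def ea_indexed(regs, ra, rb):
    """Effective address for lwzx rd, ra/0, rb. Only ra gets the zero treatment."""
    return (base_value(regs, ra) + regs[rb]) & MASK32

# Hypothetical register contents for the examples.
regs = [0] * 32
regs[0] = 0x1000
regs[3] = 0x2000
regs[4] = 0x0010

print(hex(ea_displacement(regs, 3, 0xFFFC)))  # displacement of -4 from r3
print(hex(ea_displacement(regs, 0, 8)))       # r0 as base is ignored
print(hex(ea_indexed(regs, 0, 0)))            # the footnote's trick: 0 + r0
```

Note that the rb operand is always read as a real register, which is why the footnote's trick works.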
Both of the instructions can be suffixed with u
("update") to set the ra register equal to the effective
address of the load.
(If an exception occurs on the memory access, the ra
register is not updated.
This allows the instruction to be restarted.)
lwzu rd, disp16(ra) ; load word from memory at ra + (int16_t)disp16
; and then set ra equal to ra + (int16_t)disp16
; ra may not be r0 or rd
lwzux rd, ra, rb ; load word from memory at ra + rb
; and then set ra equal to ra + rb
; ra may not be r0 or rd
The ra register cannot be r0 because r0 acts like the zero register during effective address calculations, and it would make no sense to update the zero register. The ra register cannot be the same register as rd because that would create a conflict between the two output registers.
The lwzu instruction is handy if you're walking through
an array, since it lets you step to the next item and fetch a word from
it in a single instruction.
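To make the array-walking idea concrete, here is a small Python model of lwzu. It is a sketch under a few assumptions of mine: memory is a little-endian bytearray, the displacement is passed as an already-signed integer, and the helper names are invented. The thing to notice is that the update happens before the access, so the pointer has to start one element before the array.

```python
import struct

def lwzu(memory, regs, rd, disp16, ra):
    """Model of lwzu rd, disp16(ra): load the word, then write the address to ra."""
    if ra == 0 or ra == rd:
        raise ValueError("ra may not be r0 or rd")
    ea = (regs[ra] + disp16) & 0xFFFFFFFF
    # If this load fails (raises), regs[ra] has not been touched yet,
    # which mirrors the "restartable" behavior described above.
    (value,) = struct.unpack_from("<I", memory, ea)
    regs[rd] = value
    regs[ra] = ea

def sum_words(memory, start, count):
    """Sum count words beginning at address start, one lwzu per element."""
    regs = [0] * 32
    regs[3] = start - 4            # start one element before the array
    total = 0
    for _ in range(count):
        lwzu(memory, regs, 4, 4, 3)    # lwzu r4, 4(r3)
        total = (total + regs[4]) & 0xFFFFFFFF
    return total

# Hypothetical memory image with a four-word array at address 16.
memory = bytearray(64)
struct.pack_into("<4I", memory, 16, 10, 20, 30, 40)
print(sum_words(memory, 16, 4))
```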
Okay, so here are the ways you can load data from memory.
I will present only the basic form of the instruction,
but understand that x, u, and ux forms are also available.
lbz rd, disp16(ra/0) ; load byte and zero extend
lhz rd, disp16(ra/0) ; load halfword and zero extend
lwz rd, disp16(ra/0) ; load word and zero extend
lha rd, disp16(ra/0) ; load halfword and sign extend
; (a = "arithmetic")
There is a bonus sign-extending load of halfwords, but sadly no sign-extending load of bytes.
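Here is a quick Python model of what ends up in the 32-bit destination register for the byte and halfword loads. It is a sketch that assumes little-endian memory, and the function names simply borrow the mnemonics. I also included extsb, a real PowerPC instruction that sign-extends the low byte of a register; using it after lbz is my guess at how you would make up for the missing sign-extending byte load, since this article doesn't say.

```python
def lbz(memory, ea):
    """Load byte and zero extend."""
    return memory[ea]

def lhz(memory, ea):
    """Load halfword and zero extend (little-endian memory assumed)."""
    return int.from_bytes(memory[ea:ea + 2], "little")

def lha(memory, ea):
    """Load halfword and sign extend to 32 bits."""
    value = lhz(memory, ea)
    return (value | 0xFFFF0000) if value & 0x8000 else value

def extsb(value):
    """Sign extend the low byte of a register value to 32 bits."""
    value &= 0xFF
    return (value | 0xFFFFFF00) if value & 0x80 else value

# Hypothetical memory contents.
memory = bytes([0x80, 0xFF, 0x34, 0x12])

print(hex(lhz(memory, 0)))          # zero extended halfword
print(hex(lha(memory, 0)))          # same halfword, sign extended
print(hex(extsb(lbz(memory, 0))))   # two-step sign-extending byte load
```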
Why does the lwz instruction say "and zero extend" even though
there's nowhere to extend to?
Because there would be a place to extend to if running on a 64-bit version
of the processor.
(Windows NT runs the processor in 32-bit mode, but the 64-bit
registers are available if the processor supports them.)
There is a corresponding set of store instructions.
stb rd, disp16(ra/0) ; store byte
sth rd, disp16(ra/0) ; store halfword
stw rd, disp16(ra/0) ; store word
; also "x", "u", and "ux" variants.
In particular, the stwu instruction is
extremely handy when setting up your stack frame,
which we'll see later when we learn about software conventions.
All loads and stores should be to suitably-aligned locations. The architecture permits but does not require the processor to support unaligned memory access in little-endian mode, and even if it does support unaligned loads, it might do so only partially. (For example, it might support unaligned loads provided they do not span multiple cache lines.) As noted earlier, if an unaligned store crosses into an invalid page, the processor is permitted to store the valid part before the exception is raised. If the processor does not support an unaligned operation, it will trap, and the kernel will emulate it.
There are no special instructions for assisting with unaligned loads. You're on your own:
; load halfword unaligned from n(r3) into r4 with zero extension
; requires a scratch register r5.
lbz r4, n(r3) ; r4 = least significant byte
lbz r5, n+1(r3) ; r5 = most significant byte
rlwimi r4, r5, 8, 0, 23 ; merge together
; load halfword unaligned from n(r3) into r4 with sign extension
; requires a scratch register r5.
lbz r4, n(r3) ; r4 = least significant byte
lbz r5, n+1(r3) ; r5 = most significant byte
rlwimi r4, r5, 8, 0, 23 ; merge together
extsh r4, r4 ; sign extend the halfword
; load word unaligned from n(r3) into r4
; requires a scratch register r5.
lbz r4, n(r3) ; r4 = least significant byte
lbz r5, n+1(r3) ; r5 = next most significant byte
rlwimi r4, r5, 8, 16, 23 ; merge together
lbz r5, n+2(r3) ; r5 = next most significant byte
rlwimi r4, r5, 16, 8, 15 ; merge together
lbz r5, n+3(r3) ; r5 = most significant byte
rlwimi r4, r5, 24, 0, 7 ; merge together
To load an unaligned value, you load up the individual bytes
and merge them using rlwimi.
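If you want to convince yourself that the masks in the word-load sequence are right, here is a Python model of rlwimi and of that sequence. It is a sketch, with helper names of my own choosing, and it assumes the usual PowerPC bit numbering in which bit 0 is the most significant bit.

```python
MASK32 = 0xFFFFFFFF

def rotl32(value, shift):
    """Rotate a 32-bit value left by shift bits."""
    shift &= 31
    return ((value << shift) | (value >> (32 - shift))) & MASK32

def mask(mb, me):
    """Ones in big-endian bit positions mb..me inclusive (bit 0 is the MSB)."""
    m = 0
    bit = mb
    while True:
        m |= 1 << (31 - bit)
        if bit == me:
            return m
        bit = (bit + 1) & 31   # masks are allowed to wrap around

def rlwimi(rd_value, rs_value, shift, mb, me):
    """Rotate rs left, then insert the masked bits into rd."""
    m = mask(mb, me)
    return (rotl32(rs_value, shift) & m) | (rd_value & ~m & MASK32)

def load_word_unaligned(memory, ea):
    """Mirror of the four-byte load sequence, with r4 and r5 as locals."""
    r4 = memory[ea]                    # lbz r4, n(r3)
    r5 = memory[ea + 1]                # lbz r5, n+1(r3)
    r4 = rlwimi(r4, r5, 8, 16, 23)
    r5 = memory[ea + 2]                # lbz r5, n+2(r3)
    r4 = rlwimi(r4, r5, 16, 8, 15)
    r5 = memory[ea + 3]                # lbz r5, n+3(r3)
    r4 = rlwimi(r4, r5, 24, 0, 7)
    return r4

# Hypothetical memory: the word 0x12345678 stored little-endian at odd address 1.
memory = bytes([0xAA, 0x78, 0x56, 0x34, 0x12, 0xBB])
print(hex(load_word_unaligned(memory, 1)))
```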
; store halfword unaligned from r4 to n(r3)
stb r4, n(r3) ; store least significant byte
rlwinm r4, r4, 24, 0, 31 ; rotate right 8 bits
stb r4, n+1(r3) ; store next significant byte
rlwinm r4, r4, 8, 0, 31 ; rotate back to original value
; (in case you still need the value)
; store word unaligned from r4 to n(r3)
stb r4, n(r3) ; store least significant byte
rlwinm r4, r4, 24, 0, 31 ; rotate right 8 bits
stb r4, n+1(r3) ; store next significant byte
rlwinm r4, r4, 24, 0, 31 ; rotate right 8 bits
stb r4, n+2(r3) ; store next significant byte
rlwinm r4, r4, 24, 0, 31 ; rotate right 8 bits
stb r4, n+3(r3) ; store next significant byte
rlwinm r4, r4, 24, 0, 31 ; rotate back to original value
; (in case you still need the value)
To store an unaligned value, you store the individual bytes.
Since the stb instruction stores the least significant
byte, each byte of the value takes its turn in the least significant position.
In practice, you are more likely to see the compiler
extract the bytes into a separate register
to avoid long dependency chains,
at the cost of an additional register.
; store halfword unaligned from r4 to n(r3), using r5 as scratch
stb r4, n(r3) ; store least significant byte
rlwinm r5, r4, 24, 0, 31 ; extract next significant byte
stb r5, n+1(r3) ; store next significant byte
; store word unaligned from r4 to n(r3), using r5 as scratch
stb r4, n(r3) ; store least significant byte
rlwinm r5, r4, 24, 0, 31 ; extract next significant byte
stb r5, n+1(r3) ; store next significant byte
rlwinm r5, r4, 16, 0, 31 ; extract next significant byte
stb r5, n+2(r3) ; store next significant byte
rlwinm r5, r4, 8, 0, 31 ; extract next significant byte
stb r5, n+3(r3) ; store next significant byte
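Here is a Python sketch of both word-store strategies, to show that they write the same bytes and that the in-place version really does leave the register with its original value. The function names are mine, and memory is modeled as a plain bytearray.

```python
MASK32 = 0xFFFFFFFF

def rotl32(value, shift):
    """Rotate a 32-bit value left by shift bits."""
    shift &= 31
    return ((value << shift) | (value >> (32 - shift))) & MASK32

def store_word_in_place(memory, ea, r4):
    """Rotate r4 right 8 bits after each byte store; returns the final r4."""
    for offset in range(4):
        memory[ea + offset] = r4 & 0xFF   # stb stores the low byte
        r4 = rotl32(r4, 24)               # rlwinm r4, r4, 24, 0, 31
    return r4                             # four rotations restore the value

def store_word_with_scratch(memory, ea, r4):
    """Extract each byte into a scratch register; r4 is never modified."""
    memory[ea] = r4 & 0xFF
    for offset, shift in ((1, 24), (2, 16), (3, 8)):
        r5 = rotl32(r4, shift)            # rlwinm r5, r4, shift, 0, 31
        memory[ea + offset] = r5 & 0xFF   # stb r5, n+offset(r3)

first = bytearray(8)
second = bytearray(8)
final_r4 = store_word_in_place(first, 1, 0x12345678)
store_word_with_scratch(second, 1, 0x12345678)
print(first.hex(), second.hex(), hex(final_r4))
```

In the scratch version each rotate depends only on r4, so the rotates do not have to wait for one another, which is the dependency-chain point made above.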
Okay, back to addressing modes: Treating r0 as zero for effective address computations gives you absolute addressing to the lowest and highest 32KB of memory. This isn't particularly useful in Windows NT, but I can see how it would be handy in an embedded system where there is no virtual memory. You could map the ROM to the low 32KB and RAM to the high 32KB, and now you have absolute addressing to your entire system.
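To see where the "lowest and highest 32KB" comes from, here is a short Python calculation of every address you can reach with disp16(0). It is a sketch with invented helper names, and it assumes a 32-bit address space.

```python
def absolute_address(disp16):
    """Address reached by disp16(0): the sign-extended displacement, modulo 2**32."""
    disp16 &= 0xFFFF
    signed = disp16 - 0x10000 if disp16 & 0x8000 else disp16
    return signed & 0xFFFFFFFF

def reachable_absolute(address):
    """True if address lies in the bottom or top 32KB of the address space."""
    return address <= 0x7FFF or address >= 0xFFFF8000

addresses = sorted(absolute_address(d) for d in range(0x10000))
print(hex(addresses[0]), hex(addresses[0x7FFF]))        # bottom 32KB
print(hex(addresses[0x8000]), hex(addresses[0xFFFF]))   # top 32KB
```

Nonnegative displacements cover the bottom 32KB, and negative displacements wrap around to the top 32KB.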
If you need absolute addressing to anything outside the top and bottom 32KB of address space, you'll have to do something else. One way is to build up the address as a 32-bit constant, like we saw earlier. But the PowerPC takes a different approach: By convention, the r2 register contains a value called the table of contents. But there are some other topics I want to get through before I dig into the Windows NT software conventions, so you'll have to be a bit patient.
Bonus chatter: There are additional instructions available in big-endian mode for loading and storing multiple registers, but they are not available in little-endian mode, so I won't cover them.
¹ Though if you really wanted to perform a load from r0, I guess you could use the indexed load
lwzx rd, 0, r0 ; load word from memory at 0 + r0
Come to think of it, that absolute addressing to the bottom 32k is quite usable. On NT3.5 and NT4, you could reliably map the bottom 32k by requesting a memory mapping at address 1. It would be a lot more sane to place it at 4k though. I could see the compiler getting smart and placing all the small globals in the .exe (can’t do this for .dll for the obvious reason) starting at 32k and working its way down to lower addresses. This would make globals much faster as there’s no fixup for them and no need to load a register with the global pointer either.
Mapping anything in the first few bytes of the virtual address space is asking for trouble. Straight up NULL isn’t the only common result of dereferencing NULL pointers, e.g. foo->bar might as well access 0x12.
Of course there’s no clear “good” value for the zero region. The first page (e.g. 4k on many architectures) used to be common. That will cover almost all structs and small to moderate sized arrays, and is consistent with your later suggestion. In 64bit systems nowadays, it’s convenient to just leave the first 4GB unmapped, that at least covers dereferencing NULL pointers of anything whose size fits into a 32bit unsigned integer, with the added benefit of making any stray 32bit value (which ints still are in (L)LP64, including Windows) an invalid address.
“This isn’t particularly useful in Windows NT, but I can see how it would be handy in an embedded system where there is no virtual memory.” Don’t forget low memory globals in classic Mac OS.