The Swiss army knife
instruction of the PowerPC 600 series instruction set
is rlwinm
,
which stands for "rotate left word immediate and mask."
rlwinm rd, ra, imm5a, imm5b, imm5c rlwinm. rd, ra, imm5a, imm5b, imm5c ; also updates cr0
This instruction does everything except wash your floor.
First it takes the current value of the ra register and rotates it left by imm5a bits. Then it keeps all the bits imm5b to imm5c, inclusive, and clears all the other bits. If imm5b is greater than imm5c, the bits to keep wrap around! (Don't forget that bit zero is the most significant bit.) The result of all this is placed in the rd register.
Let's take a few examples.
rlwinm rd, ra, 5, 6, 20
Here, we take ra, rotate it left by 5 bits, and then keep bits 6 through 20. Since bit 0 is the most significant bit, then that means that the mask is
0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31 
0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0 
0 
3 
F 
F 
F 
8 
0 
0 
This comes out to 0x03FFF800
.
Therefore, the result of the operation is
rd = rotl(ra, 5) & 0x03FFF800
The other case is where the bit count wraps around:
rlwinm rd, ra, 5, 20, 6
This time, we start by setting bit 20, and continue to the end of the word, and then wrap around and start setting bits starting at bit 0, and then finally stop when we get to bit 6.
0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31 
1  1  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1  1  1 
F 
E 
0 
0 
0 
F 
F 
F 
The net result is therefore
rd = rotl(ra, 5) & 0xFE000FFF
The Windows debugger for PowerPC was not in use for very long, and consequently its disassembler isn't particularly advanced. But one thing it does do for you is unpack the last two parameters and tell you what the final mask is.
10ae222c 54642d0c rlwinm r4, r3,5,20,6 MASK=0xfe000fff
(Yes, the disassembler can't make up its mind whether it puts a space after a comma or not. Like I said, the debugger didn't last long enough to go through multiple iterations of polish.)
With the rlwinmi
instruction, we can build many things.
(In all examples, bit index arithmetic wraps around at 32.)

Rotate left by n bits:
rlwinm rd, ra, n, 0, 31
Rotate left by n bits, and specify a mask that starts at bit 0 and ends at bit 31. This mask is
0xFFFFFFFF
, which means that no bits get cleared. 
Rotate right by n bits:
rlwinm rd, ra, 32  n, 0, 31
Rotate left by 32 − n bits, and specify a mask that starts at bit 0 and ends at bit 31. This mask is
0xFFFFFFFF
, which means that no bits get cleared. 
Shift left by n bits:
rlwinm rd, ra, n, 0, 31  n
Rotate left by n and clear the rightmost n bits. In Windows NT, the most common version of this is
rlwinm rd, ra, 2, 0, 29
which shifts a value left two places. This multiplies it by four, which is a common operation when indexing an array of pointers.

Shift right by n bits:
rlwinm rd, ra, 32  n, n, 31
Rotating left by 32 − n is the same as rotating right by n. We then clear the leftmost n bits by saying we want to keep the bits starting from position n to the end of the register.

Pluck bit n out of a 32bit value:
rlwinm rd, ra, n + 1, 31, 31
Rotate left by n + 1 to position the desired bit in position 31, then clear all the other bits.

Extract a bitfield of length m at position n:
rlwinm rd, ra, n + m, 32  m, 31
Rotate left by n + m, which rightaligns the field, then clear all but the rightmost m bits.

Take the least significant m bits of a value and position it so it can be inserted into a bitfield starting at position n:
rlwinm rd, ra, 32  n  m, n, n + m  1
Rotate right by n + m, which positions the field into its final location, then clear all the bits that aren't used to represent the field.

Set a bitfield of length m at position n to zero:
rlwinm rd, ra, 0, n + m, n  1
A rotation of zero does nothing, but here's where we exploit the wraparound behavior of the mask: We keep the bits starting at n + m, which is the last bit past the end of the field, and continue keeping bits through the end of the register, and then wrap around and keep bits starting at bit zero, and stop at bit n − 1, which is the last bit before the start of the field.

Zeroextend a byte to a word and do not update cr0:
rlwinm rd, ra, 0, 24, 31
We perform no rotation and zero out the most significant 24 bits. We can also do zero extension with
andi.
, but therlwinm
instruction lets us do it without updating cr0. 
Zeroextend a halfword to a word and do not update cr0:
rlwinm rd, ra, 0, 16, 31
This time, after doing no rotation, we zero out the most significant 16 bits. Again, we could have done this with
andi.
, but this way lets us do it without updating cr0.
There is also a version of the instruction where the rotation amount comes from a register.
rlwnm rd, ra, rb, imm5a, imm5b rlwnm. rd, ra, rb, imm5a, imm5b ; also updates cr0
"Rotate left word and mask" is like "rotate left word immediate and mask", except that the rotation amount is specified by the value of a register. (Since rotation by a multiple of 32 bits is a nop, it doesn't matter whether bits 0 through 26 in rb are respected or ignored.)
rlwimi rd, ra, imm5a, imm5b, imm5c rlwimi. rd, ra, imm5a, imm5b, imm5c ; also updates cr0
"Rotate left word immediate and mask insert" rotates the value in the ra register left by imm5a bits, and then copies bits imm5b through imm5c of the rotated value (wrapping around if necessary) to rd, leaving the other bits of rd alone. This instruction is most useful for storing a value into a bitfield:
rlwimi rd, ra, 32  n  m, n, n + m  1
The above instruction takes the least significant m
bits of ra and sets them into a bitfield of size m
starting at position n in rd.
We did the math for this before, when we tried out
rlwinm
to position a bitfield.
By using rlwimi
, we get to store it.
Okay, now we can get to the true shift instructions.
slw rd, ra, rb ; rd = ra << (rb % 64) slw. rd, ra, rb ; rd = ra << (rb % 64), update cr0 srw rd, ra, rb ; rd = ra >> (rb % 64) srw. rd, ra, rb ; rd = ra >> (rb % 64), update cr0
The slw
and srw
instructions shift a register left or right by
an amount specified by the value of another register.
Notice that the shift amount is taken mod 64,
rather than mod 32.
This means that a shift by 63 will set the result to zero,
but a shift by 64 will do nothing.
sraw rd, ra, rb ; rd = (int32_t)ra >> (rb % 64), update carry sraw. rd, ra, rb ; rd = (int32_t)ra >> (rb % 64), update carry and cr0 srawi rd, ra, imm5 ; rd = (int32_t)ra >> imm5, update carry srawi. rd, ra, imm5 ; rd = (int32_t)ra >> imm5, update carry and cr0
The sraw
instruction performs an arithmetic right shift
by an amount specified by the rb register,
and the srawi
instruction does the same, but with
an immediate shift amount.
These four shift instructions are special because they always update carry:
The carry bit is set if and only if the original value was negative
and any bits shifted out were nonzero.
This rule for the carry bit allows you to follow the rightshift instruction
with an addze
to perform a division by a power of two
that rounds toward zero.
(If you omit the addze
,
then the right shift performs a division by a power of two that
rounds toward minus infinity.)
Exercise: How would you do a division by a power of two that rounds toward positive infinity?
Here's a sample code sequence to perform a Cstyle logical not operation:
cmpwi r3,0 ; set EQ if value was zero mfcr r3 ; r3 = cr rlwinm r3,r3,eq+1,31,31 ; save the EQ bit
Similarly, you can calculate the C logical <
and >
operations by performing the comparison
and extracting the lt or gt bit.
To get the other three comparisons !=
,
<=
and >=
,
you follow up with xori r3, r3, 1
.
Okay, those are the logical and shifting instructions. Next time, we'll look at memory access.
I assume, “shift left” and “shift right” is defined to match the usual arithmetic meaning (shifting left => multiply byf 2), or do they even flip this, to match their bit numbering?
Yes, the shift left and right have the usual arithmetic meaning.
I’m wondering what hardware engineers think of rlwinm and friends.
Are they straightforward to implement and we should ask ourselves why other ISAs didn’t have such instructions? Or are they a nightmare to implement, and a case could be made against having these instructions at all?
How would one implement them in the first place? The rotation part is simple, it’s the mask/insert part that makes me wonder.
The hardware is fairly straightforward.
It basically exposes the nuts and bolts of one of the standard barrel shifter architectures (google “Energydelay tradeoffs in 32bit static shifter designs” for a paper that covers it). The mask generation is already there (the PPC version is more flexible than usual, but not in a way that adds significant complexity), so is a bunch of multiplexers that perform the masked insertion on the final output stage, since they’re necessary to implement arithmetic shifts anyway (logical shifts and rotates could make do with just a bunch of ANDs).
The reason other ISAs don’t have this has little to do with the circuity (this is rarely ever the reason) and everything to do with how many bits it takes to encode these instructions. Two 5bit register numbers, three 5bit immediates for the shift distance and mask begin/end positions, and two more bits to encode the variants (“rlwinm”, “rlwinm.”, “rlwimi”, “rlwimi.”) makes for 27 bits used; that’s 1/32th of the entire 32bit opcode space (which is expensive real estate that you never get to take back!) dedicated to shifts. The original PPC designers clearly deemed that worthwhile, other ISAs thought differently.
PPC then got burned a few years later when they defined the 64bit version. Direct 64bit equivalents would need 6bit shift distances and mask positions, for an extra 3 bits taken. But 30 bits (1/4 of the opcode space) is just too much. So 64bit PPC keeps the same underlying HW for 64bit shifts but doesn’t let you encode all combinations of the mask begin/end immediates anymore, just certain combinations deemed valuable (the 32bit versions of instructions are still there and keep their generality though).
64bit ARM (A64 encoding) found a good compromise: it has 3 basic “bitfield move” instructions (BFM, SBFM, UBFM) that usually disassemble to one of 15 (!) different aliases. BFM corresponds to PPC “rlwimi”, UBFM corresponds to PPC “rlwinm”, and SBFM has no direct equivalent in PPC and is effectively a “rlwinm” with sign extension of the result. But unlike PPC which lets you specify the shift amount and mask begin/end independently, A64 only gives you variants that either move a contiguous bitfield from anywhere to a field starting at bit 0, or a contiguous bitfield starting at bit 0 to anywhere. That means you only need two immediates (saving 6 bits, which puts us in totally manageable territory for instruction lengths), and the general anywheretoanywhere operation, if desired, can be done in two steps. (Move the target bits to bit 0 in the first step, and from there to their final destination in the second).
That is fascinating!
Once I learned a couple of things about PowerPC for a code golf, and I distinctly remember the awe discovering this mad instruction that could do the strangest things to bitfields. Along with barrel shifter you get for free in a lot of ARM instructions, it made me think that these RISCy architectures try to compensate complex instructions with creative usage of shifts.
I don’t know how others feel, but this instruction doesn’t feel very RISClike to me.
It is called Reduced ISC not Minimal ISC… :)
RISC is defined by minimising addressing modes, only performing memory access with dedicated load/store instructions, and minimising interlocks. There’s nothing in these instructions that violates the core RISC values. They don’t access memory, they don’t have exotic addressing modes, they access at most two source registers and one destination register (plus condition/exception codes if applicable). The only thing that’s a bit unRISC is that rlwimi implicitly uses its destination as a source. The mask generation isn’t actually particularly complex in silicon – it’s no worse than a full lookahead barrel shifter.
> Exercise: How would you do a division by a power of two that rounds toward positive infinity?
How about this:
; r3 = dividend, quotient rounded to +inf
; r4 = divisor_log2
sraw r5, r3, r4
oris r3, r3, 0x8000
sraw r3, r3, r4
addze r3, r5
If you can assume that the value is not 0x80000000, then
subf ra, ra, 0
sraw rd, ra, rb (or rlwinm if rb is a constant)
subf rd, rd, 0
But if 0x80000000 is a possibility, then I was going to add 2^rb1 before shifting, but it looks like your solution is cleverer.
I’m reminded of OISC instruction sets (which have exactly one instruction and therefore do not need opcodes). The traditional one instruction is “subtract and branch if less than or equal to zero,” but I have to wonder how much complexity you would need to add to rlwinm to make it Turingcomplete all by itself. Probably just some branch semantics, but maybe if you can use the program counter as a destination, you don’t even need that.