Suppose you want to multiply two double-word values
producing a quad-word result,
but your processor supports only single-word multiplication
with a double-word result.
For concreteness, let’s say that your processor supports
Oh boy, let’s do some high school algebra. Let’s start with unsigned multiplication.
Let x = A × 2³² + B and y = C × 2³² + D, where A, B, C, and D are all in the range 0 … 2³² − 1.
| x × y = | AC × 264 + (AD + BC) × 232 + BD | ||
| = | AC × 264 + BD | + | (AD + BC) × 232 |
| provisional result | cross-terms | ||
Each of the multiplications (not counting the power-of-two multiplications)
is a
// Prepare our source registers
movq xmm0, x // xmm0 = { 0, 0, A, B } = { *, *, A, B }
movq xmm1, y // xmm1 = { 0, 0, C, D } = { *, *, C, D }
punpckldq xmm0, xmm0 // xmm0 = { A, A, B, B } = { *, A, *, B }
punpckldq xmm1, xmm1 // xmm1 = { C, C, D, D } = { *, C, *, D }
pshufd xmm2, xmm1, _MM_SHUFFLE(2, 0, 3, 1)
// xmm2 = { D, D, C, C } = { *, D, *, C }
The PMULUDQ instruction multiplies 32-bit lanes 0 and 2
of its source and destination registers,
producing 64-bit results.
The values in lanes 1 and 3 do not participate in the multiplication,
so it doesn’t matter what we put there.
It so happens that the PUNPCKLDQ instruction duplicates
the value, but we really don’t care.
I used * to represent a don’t-care value.
pmuludq xmm1, xmm0 // xmm1 = { AC, BD } // provisional result
pmuludq xmm2, xmm0 // xmm2 = { AD, BC } // cross-terms
In two PMULUDQ instructions,
we created the provisional result and the cross-terms.
Now we just need to add the cross-terms to the provisional result.
Unfortunately, SSE does not have a 128-bit addition (or at least SSE2
doesn’t; who knows what they’ll add in the future),
so we need to do that the old-fashioned way.
movdqa result, xmm1
movdqa crossterms, xmm2
mov eax, crossterms[0]
mov edx, crossterms[4] // edx:eax = BC
add result[4], eax
adc result[8], edx
adc result[12], 0 // add the first cross-term
mov eax, crossterms[8]
mov edx, crossterms[12] // edx:eax = AD
add result[4], eax
adc result[8], edx
adc result[12], 0 // add the second cross-term
There we go, a
Exercise: Why didn’t I use the rax register
to perform the 64-bit addition?
(This is sort of a trick question.)
That calculates an unsigned multiplication, but how do we do a signed multiplication? Let’s work modulo 2128 so that signed and unsigned multiplication are equivalent. This means that we need to expand x and y to 128-bit values X and Y.
Let
s = the sign bit of x expanded
to a 64-bit value, and similarly
t = the sign bit of y expanded
to a 64-bit value.
In other words,
s is 0xFFFFFFFF`FFFFFFFF if
x < 0
and zero if x ≥ 0.
The 128-bit values being multiplied are
| X = | s × 264 + x |
| Y = | t × 264 + y |
The product is therefore
| X × Y = | st × 2128 | + | (sy + tx) × 264 | + | xy |
| zero | adjustment | unsigned product |
The first term is zero because it overflows the 128-bit result. That leaves the second term as the adjustment, which simplifies to “If x < 0 then subtract y from the high 64 bits; if y < 0 then subtract x from the high 64 bits.”
if (x < 0) result.m128i_u64[1] -= y;
if (y < 0) result.m128i_u64[1] -= x;
If we were still playing with SSE, we could compute this as follows:
movdqa xmm0, result // xmm0 = { high, low }
movq xmm1, x // xmm1 = { 0, x }
movq xmm2, y // xmm2 = { 0, y }
pshufd xmm3, xmm1, _MM_SHUFFLE(1, 1, 3, 2) // xmm3 = { xhi, xhi, 0, 0 }
pshufd xmm1, xmm1, _MM_SHUFFLE(1, 0, 3, 2) // xmm1 = { x, 0 }
pshufd xmm4, xmm2, _MM_SHUFFLE(1, 1, 3, 2) // xmm4 = { yhi, yhi, 0, 0 }
pshufd xmm2, xmm2, _MM_SHUFFLE(1, 0, 3, 2) // xmm2 = { y, 0 }
psrad xmm3, 31 // xmm3 = { s, s, 0, 0 } = { s, 0 }
psrad xmm4, 31 // xmm4 = { t, t, 0, 0 } = { t, 0 }
pand xmm3, xmm2 // xmm3 = { x < 0 ? y : 0, 0 }
pand xmm4, xmm1 // xmm4 = { y < 0 ? x : 0, 0 }
psubq xmm0, xmm3 // first adjustment
psubq xmm0, xmm4 // second adjustment
movdqa result, xmm0 // update result
The code is a bit strange because SSE2 doesn’t have a
full set of 64-bit integer opcodes.
We would have liked to have used a
psraq instruction to fill a 64-bit field with a sign bit.
But there is no such instruction, so instead we
duplicate the 64-bit sign bit into two 32-bit sign bits (one in lane 2
and one in lane 3)
and then fill the lanes with that bit using psrad.
Lbh qvqa'g hfr gur enk ertvfgre orpnhfr ba n cebprffbe fhccbegvat gung, lbh jbhyq unir nyernql unq n zhygvcyvpngvba lvryqvat bar uhaqerq gjragl rvtug ovgf sebz gjb fvkgl sbhe ovg ertvfgref angviryl ninvynoyr.
re. Exercise: You're writing 32-bit code, therefore have no access to the RAX register, otherwise you would have just used 64-bit operands and a single multiplication instruction in the first place!
Shouldn't it be
add result[4], eax
adc result[8], edx
adc result[12], 0
? I think we need to carry the first term too?
Is ROT13 common on this board, or is Dasyatid1 a geocacher?
I see that it IS ROT13. At first I thought it was just insane gibbering.
Luckily I remembered the arcane Unix incantation: tr '[A-Za-z]' '[N-ZA-Mn-za-m]'
I need to benchmark but I suspect there are more efficient ways of doing this
I just thought it was a reference to all the movdqa, pmuludq, punpckldq making the post look like gibberish.
For the signed portion of the question, I think that the below shorter code will work. Whether it's faster, I have no idea. It uses the following identity, which is true assuming signed overflow is well-defined as wraparound: 1 + ~a == -a. To complete the explanation behind it, NOT is the same as XOR with all 1's (-1) and subtracting -1 is the same as adding 1.
mov eax, [x + 4] // high dword of x
mov edx, [y + 4] // high dword of y
xor eax, edx // XOR the sign bits
sra eax, 31 // ((x < 0) ^ (y < 0)) ? -1 : 0
xor [result + 8], eax
xor [result + 12], eax
sub [result + 8], eax
sbb [result + 12], eax
I tried running the original code and was not getting a correct answer in result dwords 1 and 2 out of 0-3. Thus, I haven't tested mine.
Once in a while, you'd find that you missed the ROT13 encoder that comes with Outlook Express.
"Luckily I remembered the arcane Unix incantation"
Fortunately, you are also (presumably) using a web browser to read this page, so you could paste that text into the textbox at rot13.com and get the result.
@RickC: … you could paste that text into the textbox at rot13.com and get the result.
M-x rot13-region
M-x all-hail-emacs
SFU for Windows XP didn't like that tr incantation for some reason, I had to use tr A-Ma-mN-Zn-z N-Zn-zA-Ma-m instead. (The square brackets are only needed for System V compatibility.)
Re ROT13: "Leetkey" addon for Firefox.
I'm confused by this:
movq xmm1, y // xmm1 = { 0, 0, C, D } = { *, *, C, D }
…
punpckldq xmm1, xmm1 // xmm1 = { C, C, D, D } = { *, C, *, D }
pshufd xmm2, xmm1, _MM_SHUFFLE(2, 0, 3, 1)
// xmm2 = { D, D, C, C } = { *, D, *, C }
Specifically, why is the unpack necessary? Can't you just shuffle {0,0,C,D} to {0,D,0,C} directly? (I honestly don't know – I've never played with SSE at all.)
xmm1 = { *, C, *, D }for thepmuludq xmm1, xmm0that follows. -Raymond]Thanks! I knew I was missing something.
I found this an interesting read and it made me wonder if I could do a similar trick with my 16-bit microcontroller that has an MPY32 peripheral (32×32=64 with 16-bit register interfaces). Then I wondered "What the heck would I do with a 128-bit number on a 16-bit MCU?" Not to mention figuring out how to represent it since the compiler tops out at 64-bit long long.
Certainly a trick to remember though, it may be useful in trying to implement a constant-time Curve25519 ECC computation algorithm…
"M-x rot13-region
M-x all-hail-emacs"
Sure, but then I'd have to use Emacs.
@Brian_EE
Not only can you do it with a 32×32 hardware multiplier peripheral, but you can also do it by recursively simulating each 32×32 multiply with 16×16 multiplies! (Is your mind blown yet?)
@Brian_EE
Also, you would store a 128-bit number with something like:
struct int128 {long long low, high;};
Anonymous Coward (again): You probably want the low 64 bits to be unsigned:
struct int128 {unsigned long long low; long long high;};
@Anonymous Coward: Yes, I know that I could use a struct to build my own data type for 128 bits. But then I also have to create all my own operators (such as addition etc) to manipulate it. The compiler tops out at 64 bits for compiler-generated operations. There is still the "what would I use this for" aspect in the low-power MCU world.