Creating double-precision integer multiplication with a quad-precision result from single-precision multiplication with a double-precision result using intrinsics (part 1)


Some time ago, I derived how to create double-precision integer multiplication with a quad-precision result from single-precision multiplication with a double-precision result, but I expressed it in assembly language. This time, I'll express it in C using intrinsics.

Using intrinsics rather than inline assembly language is slightly more portable, since all the compiler toolchains that implement the intrinsics agree on what the intrinsics mean. They disagree on the members of a __m128i, but at least they agree on the intrinsics.

First, a straightforward translation of the assembly language code to intrinsics:

__m128i Multiply64x64To128(uint64_t x, uint64_t y)
{
    // movq xmm0, x         // xmm0 = { 0, 0, A, B } = { *, *, A, B }
    auto x00AB = _mm_loadl_epi64((__m128i*) &x);

    // movq xmm1, y         // xmm1 = { 0, 0, C, D } = { *, *, C, D }
    auto x00CD = _mm_loadl_epi64((__m128i*) &y);

    // punpckldq xmm0, xmm0 // xmm0 = { A, A, B, B } = { *, A, *, B }
    auto xAABB = _mm_unpacklo_epi32(x00AB, x00AB);

    // punpckldq xmm1, xmm1 // xmm1 = { C, C, D, D } = { *, C, *, D }
    auto xCCDD = _mm_unpacklo_epi32(x00CD, x00CD);

    // pshufd xmm2, xmm1, _MM_SHUFFLE(1, 0, 3, 2)
    //                      // xmm2 = { D, D, C, C } = { *, D, *, C }
    auto xDDCC = _mm_shuffle_epi32(xCCDD, _MM_SHUFFLE(1, 0, 3, 2));

    // pmuludq xmm1, xmm0 // xmm1 = { AC, BD } // provisional result
    auto result = _mm_mul_epu32(xAABB, xCCDD);

    // pmuludq xmm2, xmm0 // xmm2 = { AD, BC } // cross-terms
    auto crossterms = _mm_mul_epu32(xAABB, xDDCC);

    // mov    eax, crossterms[0]
    // add    result[4], eax
    auto carry = _addcarry_u32(0,
                               result.m128i_u32[1],
                               crossterms.m128i_u32[0],
                               &result.m128i_u32[1]);

    // mov    edx, crossterms[4] // edx:eax = BC
    // adc    result[8], edx
    carry = _addcarry_u32(carry,
                          result.m128i_u32[2],
                          crossterms.m128i_u32[1],
                          &result.m128i_u32[2]);

    // adc    result[12], 0      // add the first cross-term
    _addcarry_u32(carry,
                  result.m128i_u32[3],
                  0,
                  &result.m128i_u32[3]);

    // mov    eax, crossterms[8]
    // add    result[4], eax
    carry = _addcarry_u32(0,
                          result.m128i_u32[1],
                          crossterms.m128i_u32[2],
                          &result.m128i_u32[1]);

    // mov    edx, crossterms[12] // edx:eax = AD
    // adc    result[8], edx
    carry = _addcarry_u32(carry,
                          result.m128i_u32[2],
                          crossterms.m128i_u32[3],
                          &result.m128i_u32[2]);

    // adc    result[12], 0      // add the second cross-term
    _addcarry_u32(carry,
                  result.m128i_u32[3],
                  0,
                  &result.m128i_u32[3]);

    return result;
}

The Microsoft Visual Studio compiler produces the following:

    ; standard function prologue
        push     ebp
        mov      ebp, esp
        and      esp, -16   ; room for _result on the stack
        sub      esp, 24

    ; load up x and y
        movq     xmm1, QWORD PTR _y$[ebp]
        movq     xmm2, QWORD PTR _x$[ebp]

    ; duplicate x
        punpckldq xmm1, xmm1

    ; make another copy of the duplicated x
        movaps   xmm0, xmm1

    ; duplicate y
        punpckldq xmm2, xmm2

    ; multiply main terms, result in xmm0
        pmuludq xmm0, xmm2

    ; shuffle and multiply cross terms, cross-terms in xmm1
        pshufd   xmm1, xmm1, 141
        pmuludq xmm1, xmm2

    ; Now the adjustments for cross-terms

    ; save a register
        push     esi

    ; save result to memory
        movaps   XMMWORD PTR _result$[esp+32], xmm0

    ; load up result[2] into esi and result[3] into ecx
        mov      esi, DWORD PTR _result$[esp+40]
        mov      ecx, DWORD PTR _result$[esp+44]

    ; load result[1] into edx by shifting
        psrldq   xmm0, 4
        movd     edx, xmm0

    ; xmm0 holds cross-terms
        movaps   xmm0, xmm1

    ; load crossterms[0] into eax
        movd     eax, xmm1

    ; prepare to load crossterms[1] into eax by shifting
        psrldq   xmm0, 4

    ; add crossterms[0] to result[1]
        add      edx, eax

    ; load crossterms[1] into eax
        movd     eax, xmm0

    ; xmm0 holds cross-terms (again)
        movaps   xmm0, xmm1

    ; prepare to load crossterms[3] into eax by shifting
        psrldq   xmm1, 12

    ; prepare to load crossterms[2] into eax by shifting
        psrldq   xmm0, 8

    ; add crossterms[1] to result[2], with carry
        adc      esi, eax

    ; load crossterms[2] into eax
        movd     eax, xmm0

    ; propagate carry into result[3]
        adc      ecx, 0

    ; add crossterms[2] to result[1]
        add      edx, eax

    ; load crossterms[3] into eax
        movd     eax, xmm1

    ; save final result[1]
        mov      DWORD PTR _result$[esp+36], edx

    ; add crossterms[3] to result[2]
        adc      esi, eax

    ; save final result[2]
        mov      DWORD PTR _result$[esp+40], esi

    ; propagate carry into result[3]
        adc      ecx, 0

    ; save final result[3]
        mov      DWORD PTR _result$[esp+44], ecx

    ; load final result
        movaps   xmm0, XMMWORD PTR _result$[esp+32]

    ; clean up stack and return
        pop      esi
        mov      esp, ebp
        pop      ebp
        ret

I was impressed that the compiler was able to convert our direct accesses to the internal members of the __m128i into corresponding shifts and extractions. (Since this code was compiled with only SSE2 support, the compiler could not use the pextrd instruction to do the extraction.)

Even with this very lame conversion, the compiler does quite a good job of optimiznig the code. The opening instructions match our handwritten assembly almost exactly; the second half doesn't match as well, but that's because the compiler was able to replace many of our memory accesses with register accesses.

The compiler was able to optimize our inline assembly!

We'll take this as inspiration for trying to get rid of all the memory accesses. The story continues next time.

Bonus chatter: In the three years since I wrote the original article, nobody picked up on the fact that I got the parameters to _MM_SHUFFLE wrong.

Bonus bonus chatter: I could have reduced the dependency chain a bit by tweaking the calculation of xDDCC:

    // pshufd xmm2, xmm1, _MM_SHUFFLE(0, 0, 1, 1);
    //                      // xmm2 = { D, D, C, C } = { *, D, *, C }
    auto xDDCC = _mm_shuffle_epi32(x00CD, _MM_SHUFFLE(0, 0, 1, 1));

    // punpckldq xmm1, xmm1 // xmm1 = { C, C, D, D } = { *, C, *, D }
    auto xCCDD = _mm_unpacklo_epi32(x00CD, x00CD);

Basing the calculation of xDDCC on x00CD rather than 0xCCDD removes one instruction from the dependency chain.

Bonus bonus bonus chatter: I chose to use punpckldq instead of pshufd to calculate xCCDD because punpckldq encodes one byte shorter.

Comments (11)
  1. Rick C says:

    ” nobody picked up on the fact that I got the parameters to _MM_SHUFFLE wrong.”

    Probably because it’s extra work having to go look the documentation up.

  2. Martin says:

    I didn’t pick up on the wrong parameters, either. Of course, I just come here for the free cookies and interesting stories (history, jargon speak, and the occasional non-computer ones).

  3. I am still trying to parse the title of this article with limited success.

    1. Tanveer Badar says:

      You too?

      1. Pierre B. says:

        I’m pretty sure it’s intentional and Raymond did it for the lols, here’s the decryption:

        1. Given a multiplication op that:
        1a. Takes two single-precision integers
        1b. Result in a double-precision integer

        2. Create a multiplication op that:
        1a. Takes two double-precision integers
        1b. Result in a quad-precision integer

        3. Using only intrinsics.

        1. Huzzah! That clears it up.

          Have an internet cookie. It’s on the house. No, you can’t opt out.

        2. It wasn’t for lols. What would you have titled it? Okay, maybe “Using intrinsics to create double-precision integer multiplication with a quad-precision result from single-precision multiplication with a double-precision result (part 1)” would have been better.

          1. JM says:

            That is still way too long, you have to think of modern readership. I recommend “The Top 10 Reasons Why Weight Loss Programs Don’t Work (And Integer Multiplication)”.

          2. Pierre B. says:

            You don’t need to sell the full product in the title.

            “Creating a wider multiplication operator using intrinsics” is short enough. Let the meat be in the text. I’m no SEO expert though. It may turn out that having a longer title helps with the search ranks.

          3. The blog server has a requirement that no two blog entries have the same title, so I try to make the title specific enough that I won’t try to reuse it in the future for a related issue. (For example, maybe in the future I need to build a different type of wide multiplication using intrinsics.) Also, I put all the words in there so I can find the article myself. Because this was a real problem I had to solve.

        3. Rick C says:

          I had to spend a while parsing it too. It helps being old enough to remember that on 8-bit processors, you had to do the same kind of thing to get 16-bit results from arithmetic operations on 8-bit numbers.

Comments are closed.

Skip to main content