More notes on calculating constants in SSE registers


A few weeks ago I noted some tricks for creating special bit patterns in all lanes, but I forgot to cover the case where you treat the 128-bit register as one giant lane: Setting all of the least significant N bits or all of the most significant N bits.

This is a variation of the trick for setting a bit pattern in all lanes, but the catch is that the pslldq instruction shifts by bytes, not bits.

We'll assume that N is not a multiple of eight, because if it were a multiple of eight, then the pslldq or psrldq instruction does the trick (after using pcmpeqd to fill the register with ones).

One case is if N ≤ 64. This is relatively easy because we can build the value by first building the desired value in both 64-bit lanes, and then finishing with a big pslldq or psrldq to clear the lane we don't like.

; set the bottom N bits, where N ≤ 64
pcmpeqd xmm0, xmm0 ; FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF
psrlq   xmm0, 64 - N ; 0000 0000 0FFF FFFF 0000 0000 0FFF FFFF
psrldq  xmm0, 8 ; 0000 0000 0000 0000 0000 0000 0FFF FFFF
 
; set the top N bits, where N ≤ 64
pcmpeqd xmm0, xmm0 ; FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF
psllq   xmm0, 64 - N ; FFFF FFF0 0000 0000 FFFF FFF0 0000 0000
pslldq  xmm0, 8 ; FFFF FFF0 0000 0000 0000 0000 0000 0000
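
The two sequences above can be sketched with SSE2 intrinsics. This is a minimal illustration rather than a definitive implementation: the function names are made up, it assumes an x86 compiler that provides `<emmintrin.h>`, and it passes the shift count in a register (`_mm_srl_epi64` / `_mm_sll_epi64`) so that N can be a run-time value, whereas the assembly uses an immediate.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: set the bottom n bits of the register, 1 <= n <= 64. */
static __m128i low_bits_le64(int n)
{
    assert(n >= 1 && n <= 64);
    __m128i count = _mm_cvtsi32_si128(64 - n);
    __m128i v = _mm_set1_epi32(-1);   /* all ones; compilers typically emit pcmpeqd */
    v = _mm_srl_epi64(v, count);      /* psrlq: n ones at the bottom of each 64-bit lane */
    return _mm_srli_si128(v, 8);      /* psrldq 8: slide the upper lane down, zero the top */
}

/* Hypothetical helper: set the top n bits of the register, 1 <= n <= 64. */
static __m128i high_bits_le64(int n)
{
    assert(n >= 1 && n <= 64);
    __m128i count = _mm_cvtsi32_si128(64 - n);
    __m128i v = _mm_set1_epi32(-1);
    v = _mm_sll_epi64(v, count);      /* psllq: n ones at the top of each 64-bit lane */
    return _mm_slli_si128(v, 8);      /* pslldq 8: slide the lower lane up, zero the bottom */
}
```

With n = 28 these produce the same bit patterns as the comments in the assembly listings.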

If N ≥ 80, then we shift zeroes into both the top and bottom halves, but then use a shuffle to patch up the half that needs to stay all-ones.

; set the bottom N bits, where N ≥ 80
pcmpeqd xmm0, xmm0 ; FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF
psrlq   xmm0, 128 - N ; 0000 0000 0FFF FFFF 0000 0000 0FFF FFFF
pshuflw xmm0, xmm0, _MM_SHUFFLE(0, 0, 0, 0) ; 0000 0000 0FFF FFFF FFFF FFFF FFFF FFFF
 
; set the top N bits, where N ≥ 80
pcmpeqd xmm0, xmm0 ; FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF
psllq   xmm0, 128 - N ; FFFF FFF0 0000 0000 FFFF FFF0 0000 0000
pshufhw xmm0, xmm0, _MM_SHUFFLE(3, 3, 3, 3) ; FFFF FFFF FFFF FFFF FFFF FFF0 0000 0000

We have N ≥ 80, which means that 128 - N ≤ 48, which means that there are at least 16 bits of ones left in the low-order bits of each 64-bit lane after we shift right. We then use a 4×16-bit shuffle to copy those known-all-ones 16 bits into the other lanes of the lower half. (A similar argument applies to setting the top bits.)
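
Here is a rough sketch of the shift-then-shuffle approach using SSE2 intrinsics. The function names are invented for illustration, and the shift count is passed in a register so that N can vary at run time; the shuffle control still has to be a compile-time constant.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */

/* Hypothetical helper: set the bottom n bits, 80 <= n <= 128. */
static __m128i low_bits_ge80(int n)
{
    assert(n >= 80 && n <= 128);
    __m128i count = _mm_cvtsi32_si128(128 - n);
    __m128i v = _mm_set1_epi32(-1);
    v = _mm_srl_epi64(v, count);      /* psrlq: at least 16 ones remain in each lane */
    /* pshuflw: copy the all-ones word 0 over words 1..3 of the lower half */
    return _mm_shufflelo_epi16(v, _MM_SHUFFLE(0, 0, 0, 0));
}

/* Hypothetical helper: set the top n bits, 80 <= n <= 128. */
static __m128i high_bits_ge80(int n)
{
    assert(n >= 80 && n <= 128);
    __m128i count = _mm_cvtsi32_si128(128 - n);
    __m128i v = _mm_set1_epi32(-1);
    v = _mm_sll_epi64(v, count);      /* psllq: at least 16 ones remain at the top of each lane */
    /* pshufhw: copy the all-ones word 7 over words 4..6 of the upper half */
    return _mm_shufflehi_epi16(v, _MM_SHUFFLE(3, 3, 3, 3));
}
```

The boundary case n = 80 shifts by 48, leaving exactly one all-ones 16-bit word in each lane for the shuffle to copy.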

This leaves 64 < N < 80. That uses a different trick: a byte shift followed by a signed 32-bit shift. Here it is for 96 ≤ N ≤ 120; for 64 < N < 80, shift right by 6 bytes instead of 1, and then by 80 − N bits instead of 120 − N.

; set the bottom N bits, where 96 ≤ N ≤ 120
pcmpeqd xmm0, xmm0 ; FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF
psrldq  xmm0, 1 ; 00FF FFFF FFFF FFFF FFFF FFFF FFFF FFFF
psrad   xmm0, 120 - N ; 0000 00FF FFFF FFFF FFFF FFFF FFFF FFFF
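
As a sketch of the byte-shift-then-signed-shift idea in SSE2 intrinsics (function names are hypothetical): a `psrad` count only has an effect within one 32-bit lane, so each byte-shift amount covers a limited range of N. A 1-byte shift handles 96 ≤ N ≤ 120, and a 6-byte shift handles 64 ≤ N ≤ 80.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: set the bottom n bits, 96 <= n <= 120. */
static __m128i low_bits_96_to_120(int n)
{
    assert(n >= 96 && n <= 120);
    __m128i count = _mm_cvtsi32_si128(120 - n);
    __m128i v = _mm_set1_epi32(-1);
    v = _mm_srli_si128(v, 1);         /* psrldq 1: top lane becomes 00FFFFFF */
    /* psrad: the top lane has a clear sign bit, so zeroes shift in;
       the other lanes are all ones and stay all ones */
    return _mm_sra_epi32(v, count);
}

/* Hypothetical helper: set the bottom n bits, 64 <= n <= 80. */
static __m128i low_bits_64_to_80(int n)
{
    assert(n >= 64 && n <= 80);
    __m128i count = _mm_cvtsi32_si128(80 - n);
    __m128i v = _mm_set1_epi32(-1);
    v = _mm_srli_si128(v, 6);         /* psrldq 6: lanes are 00000000 0000FFFF FFFFFFFF FFFFFFFF */
    return _mm_sra_epi32(v, count);   /* psrad: only the 0000FFFF lane changes */
}
```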

The sneaky trick here is that we use a signed shift in order to preserve the bottom half. Unfortunately, there is no corresponding left shift that shifts in ones, so the best I can come up with is four instructions:

; set the top N bits, where 64 ≤ N ≤ 96
pcmpeqd xmm0, xmm0 ; FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF
psllq   xmm0, 96 - N ; FFFF FFFF FFF0 0000 FFFF FFFF FFF0 0000
pshufd  xmm0, xmm0, _MM_SHUFFLE(3, 3, 1, 0) ; FFFF FFFF FFFF FFFF FFFF FFFF FFF0 0000
pslldq  xmm0, 4 ; FFFF FFFF FFFF FFFF FFF0 0000 0000 0000

We view the 128-bit register as four 32-bit lanes and split the work into three steps. First, we fill Lane 0 with the value we ultimately want in Lane 1. Then we patch up the damage we did to Lane 2 by copying the all-ones Lane 3 over it. Finally, we shift the 128-bit value left 32 places to slide the value into position and zero-fill Lane 0.
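
The four-instruction sequence can be sketched with SSE2 intrinsics as follows. The function name is made up for illustration, and the shift count is passed in a register so N can be a run-time value.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */

/* Hypothetical helper: set the top n bits, 64 <= n <= 96. */
static __m128i high_bits_64_to_96(int n)
{
    assert(n >= 64 && n <= 96);
    __m128i count = _mm_cvtsi32_si128(96 - n);
    __m128i v = _mm_set1_epi32(-1);
    v = _mm_sll_epi64(v, count);                        /* psllq: lanes 0 and 2 get the partial value */
    v = _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 3, 1, 0));  /* pshufd: overwrite lane 2 with all-ones lane 3 */
    return _mm_slli_si128(v, 4);                        /* pslldq 4: move lane 0 into lane 1, zero lane 0 */
}
```

For n = 76 the shift count is 20, so Lane 0 holds FFF0 0000 after the first shift and ends up in Lane 1.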

Note that a lot of the ranges of N overlap, so you often have a choice of solutions. There are other three-instruction solutions I didn't bother presenting here. The only one I couldn't find a three-instruction solution for was setting the top N bits where 64 < N < 80.

If you find a three-instruction solution for this last case, share it in the comments.

Comments (10)
  1. Ryan Phelps says:

    What have you been working on that this stuff is coming up?  Or is it just a hobby?

  2. Al Go says:

    He got tired of counting the ways he could arrange balls into boxes.

  3. Joshua says:

    Ok so it's a bit funny that Mr. Go turned up again, but really, you know, it must be an alias for somebody here commonly. I suppose Raymond could figure it out but I don't think he cares any more than the elephant cares to smite any particular gnat.

  4. JamesNT says:

    You guys can poke fun all you want, but I find these posts fascinating.  It's been a lot of fun looking up more information regarding this topic.

    JamesNT

  5. A regular viewer says:

    These calculations form the foundation for the transposition of algebraic first order polynomials on 2 dimensional N-planar geometry.

  6. Smithers says:

    I believe I have a three-instruction solution which covers 7 of the 14 remaining cases.

    ; set the top N bits, where 72 <= N <= 96

    pcmpeqd xmm0, xmm0

    pslldq xmm0, 7

    psrad xmm0, N-72

    The trick here is to shift further than we need to, then use a signed shift to get some of the ones back.

    E.g. N=77:

    FFFF FFFF|FFFF FFFF|FFFF FFFF|FFFF FFFF

    Unsigned shift left 56 bits

    FFFF FFFF|FFFF FFFF|FF00 0000|0000 0000

    Signed shift right each doubleword N-72 bits

    FFFF FFFF|FFFF FFFF|FFF8 0000|0000 0000

    Unfortunately, we can't do the left-shift by any more than 56 bits without clearing the bottom half completely, so we still can't do 64 < N < 72.

  7. Neil says:

    To set the top 72<N<128 bits to 1:

    pcmpeqd xmm0, xmm0

    pslldq  xmm0, 7

    psrad   xmm0, N - 72

    That still leaves 64<N<72 though.

  8. Sintendo says:

    I was going to suggest using the AMD-exclusive SSE4a instructions 'extraq' and 'insertq' somehow, but I forgot that they only operate on the lower 64 bits and leave the upper half undefined.

  9. gr8 m8 r8 8/8 says:

    a gr8 feature m8 im going have to rate ya 8/8

  10. Neil says:

    I must have had the page open for quite a while before submitting my comment, which explains how Smithers was able to submit his comment without me noticing. I'll just put it down to "Great minds think alike."
