# Notes on calculating constants in SSE registers

There are a few ways to load constants into SSE registers.

• Load them from memory.

• Load them from general purpose registers via movd.

• Insert selected bits from general purpose registers via pinsr[b|w|d|q].

• Try to calculate them in clever ways.

Loading constants from memory incurs memory access penalties. Loading or inserting them from general purpose registers incurs cross-domain penalties. So let's see what we can do with clever calculations.

The most obvious clever calculations are the ones for setting a register to all zeroes or all ones.

pxor    xmm0, xmm0 ; set all bits to zero
pcmpeqd xmm0, xmm0 ; set all bits to one

These two idioms are special-cased in the processor and execute faster than normal pxor and pcmpeqd instructions because the results are not dependent on the previous value in xmm0.

There's not much more you can do to construct other values from zero, but a register with all bits set does create additional opportunities.

If you need a value loaded into all lanes whose bit pattern is either a bunch of 0's followed by a bunch of 1's, or a bunch of 1's followed by a bunch of 0's, then you can shift in zeroes. For example, assuming you've set all bits in xmm0 to 1, here's how you can load some other constants:

pcmpeqd xmm0, xmm0 ; set all bits to one
-then-
pslld  xmm0, 30    ; all 32-bit lanes contain 0xC0000000
-or-
psrld  xmm0, 29    ; all 32-bit lanes contain 0x00000007
-or-
psrld  xmm0, 31    ; all 32-bit lanes contain 0x00000001

Intel suggests a different sequence for loading 1 into all lanes:

pxor    xmm0, xmm0 ; xmm0 = { 0, 0, 0, 0 }
pcmpeqd xmm1, xmm1 ; xmm1 = { -1, -1, -1, -1 }
psubd   xmm0, xmm1 ; xmm0 = { 1, 1, 1, 1 }

but that not only takes more instructions but also consumes two registers, and registers are at a premium since there are only eight of them in 32-bit mode. The only advantage I can think of is that psubd might be faster than psrld.
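For comparison, here is a rough sketch of both ways of getting 1 into every 32-bit lane, written with SSE2 intrinsics instead of assembly. The function names and the lane helper are made up for illustration, and the compiler is free to pick different registers or instructions than the ones named in the comments.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* All bits set: compare a zeroed register against itself (pcmpeqd). */
static __m128i all_ones(void)
{
    __m128i zero = _mm_setzero_si128();
    return _mm_cmpeq_epi32(zero, zero);
}

/* 1 in every 32-bit lane with a single shift (psrld by 31). */
static __m128i ones_by_shift(void)
{
    return _mm_srli_epi32(all_ones(), 31);
}

/* 1 in every 32-bit lane by subtraction: 0 - (-1) = 1 (psubd). */
static __m128i ones_by_subtract(void)
{
    return _mm_sub_epi32(_mm_setzero_si128(), all_ones());
}

/* Hypothetical helper: read 32-bit lane i (0 = lowest) for inspection. */
static uint32_t lane(__m128i v, int i)
{
    uint32_t out[4];
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}
```

Both functions produce the same value; the difference is only in how many instructions and registers the sequence is likely to need.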

In general, to load 2ⁿ−1 into all lanes, you do

pcmpeqd xmm0, xmm0 ; set all bits to one
-then-
psrlw  xmm0, 16-n  ; clear top 16-n bits of all 16-bit lanes
-or-
psrld  xmm0, 32-n  ; clear top 32-n bits of all 32-bit lanes
-or-
psrlq  xmm0, 64-n  ; clear top 64-n bits of all 64-bit lanes
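The same pattern can be sketched with intrinsics. The macro names below are invented for this example, and n is assumed to be a compile-time constant so that the shift count can be encoded as an immediate.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* All bits set: compare a zeroed register against itself (pcmpeqd). */
static __m128i all_ones(void)
{
    __m128i zero = _mm_setzero_si128();
    return _mm_cmpeq_epi32(zero, zero);
}

/* Hypothetical macros: 2^n - 1 in every lane; n should be a constant. */
#define LOW_MASK_EPI16(n) _mm_srli_epi16(all_ones(), 16 - (n)) /* psrlw */
#define LOW_MASK_EPI32(n) _mm_srli_epi32(all_ones(), 32 - (n)) /* psrld */
#define LOW_MASK_EPI64(n) _mm_srli_epi64(all_ones(), 64 - (n)) /* psrlq */

/* Hypothetical helper: read 32-bit lane i (0 = lowest) for inspection. */
static uint32_t lane(__m128i v, int i)
{
    uint32_t out[4];
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}
```

For example, LOW_MASK_EPI32(3) leaves 0x00000007 in every 32-bit lane, matching the psrld xmm0, 29 example earlier.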

Conversely, if you want to load ~(2ⁿ−1) = -2ⁿ into all lanes, you shift the other way.

pcmpeqd xmm0, xmm0 ; set all bits to one
-then-
psllw  xmm0, n     ; clear bottom n bits of all 16-bit lanes = 2¹⁶ - 2ⁿ
-or-
pslld  xmm0, n     ; clear bottom n bits of all 32-bit lanes = 2³² - 2ⁿ
-or-
psllq  xmm0, n     ; clear bottom n bits of all 64-bit lanes = 2⁶⁴ - 2ⁿ
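An intrinsics sketch of the left-shift variant follows. As before, the macro names are made up, and n is assumed to be a compile-time constant.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* All bits set: compare a zeroed register against itself (pcmpeqd). */
static __m128i all_ones(void)
{
    __m128i zero = _mm_setzero_si128();
    return _mm_cmpeq_epi32(zero, zero);
}

/* Hypothetical macros: -2^n in every lane; n should be a constant. */
#define HIGH_MASK_EPI16(n) _mm_slli_epi16(all_ones(), (n)) /* psllw */
#define HIGH_MASK_EPI32(n) _mm_slli_epi32(all_ones(), (n)) /* pslld */
#define HIGH_MASK_EPI64(n) _mm_slli_epi64(all_ones(), (n)) /* psllq */

/* Hypothetical helper: read 32-bit lane i (0 = lowest) for inspection. */
static uint32_t lane(__m128i v, int i)
{
    uint32_t out[4];
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}
```

HIGH_MASK_EPI32(30) gives 0xC0000000 in every 32-bit lane, the same value as the pslld xmm0, 30 example near the top.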

And if the value you want has all its set bits in the middle, you can combine two shifts (and stick something in between the two shifts to ameliorate the stall):

pcmpeqd xmm0, xmm0 ; set all bits to one
-then-
psrlw  xmm0, 13    ; all lanes = 0x0007
psllw  xmm0, 4     ; all lanes = 0x0070
-or-
psrld  xmm0, 31    ; all lanes = 0x00000001
pslld  xmm0, 3     ; all lanes = 0x00000008
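Here is how the two-shift trick might look with intrinsics. The macro names and their width/pos parameters are invented for this sketch: width is the number of set bits, pos is the bit position of the lowest set bit, and both are assumed to be compile-time constants. With intrinsics, instruction scheduling between the two shifts is left to the compiler.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* All bits set: compare a zeroed register against itself (pcmpeqd). */
static __m128i all_ones(void)
{
    __m128i zero = _mm_setzero_si128();
    return _mm_cmpeq_epi32(zero, zero);
}

/* Hypothetical macros: shift right to keep `width` low bits,
   then shift left to move them up to bit position `pos`. */
#define MID_MASK_EPI16(width, pos) \
    _mm_slli_epi16(_mm_srli_epi16(all_ones(), 16 - (width)), (pos))
#define MID_MASK_EPI32(width, pos) \
    _mm_slli_epi32(_mm_srli_epi32(all_ones(), 32 - (width)), (pos))

/* Hypothetical helper: read 32-bit lane i (0 = lowest) for inspection. */
static uint32_t lane(__m128i v, int i)
{
    uint32_t out[4];
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}
```

MID_MASK_EPI16(3, 4) reproduces the 0x0070 example, and MID_MASK_EPI32(1, 3) reproduces the 0x00000008 example.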

If you want to set high or low lanes to zero, you can use pslldq and psrldq.

pcmpeqd xmm0, xmm0 ; set all bits to one
-then-
pslldq xmm0, 2     ; clear bottom word, xmm0 = { -1, -1, -1, -1, -1, -1, -1, 0 }
-or-
pslldq xmm0, 4     ; clear bottom dword, xmm0 = { -1, -1, -1, 0 }
-or-
pslldq xmm0, 8     ; clear bottom qword, xmm0 = { -1, 0 }
-or-
psrldq xmm0, 2     ; clear top word, xmm0 = { 0, -1, -1, -1, -1, -1, -1, -1 }
-or-
psrldq xmm0, 4     ; clear top dword, xmm0 = { 0, -1, -1, -1 }
-or-
psrldq xmm0, 8     ; clear top qword, xmm0 = { 0, -1 }
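The byte shifts have intrinsic forms too: _mm_slli_si128 and _mm_srli_si128. The sketch below covers two of the cases above; the function names and the lane helper are hypothetical, and the shift count must be a constant.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* All bits set: compare a zeroed register against itself (pcmpeqd). */
static __m128i all_ones(void)
{
    __m128i zero = _mm_setzero_si128();
    return _mm_cmpeq_epi32(zero, zero);
}

/* Clear the bottom dword: { -1, -1, -1, 0 } (pslldq by 4 bytes). */
static __m128i clear_bottom_dword(void)
{
    return _mm_slli_si128(all_ones(), 4);
}

/* Clear the top qword: { 0, -1 } as qwords (psrldq by 8 bytes). */
static __m128i clear_top_qword(void)
{
    return _mm_srli_si128(all_ones(), 8);
}

/* Hypothetical helper: read 32-bit lane i (0 = lowest) for inspection. */
static uint32_t lane(__m128i v, int i)
{
    uint32_t out[4];
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}
```

Note that the lane helper counts from the low end, so lane 0 is the rightmost element in the { ... } notation used above.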

No actual program today. Just some notes from my days writing SSE assembly language.

Bonus chatter: There is an intrinsic for pxor xmmReg, xmmReg: _mm_setzero_si128. However, there is no corresponding intrinsic for pcmpeqd xmmReg, xmmReg, which would presumably be called _mm_setones_si128 or _mm_setmone_epiNN. In order to get all-ones, you need to get a throwaway register and compare it against itself. The cheapest throwaway register is one that is set to zero, since that is special-cased inside the processor.

__m128i zero = _mm_setzero_si128();
__m128i ones = _mm_cmpeq_epi32(zero, zero);

1. EduardoS says:

This is starting to look like Agner Fog's blog. Maybe good things in the wrong place?

2. Darius says:

The reason Intel goes with 3 instructions instead of a shift by immediate is probably the following: CPUs have just 1 execution port for vector shifts. That puts the vector shift instruction on the critical path and ties down that port. Vector subtraction has no such limitation and can choose from several execution ports on the latest hardware.

[That makes sense. They should have included the rationale in their recommendation. -Raymond]

3. Joshua says:

I'm surprised that 4 or 5 instructions in setup phase is cheaper than a fetch from memory. (My trick was to place the memory constant immediately above the entry point for high probability of cache hit).

[I ran across this problem in a situation where absolute memory access was not convenient (it's hard to write PIC for x86), so I had to survive entirely in registers. -Raymond]

4. Myria says:

Unfortunately, Visual C++ considers uninitialized variables to always have to come from memory, so the following does not work to do what we want:

__m128i _mm_setmone_si128() { __m128i meow; return _mm_cmpeq_epi32(meow, meow); }

Visual C++ will compile that as allocating 16 bytes of stack memory and reading from it without initializing it, then doing pcmpeqd. Initializing "meow" with _mm_setzero_si128 fixes it, at the cost of an extra pxor instruction.

5. Joshua says:

Ewww. I dislike inline assembly.

6. Owen Shepherd says:

Joshua: That's very likely to hit in L2, but not in L1 (because L1 is generally split into an I-cache and a D-cache).

That said, it's probably as good a place as any.

7. EduardoS says:

Joshua, that's not a good idea outside microbenchmarks because the line will be replicated in L1D and L1I; a constant pool will likely be cached without the risk of being replicated.

But mixing code and data is much more fun, pushing a string parameter is much more sexy this way:

call @F
db "a string parameter!", 0
@@:

8. Nathan says:

Regarding this:

> And if the value you want has all its set bits in the middle, you can combine two shifts (and stick something in between the two shifts to ameliorate the stall):

Wouldn't the fact that most modern processors are out-of-order, superscalar and pipelined ameliorate the stall, obviating the need for the programmer to do that manually?

[It definitely helps, but I prefer to give the CPU the extra scheduling help, in case it needs it. -Raymond]

9. cheong00 says:

After reading some articles, I believe PSUBD is also recommended for the following reason:

Section: Partial register dependencies

The execution pipeline interprets these as dependency-chain breaking idioms; so no stalls will occur on subsequent partial accesses to the register. In other words, even if the register has a read-after-write (RAW) dependency from some earlier instruction, the machine doesn’t need to check for that because all of the bits are going to be set to zero irregardless of what values resided there previously.

By the way, the following article also suggests using the pcmpeqd trick for setting all bits to 1:

software.intel.com/…/assembly-language-tips-tricks-for-the-intel-pentiumr-4-processor

So I guess the recommendation changes with time.

10. Josh B says:

@Nathan

The deep pipelining is what makes the stall so bad, and unless the front-end of the CPU can specifically recognize this as an operation that needs to be interleaved and can find suitable code to interleave it with, out-of-orderness won't fix it. There's only so much logic you can jam into the front-end while keeping up throughput, not to mention minimizing power use.

11. voo says:

@cheong00 Am I the only one who's a bit disturbed that Intel uses "irregardless" in that documentation? If nobody bothered to proofread that document, one can only hope that the technical advice is more sound.

12. cheong00 says:

@voo: To be fair, even books you paid for have spelling mistakes. I'm less concerned about this unless the misspelled word can have a different meaning in that paragraph.