If you haven’t read previous posts in this series about auto-vectorization, you may want to begin at the beginning.

In the last post, we explained the rules for C++ loop headers that make your code amenable to auto-vectorization; we might say, make it *auto-vectorizer-friendly*. Prompted by an online question from NitPick (our astute reader, who seems to enjoy dual citizenship in the blog world, and in the real world), let’s tidy off that discussion with a further, fourth rule:

*Loop counter must be of type int or size_t*

Consider the following, proto-typical loop:

- for (int n = L; n < U; ++n) a[n] = b[n] + c[n];

This will vectorize. And likewise, if we change *int *to *size_t*, it will also vectorize. (In the latter case, you would get a C4018 warning that *n < U* is a signed/unsigned mismatch, unless you were careful to define *U* as an unsigned type)

[NitPick asks: What about other data types as loop counter? Answer: we have *char*, *short*, *int*, *long* and *long long*; together with their 5 *unsigned* cousins. Some of these do indeed inhibit auto-vectorization in this first release, related to the indexing operations *a[n]*, *b[n]* and *c[n] *within the loop body. You can run the experiment quite easily, testing all 10 cases. You will find that any loop variable, which is *unsigned*, and narrower than the platform’s natural integer size, will inhibit vectorization. So, on x86, where the natural integer is 32 bits wide, *unsigned char* or *unsigned short*, don’t work. On x64, where the natural integer is 64 bits wide, *unsigned char*, *unsigned short*, *unsigned long* don’t work. (Recall that *long* is just 32 bits wide). But it’s simpler to use *int* or *size_t* and avoid the headache]

OK, if you follow these four simple rules for the loop header, the auto-vectorizer can start its next step of working out how to vectorize the loop body. So let’s now explore the patterns of loop body the auto-vectorizer recognizes.

Remember, a couple of posts ago, I left you with a puzzle? – a list of six, one-line loops, and asked you to work out which ones might vectorize and which not? In fact, the auto-vectorizer hits all but one of them. Let’s explain the examples, one-by-one.

#### Example 1 – Embarrassingly Parallel

- for (int n = 0; n < 1000; ++n) a[n] = b[n] + c[n];

This loop is very simple – in the literature of parallel processing, it’s called “embarrassingly parallel”; I’ve also come across the description “delightfully parallel”. Each iteration of the loop operates upon memory locations that are different from every other iteration. There is no overlap, or *interference* between iterations. For example, when *n* = 8, we have *a[8] = b[8] + c[8]*. And there is no other value of *n* for which this loop accesses those same three locations.

[NitPick objects that the array called *a* could overlap array *b* in memory. This would cause iterations to *interfere *with each other. This problem – where the same memory locations can be accessed by different names in the source program – is called *aliasing*. It’s a bane in the life of every optimizing compiler. The auto-vectorizer includes runtime checks against such aliasing. We will explain those in a later post. For the moment, let’s assume that our examples do *not* include any aliasing. Thank you NitPick, for keeping us on the right path]

The auto-vectorizer will transform this loop into its equivalent (we are using the same ‘slicing’ pseudo-code introduced in previous posts):

- for (int n = 0; n < 1000; n += 4) a[n:n+3] = b[n:n+3] + c[n:n+3];

This new loop will execute 250 iterations, calculating 4 results each time. On the first iteration it will calculate *a[0:3] = b[0:3] + c[0:3]*. On the next, *a[4:7] = b[4:7] + c[4:7]*. And so on.

#### Example 2 – Non-Divisible Upper Bound

- for (int n = 0; n < 999; ++n) a[n] = b[n] + c[n];

The only change from Example 1 is that the loop’s upper bound is 999, rather than 1000. It’s not divisible by 4 (remember, we are assuming our array elements are *float*s or *int*s, where each SSE vector register can hold 4 at-a-time). The auto-vectorizer transforms this loop as before, but is smart enough to know it must add scalar additions for the last 3 iterations. In effect, it transforms the code as follows:

- for (int n = 0; n < 996; n += 4) a[n:n+3] = b[n:n+3] + c[n:n+3];
- a[996] = b[996] + c[996];
- a[997] = b[997] + c[997];
- a[998] = b[998] + c[998];

#### Example 3 – Calling Math Functions

- for (int n = 0; n < 1000; ++n) a[n] = sin(0.01 * n);

This example is interesting because it includes a call to a math library function (*sin*, in this case). The auto-vectorizer includes vectorized versions of all the math functions defined in “math.h”. The vector library calculates four *sin*s on each call. We thereby vectorize this loop too.

#### Example 4 – Iota

- for (int n = 0; n < 1000; ++n) a[n] = n;

It might appear that this loop would pose a problem for the auto-vectorizer. Just think for a moment: how would you make use of vector instructions to implement this loop? . . .

In fact, the auto-vectorizer defines a variable, let’s call it *seq* for the sake of argument, as a short, 4-element array in memory, holding the values *[0,1,2,3]*

[NitPick asks what this notation means – a list of numbers enclosed in square brackets. Answer: we just invented it, to simplify our explanations. It’s not standard C++. In this blog post, it denotes a list of numbers, stored in consecutive locations – either in memory, or as adjacent fields within a vector register. NitPick: but, doesn’t this mean that *[3]* is ambiguous – since it could denote a vector with a single element, or the operation of indexing into element number 3 of an array? Yes, but we promise never to use a vector that has less than 2 elements in order to avoid this flaw in our notation, honest]

The auto-vectorizer transforms the original loop to look like this:

- seq = [0,1,2,3];
- XMM0 = [0,0,0,0];
- for (int n = 0; n < 1000; n += 4) {
- a[n:n+3] = seq + XMM0; // [0,1,2,3] + [0,0,0,0]
- XMM0 += [4,4,4,4]; // [4,4,4,4], [8,8,8,8], etc.
- }

Line 1 is where our variable *seq* is set up to hold the four values *[0,1,2,3]*. Line 2 sets up the SSE register called *XMM0* to hold the vector *[0,0,0,0]*. In line 4, we set *a[n:n+3]* to hold the vector addition of *seq* with *XMM0*. In line 5, we vector-add *[4,4,4,4]* into *XMM0* (there’s an SSE instruction for that!). So we see that the expression *seq + XMM0* will take on successive values *[0,1,2,3]*, *[4,5,6,7]*, etc. Quite neat, right?

[NitPick again: why did you call this example “Iota”? Answer: see iota]

#### Example 5 – Forward Dependency

- for (int n = 0; n < 500; ++n) a[n] = a[n+1];

Suppose that we have populated the array *a* by setting *a[n] = n* throughout. So *a = [0,1,2,3,4…999]*. * *The loop above, run on a single core, with no vectorization, will therefore compute the result *a = [1,2,3,4…500]*. Does this vectorize safely? In other words, does the following pseudo-code calculate the correct answer:

- for (int n = 0; n < 500; n += 4) a[n:n+3] = a[n+1:n+4];

Let’s work it out, on paper. On the first iteration, we calculate *a[0:3] = a[1:4]*. In pseudo-code, this step is performed as:

- XMM0 = a[1:4] = [1,2,3,4];
- a[0:3] = XMM0;

In other words, we copy the values of *a[1:4]* into *XMM0*, then write those 4 values into *a[0:3]*. Similarly for the next iteration. A little doodling with pencil and paper should convince you that this transformation is safe. The auto-vectorizer reaches the same conclusion and performs this transformation.

Note that this calculation involved dependence between two different iterations of the same loop. To explain what this means, consider the original loop, and its first two iterations:

- a[0] = a[1]; // original
- a[1] = a[2]; // original

Notice how *a[1]* is read on the first iteration, and then updated on the second. So we have to perform the calculations in this order. We *cannot* switch their order like this:

- a[1] = a[2]; // switched
- a[0] = a[1]; // switched

Because this would result in making *a[1]* = 2 and then also making *a[0]* = 2. There is dependence between the original first two iterations (in fact, between each iteration and its neighbor): the second original iteration **W**rites the value of a[1] after the previous original iteration has **R**ead that same value. Hardware engineers call this a Write-After-Read, or WAR, dependence. Compiler writers call it an anti-dependence. In the same way that we cannot switch the order of these two steps and get the right answer, neither can we run them in parallel on different processors. But we *can *run them in parallel in a vector register. We will touch upon the subject of dependence more, in a future post. For the moment, just be grateful that the auto-vectorizer takes care of these worries on your behalf.

#### Example 6 – Backward Dependency

- for (int n = 1; n < 500; ++n) a[n] = a[n-1];

This example differs from the previous *only *in that the right hand side of the assignment indexes *[n-1]* rather than *[n+1]*. The consequences are profound: this one does *not* vectorize safely! I’ve brought it into the picture, in order to break the illusion that the auto-vectorizer can conquer any old loop it comes across in your code.

Let’s look at what the simple scalar loop does. Again we’ll assume that we have populated *a = [0,1,2,3,4…999]*. We get the sequence of assignments:

Iteration |
Assignment |
a, after assignment |

[0, 1, 2, 3, 4, 5 … 999] | ||

First |
a[1] = a[0] | [0, 0, 2, 3, 4, 5 … 999] |

Second |
a[2] = a[1] | [0, 0, 0, 3, 4, 5 … 999] |

Third |
a[3] = a[2] | [0, 0, 0, 0, 4, 5 … 999] |

So the loop will set all the elements 0 thru 499 to the value of *a[0]*. A strange thing to do perhaps, but that is what the original code specified. What would happen if we blindly vectorize this loop? – if we transform it to become:

- for (int n = 1; n < 500; ++n) a[n] = a[n-1];

On the first iteration, we calculate *a[1:4] = a[0:3]*. Recall that, in pseudo-assembly-code, this step is performed as:

- XMM0 = a[0:3]; // [0,1,2,3]
- a[1:4] = XMM0; // [0,1,2,3]

So, starting with *a = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14…999]*, we obtain, after this first vectorized iteration, *a = [0,0,1,2,3,5,6,7,8,9,10,11,12,13,14…999]*. Already our answer differs wildly from the *correct *answer, which, at this stage, should be *a = [0,0,0,0,0,5,6,7,8,9,10,11,12,13,14…999]*. The following diagram depicts what’s going on:

0 |
1 |
2 |
3 |
4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |

0 | 0 |
1 |
2 |
3 |
5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |

The first row depicts the original *a* – the ‘before’* *picture. The 4-element vector *[0,1,2,3]* shown in bold-red, is used to stomp over the original entries for *a[1:4]*, yielding the second row – the *after *picture. The answer is already wrong. And it just gets worse as we complete successive vector iterations. Here we see the next step in the evolving carnage:

0 | 0 | 1 | 2 | 3 |
5 |
6 |
7 |
8 | 9 | 10 | 11 | 12 | 13 | 14 |

0 | 0 | 1 | 2 | 3 | 3 |
5 |
6 |
7 |
9 | 10 | 11 | 12 | 13 | 14 |

Anyhow, you get the idea: if the auto-vectorizer were to blindly vectorize this loop, it would produce wrong answers. So it doesn’t! Auto-vectorization is safe – it will never vectorize a loop if that would produce wrong answers – answers that differ from the single-thread, scalar calculation implied by the original C++ code.

At this point, you might ask how the auto-vectorizer can examine a C++ loop, and tell whether it’s safe to vectorize. However, that would lead us into a topic called “loop dependence analysis”. Not sure whether anyone is interested. Please add a comment below if you’d like a post explaining what’s going on.

Meantime, let’s keep with practical concerns: in the next post, I’ll explain how you find out which loops the auto-vectorizer vectorized, and which not.

Hi Jim,

Re: Example 4

Indeed this is code you generate, but why not generate a smaller (on my sandy bridge %10 faster) code?

seq = [0,1,2,3];

XMM0 = [0,0,0,0];

for (int n = 0; n < 1000; n += 4) {

a[n:n+3] = seq;

seq += XMM0

}

Note: one less addition

Oops, was too quick to post

Should read XMM0 = [4,4,4,4]

However the code I see generated in VS11 beta is actually different than both of these version.

My version is still 10% faster

re: Example 5

Didn't vectorize until I changed a from vector<int> to int *

Is this a beta issue?

Thanks

Q: Do all array accessed need to be in the form array[index]?

It seems that if a reference is involved then vectorization doesn't happen. For example:

for (size_t i = 0; i < length; ++i) { // this is vectorized

entries[i].fileName += offset;

entries[i].objName += offset;

}

for (size_t i = 0; i < length; ++i) { // this isn't

DebugEntry &entry = entries[i];

entry.fileName += offset;

entry.objName += offset;

}

Where DebugEntry is struct DebugEntry { char *fileName, *objName; }, entries is DebugEntry * and length is size_t.

What about more complex loops like:

for(…) {

a[i] = b[i] + c[i] * 3;

if (a[i] > 10) a[i] = 10;

b[i] = min(a[i], c[i]);

}

What restrictions are there on the "size" of the loop body and on the operations in it? (Assuming that the iterations are provably independent). Would a 20 line loop body be vectorized? What kinds of branching are allowed?

Hi Jim,

Example 3:

How would define the /fp:precise with respect to vectorization.

I would define as: the values produced in the loop are binary equal regardless of whether the loop is vectorized or not.

With that definition: We seems to agree that sum += a[i] won't be vectorized without /fp:fast

However we seem to disagree on example 3. Special vector functions used by vectorizer traditionally trade ulps (sometimes as little as 0.5ulp) for speed.

My point is that I would count Example 3 being vectorized under /fp:precise as a bug in VS11 beta.

p.s. sorry of all this dump here. feel free to delete my comments

@NitPick:

The machine code for Example 4 (Iota) was a simplified picture. Below is the actual code. This is taken from an x64 Release build, by setting and hitting a breakpoint in the loop, and looking at the "Disassembly" window:

int main() {

movdqa xmm1,xmmword ptr [__xmm@00000003000000020000000100000000]

xor eax,eax // n = rax = 0

lea rcx,[a] // rcx = &a[0]

nop word ptr [rax+rax] // 15-byte nop, for padding

movd xmm0,eax // xmm0 = [0,0,0,0]

add eax,4 // n = eax += 4

lea rcx,[rcx+10h] // rcx += 16 (&a[n])

pshufd xmm0,xmm0,0 // xmm0 = xmm00

paddd xmm0,xmm1 // xmm0 += xmm1

movdqu xmmword ptr [rcx-10h],xmm0 // a[n:n+3] = xmm0

cmp eax,3E8h // eax < 1000?

jl main+20h // yes, so loop

}

I've added comments to say what each line does. Notice that "xor eax,eax" actually clears all bits in rax. The notation xmm00 refers to the low 32 bits of the xmm0 register – just like in the "Registers" window.

Can you reply with the machine code (or intrinsics) in your faster version, please?

Re: Example 3, do namespace qualifications have an impact? E.g., if I include cmath instead of math.h and use std::sin instead of sin, does it still vectorize?

@NitPick:

Example 5 – right, vector<int> does not vectorize, but int[] does. We have not fixed this in the next drop, I'm afraid.

@ildjarn:

Nope, it doesn't matter. If you look inside cmath, you can see that it says #include <math.h>

[For the Beta drop, you will find the sources for the CRT in the folder: Program Files (x86)Microsoft Visual Studio 11.0VCcrtsrc )

@Mike:

Yes, creating a ref to the current element in 'entries' is enough to throw the auto-vectorizer off the job, I'm afraid.

In the next drop, we have included a command-line switch:

/Qvec-report:2

which reports which loops were vectorized and which not. For those cases where it does not vectorize, the report includes a reason code. I'll write about this in the next blog post.

@Jim

re: faster code for Iota

Here is the version that actually compiles and runs in VS2011 (think in VS2010 as well)

__m128i step = _mm_set_epi32(4,4,4,4)

__m128i curv = _mm_set_epi32(3,2,1,0)

__m128i *pa = (_m128i *)a;

for(int i < 0; i < size/4; i++) {

_mm_storeu_si128(pa+1, curv);

curv = _mm_add_epi32(curv, step);

}

This loop inside of another loop run in 570ms

The VS2011 loop runs in 690ms

Now aligning data by optionally dealing with the first unaligned ints manually

Results in 530ms

Yes, this is on sandy bridge CPU and aligned writes are still very important.

So, I am wrong it's more like 30% difference from what I can do by hand.

Way better than VS2010 would produce though.

@NitPick: re, /fp:fast

No, our vector math library runs the same accuracy as you get for scalar /fp:precise. In fact, even if you set /fp:fast, we still run to that same high accuracy. (The details on algorithms and testing are detailed in "Software Manual for the Elementary Functions" by Cody & Waite)

In the past, vector libraries did often compromise on accuracy – especially for graphics, where the required accuracy comes down to plus-or-minus one pixel, or about one part in 2000. But we don't do that in auto-vectorization.

re: /fp:fast

Simple test (assume a and b are vector<double>

for (int n = 0; n < size; ++n) a[n] = sin(double(n));

#pragma loop(no_vector)

for (int n = 0; n < size; n++) b[n] = sin(double(n));

First loop calls ___vdecl_sin2 (which forwards to ___sse4_sin2 on my machine)

Second loop calls __libm_sse2_sin_precise

Now, I am testing if there is any difference

for (int n = 0; n < size; n++) if(a[n] != b[n]) ….

There are plenty. For example for n = 4,10,11,14…..

They all differ in the last bit (great, that the 0.5 ulp I mentioned before)

I will try to check if in fact second version is more accurate.

My installation of vs11 beta doesn't have source for those functions, to easily check the source of the differences…

But, it's not even my point.

My main point is that vectorization changed the result. To me that should not be allowed under precise model.

@JimHogg

What about reduction loops? You've shown in the Channel9 interview that this vectorizes:

for(int i=0; i<N; ++i) sum += a[i];

but I've tried and saw that this doesn't:

for(int i=0; i<N; ++i) sum += a[i]*b[i];

This is not a contrived example – most linear algebra operations consist in such sums of products (vector-vector dot product, matrix-vector product, matrix-matrix product..). Vectorizing these could be a huge performance benefit. I also tested that Intel compiler successfully vectorizes this particular loop.

Are there rules for which reductions currently vectorize and which do not?

thanks,

-Ofek

@NitPick: re /fp:fast

First, to put things in perspective for our readers, we are talking here of a difference in the very last bit of a double's 53-bit significand.

Next, you proposed that /fp:fast is required if the vector result is not bit-for-bit identical with the scalar /fp:precise. This, I would submit, is wrong. I would argue that the obvious, and correct criterion is: the scalar /fp:precise and vector results must be "equally accurate". By "equally accurate" I mean that if the first lies within plus-or-minus N bits of the infinite-precision result – then so must the second.

Your measurements show that results of sin(x) for scalar /fp:precise and vector differ by at most 1 bit in 53. This is an astounding level of consistency! In fact, because the N-bit tolerance spans either side of the mathematical result, it's likely that some of the vector results lie closer to the correct result than the scalar /fp:precise. I therefore re-assert that this example of vectorization does NOT need to seek the "protection" of /fp:fast.

[However, we are in agreement that the "sum" example is only legitimate under /fp:fast, because the algorithm assumes associativity of floating-point addition. I guess I should explain this in a later post]

[In passing, it's just too tempting to contrast the difference between these mathematical results, and the reality of the phyiscal world we inhabit: an accuracy of 1 bit in 53 corresponds to around 1 in 10^16. This corresponds to measuring the distance between Seattle and New York to within about 4 Angstrom – the same scale as the diameter of an atom. And it exceeds, by many orders of magnitude, mankind's best endeavours to measure experimentally, the fundamental physical constants such as G, the electron mass, or the fine structure constant]

I wonder if it would be more efficient to use 'filling' elements when the upper bound is non-divisible. Then scalar additions would not be necessary. But perhaps setting up those filling elements is more expencive than those 1 to 3 scalar additions.

@Chris:

Sorry, I don't understand the "filling" elements suggestion. Can you explain a little more, please?

My question was based on this sentence: Example 2 – "The auto-vectorizer transforms this loop as before, but is smart enough to know it must add scalar additions for the last 3 iterations."

Would it be more efficient to still use vector additions for those last 3 iterations? It would be possible to fill the vector registers with only 3 values, and make the 4th value be a filling value. Later those filling values can just be discarded.

@OfekShilon:

Hi Ofek. I enjoyed your blog on the Auto Vectorizer. Here's the cross-link for readers – hope you don't mind.

thetweaker.wordpress.com/…/a-day-with-vs11-beta-part-2-auto-vectorizer

On auto-vectorizing the "dot-product" example – yes, we do get that one.

For the int case, it just works – well, so long as you prevent Dead-Code-Elimination (DCE) optimization from throwing away the entire computation!

For the float case, we do NOT auto-vectorize for the default /fp:precise setting. You need to throw /fp:fast, and then we DO auto-vectorize.

Here is a snippet that demonstrates. The two "cout" calls keep the results "live" and prevent DCE.

[I checked this on latest bits. I *think* it worked on Beta bits too, but I've not checked back]

int inta[1000], intb[1000];

float floata[1000], floatb[1000];

int main() {

int intsum = 0;

for (int n = 0; n < 1000; ++n) intsum += inta[n] * intb[n];

cout << intsum << endl;

float floatsum = 0.0f;

for (int n = 0; n < 1000; ++n) floatsum += floata[n] * floatb[n];

cout << floatsum << endl;

}

@OfekShilon

Works in beta as well and as Jim said your loop probably didn't vectorize because your a/b vectors are double/float and floating point model is not /fp:fast

For reduction you need that switch

@Jim

Speaking of that loop.

Compiler produces sub-optimal results there as well. It could be something like 40% faster

The issue is that compiler generated mulps follow by addps. Normally CPU has 2 different ports one that deals with mulps and one to deal with addps. So you can achieve 0.5 CPI.

The problem is that in VS11 the addps can proceed because it's waiting on the results of mulps.

Would you like me to post the suggested code using intrinsics?

Typo, the line should read

The problem is that in VS11 the addps can NOT proceed because it's waiting on the results of mulps.

@Chris:

OK, I understand now.

On the final iteration, we would need to store the "remnant" – the 1, 2 or 3 elements from the XMM register – into memory. We cannot store 4 values into memory, then tidy up the mess, of course, since that might just have overshot the array bounds.

Unfortunately storing 1, 2 or 3 elements within the SSE instruction set is a little messy. For a remnant of 1 use MOVSS. For a remnant of 2 use MOVLPS. For a remnant of 3, we would need to shuffle register contents before storing.

So, overall, it's simpler to run a scalar loop on the "remnant". But looking to the future, the AVX instruction set might make your approach easier – and important for loops with a small trip-count.

@Jim

I seriously hope you were kidding with your physics example. While for the final result the last bit may not be important it's accuracy of intermediate calculations that we are often concerned with.

Imagine I have 2 loops and I compute sin one was vectorized and the other wasn't

Now I do (a[i] – b[i]) * 10e20 and what was supposed to be zero is now a very large difference!

People sometimes don't want FMA. Even though FMA gives you better accuracy in the last bit. Even MS compiler supported #pragma fp_contract off

In my mind the entire goal of adding /fp:model is to create a mode for consistent results

@NitPick:

Yes, please post your suggested code patterns. Eric Brumer, one of the Devs on the auto-vectorizer team, will take a look. Thankyou!

@tobi:

The loop with "a[i] = b[i] + c[i] * 3" vectorizes.

If you then add the line "if (a[i] > 10) a[i] = 10" then it still vectorizes. The compiler transforms this into a "min" call (emitting the PMINSD instruction).

If you then add the line "b[i] = min(a[i], c[i])" it won't vectorize.

We don't vectorize "small" loops – it's likely better to do a scalar unroll. "small" is typically a dozen or so iterations.

Branching generally scuppers the auto-vectorizer. (Except where the code implements a "min" or "max" operation, which the auto-vectorizer infers, as in your second line). Likewise, multi-exit loops (break or return) don't vectorize.

I will be exploring more of such "patterns that vectorize" in the blog posts ahead.

Why are you so picky on the type of the iteration variable? I fear that other compilers will have other, incompatible requirements. Auto vectorization may also fail with boost foreach (no idea what type it uses).

@Bruno:

It just came down to what the team could get done in the time available. We have a long list of additional work to tackle for the next release, extending the patterns we cope with, and filling in annoying little holes, like restrictions on the type of the induction variable.

@Jim

re: for(int i=0; i<N; ++i) sum += a[i]*b[i];

Assuming pa and pb are "redefined" as

__m128 *pa = (__m128 *)(a);

__m128 *pb = (__m128 *)(b);

VS11 generates roughly this code (re-written with intrinsics)

__m128 psum = _mm_setzero_ps();

int loopSize = size / 4;

for (int i = 0; i < loopSize; i++) {

__m128 p = _mm_mul_ps(pa[i], pb[i]);

psum = _mm_add_ps(psum, p);

}

// Eat remainder and combine the results

Note, I am still at the mercy of compiler when it comes to register selection, re-use etc. when I use intrinsic functions. VS11 did a superb job for me these loops (I recall VS2010 before SP1 didn't, in case someone wants to try with older compiler)

Next step is to unroll the loop (this will start processing 8 elements at a time for floats)

Note: there is absolutely NO performance improvement when I'll do that.

for (int i = 0; i < loopSize; i+=2) {

__m128 p = _mm_mul_ps(pa[i], pb[i]);

psum = _mm_add_ps(psum, p);

p =_mm_mul_ps(pa[i+1], pb[i+1]);

psum = _mm_add_ps(psum, p);

}

Compiler nicely fold constants offsets (+1) .

And finally the important step I mentioned above

__m128 psum1 = _mm_setzero_ps();

__m128 psum2 = _mm_setzero_ps();

int loopSize = size / 4;

for (int i = 0; i < loopSize; i+=2) {

__m128 p = _mm_mul_ps(pa[i], pb[i]);

psum1 = _mm_add_ps(psum1, p);

p =_mm_mul_ps(pa[i+1], pb[i+1]);

psum2 = _mm_add_ps(psum2, p);

}

psum1 = _mm_add_ps(psum1, psum2);

// the remainder of the code is now almost the same as in the first loop

Now, the loop contains exactly the same instruction one by one as the unrolled version!

Except that it uses one more register and multiplication and additions are now independent.

As the reslult CPI is now at 0.34 (much better) and the loop is now 40% faster (for my CPU/array sizes etc…)

@NitPick:

Thankyou. Disappointing there was no improvement on unrolling the vectorized loop – maybe upping the unroll to 4 or 8 might see some win? Anyhow, I will pass this info on to Eric, and the rest of the Dev team, as we plan work for Dev12.

@Jim

The fact that there was no improvement for just unrolling is a good thing. CPU works really hard to "unroll" your loop on its own. Things like register renaming and loop detection play a big role here. In fact in the compiler you do NOT want to unroll for modern CPUs where it's no needed as this will only pollute cache and sometimes defeat loop detector etc… If there would be no data dependency you would not want to unroll this loop.

I showed the unrolled loop to emphasize that it is not unrolling but rather data dependency that is a problem

Does VC11 generate SSE4.1's BLENDPS/BLENDVPS and their double-precision/AVX variants?

@NitPick:

Eric has kindly offered to add a post that gives a detailed analysis of the loop performance. (May be a week or two until he clears urgent work on VS11)

@ysitu:

Not sure whether we emit those particular SSE4.1 instructions. (We don't emit any AVX instructions in this release)

In the blog so far, we haven't dug down into the actual machine code instructions the compiler emits. But it's perhaps worth explaining to folks that the SSE set is not a small addition by Intel or AMD! SSE adds around 230 instructions to the already huge x86/x64 set.

Example 6:

I am assuming that it can vectorize this:

for (int n = 0; n < 1000; ++n) a[n] = x;

and if it can, then shouldn't it be able to vectorize this:

for (int n = 1; n < 500; ++n) a[n] = a[n-1];

by converting it to

for (int n = 1; n < 500; ++n) a[n] = a[0];

Is this just a "for the future" case?

@Jim Hello

I've been trying to auto-vectorize a simple "max" loop using Visual, with no success up to now.

I'm using Visual 2012 update 4.

I'm surprised you say to @Toby:

If you then add the line "if (a[i] > 10) a[i] = 10" then it still vectorizes. Branching generally scuppers the auto-vectorizer. (Except where the code implements a "min" or "max" operation, which the auto-vectorizer infers, as in your second line).

I've been trying just that, the exact same version, but it fails, giving reason 1100 :

Loop contains control flow—for example, "if" or "?".

It doesn't like something like :

table[i] = table[i] > constant ? table[i] : constant;

giving reason 1304 : Loop includes assignments that are of different sizes.

which is not correct, because all elements in this line of code are unsigned (U32).

So it seems there's a limitation regarding min/max auto-vectorization on VC2012u4.

Or is there another way to make it work ?

How to do a Loop in a loop in a loop sir….

JUST COMMENT ME BACK….

@Jim Hogg

Sir post in the comment….the loop in a loop in a loop

This would seem to be an example of Embarrassingly Parallel, but it won't vectorize for me.

Anyone shed some light as to why?

__declspec( align( 16 ) ) class PhysicsSystem

{

public:

static const int32_t MaxEntities = 65535;

__declspec( align( 16 ) ) struct VectorizedXYZ

{

double mX[ MaxEntities ];

double mY[ MaxEntities ];

double mZ[ MaxEntities ];

VectorizedXYZ()

{

memset( mX, 0, sizeof( mX ) );

memset( mY, 0, sizeof( mY ) );

memset( mZ, 0, sizeof( mZ ) );

}

};

void Update( double dt )

{

for ( int32_t i = 0; i < MaxEntities; ++i ) <== 1200

{

mTmp.mX[ i ] = mPos.mX[ i ] + mVel.mX[ i ] * dt;

mTmp.mY[ i ] = mPos.mY[ i ] + mVel.mY[ i ] * dt;

mTmp.mZ[ i ] = mPos.mZ[ i ] + mVel.mZ[ i ] * dt;

}

}

private:

VectorizedXYZ mTmp;

VectorizedXYZ mPos;

VectorizedXYZ mVel;

};