Auto-Vectorizer in Visual Studio 2012 – Rules for Loop Header

If you haven’t read previous posts in this series about auto-vectorization, you may want to begin at the beginning.

In this post, we look at those C++ loop patterns that the auto-vectorizer recognizes. We will start by explaining its rules for loop headers (defined below).

Throughout this post, we will talk about what the auto-vectorizer does, in its first release, in Visual Studio 2012.  We will extend the feature in future releases, so this advice will vary as time goes by, as the analyses tackle a wider and wider range of patterns.

Let’s focus on C++ for loops.  (The auto-vectorizer can work for certain kinds of while loops, but we’ll get to those later).  Here is the general pattern of a C++ for loop:

  1. for <loop-header> <loop-body>

Simply expanding <loop-header> we get:

  1. for (int n = L; n < U; n += S) <loop-body>

L is the lower bound of the loop; U the upper bound; S is the step size. (We wrote the test as n < U, but we might equally well have chosen n <= U. Rather than keep repeating that both patterns apply equally well, I will use the first form only).  First rule:

S must be +1

Why?  Consider a simple loop that steps up, 1 at-a-time:

  1. for (int i = 0; i < 1000; ++i) a[i] = b[i] * b[i];

The first four iterations of this loop therefore calculate:

  1. a[0] = b[0] * b[0]; a[1] = b[1] * b[1]; a[2] = b[2] * b[2]; a[3] = b[3] * b[3];

To vectorize this code, we would load b[0:3] into one of the vector registers, XMM0, say.  SSE provides an instruction (MOVUPS, or MOVAPS) that loads these 4 contiguous locations from memory into a vector register in one operation.  Once there, we calculate their squares quickly, again with a single vector instruction (MULPS).  And then we store the 4 results into a[0:3] with another (MOVUPS) single vector instruction. The whole operation is efficient, and a worthwhile optimization.

[NitPick asks: what’s with the words, in upper-case letters, like MOVUPS?  Answer: these are the names of particular SSE (vector) instructions.  I will include them occasionally, for anyone interested in details, but you don’t need to understand them in order to follow this blog.  For a full list of the several hundred SSE instructions available, pick either the online manuals from Intel or online manuals from AMD.  Both are free]

Now consider that same loop, but stepping up by some number, other than 1.  For example, by 3:

  1. for (int i = 0; i < 1000; i += 3) a[i] = b[i] * b[i];

The first four iterations would therefore calculate:

  1. a[0] = b[0] * b[0]; a[3] = b[3] * b[3]; a[6] = b[6] * b[6]; a[9] = b[9] * b[9];

To vectorize this code, we would need to gather the four values b[0], b[3], b[6] and b[9] into one of the vector registers.  In SSE, this is not straightforward, unfortunately.  You need to load b[0] into one XMM register, b[3] into another, then mush them together so they occupy different 32-bit slots in the same XMM register.  Then repeat, with a similar mush operation to get b[6] and b[9] into the vacant slots in that same XMM register.  (For those who might care, check the MOVSS and UNPCKLPS instructions)

Once we have b[0], b[3], b[6] and b[9] nicely tucked into the four 32-bit slots of an XMM register, we can calculate their squares quickly, in a single vector operation. But then we have to scatter the results to a[0], a[3], a[6] and a[9] .  This requires more ‘instruction gymnastics’.

All those instructions required to gather the non-contiguous array elements of b into XMM registers, and later, to scatter results back to the non-contiguous array elements of a, swamps the benefit of the fast multiplication that calculated the squares.  On the whole, for this example, there’s a simpler technique (called “loop unrolling”) using scalar multiplications that runs faster.

Long story short: use a step size of +1 in your C++ for loops.  Put another way, if you have step-down loops such as:

  1. for (int i = N; i > 0; --i) a[i] = b[i] * c[i];

then consider tweaking them to become step-up loops, like this:

  1. for (int i = 1; i <= N; ++i) a[i] = b[i] * c[i];

The next auto-vectorizer rule for the loop header is:

L can be any expression

because in C++ L is evaluated just once, at the start of the loop.  If the expression for L included some variable x, say, then you could even change the value of x later, within <loop-body>.  It doesn’t matter, because that change cannot alter the already-calculated lower bound.  Here’s an example:

  1. int x = 12;
  2. for (int i = x; i < 12345; ++i) { a[i] = x; x += 3; }

This loop will have L = x = 12. Even though <loop-body> changes the value of x, it cannot alter history, as-it-were, to change the fact that L = 12.  L is not re-evaluated as the loop executes.  In summary, L can be any valid expression – it can even include calls to functions.

What about U, the upper bound?  It’s a different matter altogether, and has the following rule:

U must be a constant for the duration of this loop

Recall that C++ does re-evaluate U each time around the loop.  For the auto-vectorizer to be sure that its transformations are safe, it must ensure that U does not change value as the loop executes.  In compiler jargon, we would say that its value is “loop invariant”.  To see why changing the value of U as the loop executes, introduces untold complications into the life of the auto-vectorizer, consider the following, contrived example:

  1. int u = 42;
  2. for (int i = 0; i < u; ++i) { a[i] = i; u -= 9; }

Before any attempt at vectorization, the loop will run like this:

Iteration

i

u on entry to <loop-body>

u on exit from <loop-body>
First 0 42 33
Second 1 33 24
Third 2 24 15
Fourth 3 15 6
Fifth 4 6 -3
Sixth 5 -3 Does not execute

So <loop-body> executes 5 times.  Its net effect is to set a[i] to i for i = 0 thru 4, inclusive.  But working out, at compile-time, how many times the loop will actually execute at runtime (called the loop’s trip count) , is quite tricky.  If you were the auto-vectorizer, what would you do?  Long story short, the auto-vectorizer declines the challenge, and the compiler falls back to generating scalar code.

Perhaps a more common example of this situation is where the upper bound involves a call to a function – fun in the following snippet:

  1. for (int i = 0; i < fun(); ++i) { a[i] = b[i] + c[i]; }

Unless the compiler can prove (to itself) that fun cannot possibly interfere with the loop body, it cannot auto-vectorize this code.

 

That’s probably enough discussion for this post.  We have explained the rules, or constraints, upon for loop headers, so that the auto-vectorizer stands any chance of vectorizing the loop body.  I’ve tried to make those rules clear, and obvious – probably what each of us would have chosen if we had been invited to play the part of a human auto-vectorizer – of transforming a C++ for loop into equivalent vector instructions.

Next time, we will look at the rules for loop bodies, by working through the examples in the puzzle from the last post.

[Our astute reader, NitPick, asked only one question in this post.  Perhaps his concentration lapsed, because it seems like there are a couple more lurking in the text.  Let’s see whether your comments hit the questions that NitPick missed]