Auto-Vectorizer in Visual Studio 2012 – How it Works


 

I assume you have read previous posts in this blog about auto-vectorization, starting with auto-vectorization-overview.

In this post, we will look at how the auto-vectorizer works.

Here is a tiny program that calculates the cube of each element in array b, setting the answer into a matching array called a. Of course, this program makes no real sense, because we have not initialized b.   But for this exercise, we are interested in the code generated by the compiler, which does not depend on the values held in b.

Code Snippet
  1. const int N = 1000;
  2. float a[N], b[N];
  3.  
  4. int main() {
  5.     for(int n = 0; n < N; ++n) a[n] = b[n] * b[n] * b[n];
  6. }

On line 2, we create the two C-arrays, called a and b, each holding 1000 floats, each 32 bits wide.   On line 5, in the main function, we cube b into a, element-by-element.   What kind of machine code will the compiler generate for this loop?

In essence, prior to Visual Studio 2012, each iteration of the loop – in particular, iteration number n – carries out 7 operations:

  1. load the value of b[n] into a register – call it reg0, for the sake of this description
  2. copy reg0 into another register, reg1, say
  3. multiply reg0 into reg1
  4. multiply reg0 into reg1 (again)
  5. copy reg1 into a[n]
  6. increment n by 1
  7. if n is less than 1000 goto step 1

[The astute reader – let’s call him NitPick – might ask why I said “in essence” in the last sentence.  Answer: the compiler will optimize this loop, using constant-folding, 6-fold loop-unroll and strength reduction.  It will also insert multi-byte NOPs to align branch targets.   But this simplified version is near enough to help understand what’s going on.   In a later blog, we might dig deeper into the truth]

That was pre-Visual Studio 2012.   But in Visual Studio 2012, the compiler will generate very different code for this loop, making use of the chip’s SSE instruction set, and its XMM vector registers.  (For a reminder on what this means, see an earlier post on What Is Vectorization.

We can denote the scalar operation (where the compiler, previously, would slavishly calculate one element of the result at a time) as: a[4] = b[4]3

We can denote the vector operation (where the compiler calculates four elements, in parallel, at a time) as: a[4:7] = b[4:7]3

Here, a[4:7] means the elements a[4] thru a[7] taken all together in one chunk.  Note that a[4:7] is not C++ syntax; it’s a neat notation, called array slicing, supported in several other languages, including FORTRAN.   If a and b each holds 1000 elements, we must carry out this operation – not 1000 times, as before – but just 250 times, since we are chomping thru the arrays, 4 elements at a time.

So (again, in essence), each iteration of the vectorized loop – in particular, iteration number n – carries out 7 operations, but each does 4 times as much work as before:

  1. load the value of b[n:n+3] into XMM0
  2. copy XMM0 into XMM1
  3. multiply XMM0 into XMM1
  4. multiply XMM0 into XMM1 (again)
  5. copy XMM1 into a[n:n+3]
  6. increment n by 4
  7. if n is less than 1000 goto step 1

[NitPick asks why I called the registers reg0 and reg1 in the scalar case.  Answer: I was being coy: the real-life registers used are the low 32-bits of vector registers XMM0 and XMM1.  I thought this might be confusing, so held back with the truth.  But note that the operations performed are as I described – just one arithmetic multiplication on one element of b at-a-time]

So much for the theory. How much faster does the code run, in practice?   On my PC, this example loop runs 2.9x faster after auto-vectorization.   Not as much as the theoretical maximum of 4x, but still a welcome boost in speed.

[NitPick again: why is the text sprinkled with italic items like: floats, a[n] and so on?  Answer: they denote items that would normally appear in code, rather than in English prose.  Books normally use a typewriter font for this purpose, as in a[n+1], but I find these hard to spot, and often ugly.  So I am using italic instead]

That’s it for the moment.  As always, please feel free to ask questions in the comments below.

Meantime, let me leave you with a puzzle:

Auto-vectorization performs detailed analyses in order to work out whether it’s safe to vectorize any given loop.   By “safe” we mean that a vectorized version of the loop yields the same answers as a scalar version of that same loop.   Not all loops can be vectorized.   Here are a few examples.   See if you can work out which ones vectorize or not.   (Think whether one iteration of a loop would interfere with other iterations of that loop).   In each case, imagine that a, b and c are declared as 1000-element arrays of float:

Code Snippet
  1. for (int n = 0; n < 1000; ++n) a[n] = b[n] + c[n];
  2. for (int n = 0; n < 999; ++n) a[n] = b[n] + c[n];
  3. for (int n = 0; n < 1000; ++n) a[n] = sin(0.01 * n);
  4. for (int n = 0; n < 1000; ++n) a[n] = n;
  5. for (int n = 0; n < 500; ++n) a[n] = a[n+1];
  6. for (int n = 1; n < 500; ++n) a[n] = a[n-1];
Comments (18)

  1. Adam says:

    I noticed that you didn't specify what type of load is being performed in the vectorized case.

    Although "float a[N], b[N];" doesn't specify alignment, would it be possible for the compiler to specify an alignment of 16 or 32 (for future 256 bit instructions) here rather than the current alignment of 8 here?

    Obviously in the general case, where you are dealing with arbitrary heap allocated memory, you can't make any strong guarantees about alignment. However, when dealing with stack allocated memory as we are in this example, the compiler should have the information needed to be able to make this optimization.

  2. Adam says:

    Pedantic note: Having re-read the example, the arrays here aren't actually 'stack allocated' since they are global variables, however, the reasoning should still hold since they are statically allocated.

  3. Jim Hogg says:

    By default, a float array would align to 4 bytes (static, stack or heap).  

    You could specify a 16-byte alignment with __declspec align(16) float a[N].  For heap, you could call _aligned_malloc.

    However, this alone does not ensure that the auto-vectorizer can assume it is always operating on aligned data (letting it use MOVAPS rather than the slower MOVUPS).  Because, you might call a function, passing (a+1), or (a+x) as an argument.  Then, inside that function, the auto-vectorizer cannot assume its input array is 16-byte aligned.

  4. NitPick says:

    > By “safe” we mean that a vectorized version of the loop yields the same answers as a scalar version of that same loop

    Not quite true for floating point numbers. In fact at some point you will mention the floating point accuracy switch (/fp:fast), without which you won't vectorize any of the floating point loops 😉

    But I am sure that comes later in your series.

    Great series btw!

  5. Dmitry Zaslavsky says:

    icc has different ways of telling compiler that the pointer is

    aligned on some boundary. That will cause icc not to to generate movups or loop-pealing code.

    the performance difference is rather large on anything before sandy bridge (and not ignorable especially on stores even on sandybridge)

    Any such support in vs2011?

  6. Arnaud Faucher says:

    In the first code snippet, what if the number of steps in the loop (N) is not a constant integer, but a variable integer that the compiler could always detect as a multiple of 4 ?

    Would the compiler vectorize the code ?

  7. Jim Hogg says:

    @Arnaud – the "trip count" of the loop does not need to be a multiple of 4 (see puzzle #2).  I'll explain this in the next post.

  8. Jim Hogg says:

    @Dmitry – in VS11 we use unaligned moves.  Yes this is slower than aligned moves (but the penalty reduces on more recent hardware).  As I mentioned in a reply above, simply making all arrays aligned is a good start.  But capitalizing on that alignment requires a mixture of static checks (best done under Whole Program Optimization), plus runtime checks for those left-over cases, leading to loop-peeling and/or mixed aligned/unaligned accesses for different arrays (post Dev11).  However, we're jumping ahead!  Thanks for the question – I'll add "alignment" as a topic for a future post.

  9. Jim Hogg says:

    @NitPick – yes, we will indeed come to /fp:fast in a future post.  

    But meantime, note that not all applicatiions of auto-vectorization on floats require /fp:fast.  For example, a[i] = b[i] + c[i] can vectorize under (our default) /fp:precise.  For other cases, such as reductions (eg: find the sum of all elements in an array), we only auto-vectorize if the user specifies /fp:fast (since the vectorized calculation does not preserve associativity of the floating-point operations under addition).

    (And, altho' I've not mentioned it so far, auto-vectorization can also work on integers, for which the switch does not apply)

  10. Dmitry says:

    Thanks for the reply Jim.

     Your reply to @NitPick is interesting. Once again with icc in background that does loop-peeling by default.

     Would you generate different code based on /fp:fast flag? Meaning with that you can do loop peeling otherwise you would drop to movups?

  11. Adam says:

    A few observations now that I've had the time to play with things in a bit more depth.

    float A[1048576];

    float B[1048576];

    int main()

    {

    for (int j=0; j<10000; j++)

    for (int i=0; i<_countof(A); i++) { B[i] = sqrt(A[i] * A[i] * A[i]) }

    }

    (using /fp:fast)

    VC11 is aligning the global arrays to a 16-byte boundary, whereas vc10 is aligning to a 4-byte boundary as you said. While that is an implementation detail that shouldn't be depended on, its good news. The end result is that on modern hardware, without explicitly specifying alignment, you are going to get good performance in the general case. On my Sandybridge cpu, the timing is essentially the same between the generated auto-vec code (with MOVUPS) and a version I have hex-edited to use MOVAPS instead. Pretty sweet!

    On older hardware (tested on Westmere), the use of MOVUPS here only adds about 5% overhead and that number goes down of course when you become more bound by computation than loads/stores, which is also why I threw in the sqrt computation.

    The other observation was that auto-vectorization seems to be pretty easily defeated. For example, the following loop degrades to serial execution:

    for (int i=0; i<_countof(A); i++) { B[i+0] = sqrt(A[i+0] * A[i+0] * A[i+0]) }

  12. Jim Hogg says:

    @Dmitry – with /fp:fast set, we can auto-vectorize cases that would be illegal under /fp:precise.  A good example is vectorizing a sum, such as:  for (int i = 0; i < N; ++i) sum += a[i].  Although, for typical applications, such as summing, or averaging sample data whose time order has no significance, we could argue it doesn't actually matter.  Anyhow, I'll explain this in a post a couple of weeks ahead.

  13. Jim Hogg says:

    @Adam.  I believe it's fortuitous that this example allocated A on 16-byte alignment.  Trying adding a declaration just before it, like "char c;" and try again.  But great to know the overhead for MOVUPS is small, and becoming smaller.

    Your last example where an an index of [i + 0] defeats the auto-vectorizer?   Umm – well, yes, you're right.  We analyse [i + K], [i – K] where K is a non-zero constant.  We even catch [i + x] where x is a parameter to the current function.  But we miss the particular case of [i + 0],  This one is fixed in a future drop.

  14. @Jim Hogg says:

    Those were my thoughts as well, I had tried a number of different combinations of declarations ensuring that the data was being dynamically written to and read from within multiple functions hoping that they wouldn't be optimized away, but things still seem to be aligned to 16-bytes. I was pleasantly surprised.

    I'm looking forward to the next post in this series!

  15. Adam says:

    I suspect all 6 of those puzzles should vectorize. A human could certainly vectorize them all. I don't have the beta compiler to try them out though.

    1. Is the simplest case.

    2. All but the last three iterations should vectorize.

    3. sin() is easily vectorizable, as is the type conversion. I suspect the tricky bit is the initial setup of putting n into an xmm register as {0, 1, 2, 3} and adding {4, 4, 4, 4} every iteration.

    4. This seems to be a simpler case than #3, so if #3 works this should.

    5. You can easily read 4 and then write elements at a time and get the same effect as a single element at a time. The compiler could also just replace the loop with a call to memmove().

    6. This fills half the array with a copy of the first element. Should be trivial to vectorize if the compiler understands that.

  16. Jim Hogg says:

    @Adam:

    Almost.  Number 6 is the most interesting.  Answers in the next post (due Tuesday 8th May).

  17. DOUBLE SHOT says:

    What of double type? What about CPUs that don't do SSE2?

    ■AMD CPUs prior to Athlon 64, including all Socket A-based CPUs

    ■Intel CPUs prior to Pentium 4

    ■VIA C3

    ■Transmeta Crusoe

    or do these CPUs not run the target OS of VS11?  I though VS11 targets xp, or can be made to, and all those CPIs run XP.  Is the compiler code smart enough to not A-V if the CPU wont?

  18. To DOUBLE SHOT says:

    Time to move on, dude. AMD CPUs prior to Athlon 64, Intel CPUs prior to P4, and others are ancient crufty designs that are being left behind.

    Evolve or die.