Auto-Vectorizer in Visual Studio 2012 – Did It Work?


 

If you’ve not read previous posts in this series about auto-vectorization, you may want to begin at the beginning.

This post will explain how to find out which loops in your C++ program were auto-vectorized.   Here is an example program, stored in a file called Source.cpp, with which to experiment:

  1. int main() {
  2.   const int N = 50;           // array dimensions
  3.   int a[N], b[N], c[N];
  4.   for (int n = 0; n < N; ++n) a[n] = b[n] * c[n];
  5. }

To keep things simple, I have omitted the code to initialize the arrays b and c.  However, that doesn’t matter for the purpose of this post.

Running from the Command Line

Let’s start by compiling this program from the command line (we’ll explain what to do in the Visual Studio IDE in a few minutes).

cl /c /O2 /Qvec-report:1 Source.cpp

This command tells the compiler to compile Source.cpp, but not to go on and link (that’s the /c switch). The /O2 switch tells the compiler to generate code that is optimized for speed.  This is crucial: the auto-vectorizer kicks in only when you enable optimization.  Finally, the /Qvec-report:1 switch tells the compiler to report which loops were successfully vectorized.  (Remember that these command-line switches are case-sensitive, so spell them exactly as shown.)  And here is the output:

Microsoft (R) C/C++ Optimizing Compiler Version 17.00.50520 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

source.cpp

--- Analyzing function: main
c:\source.cpp(4) : loop vectorized

This confirms that the loop on line 4 of Source.cpp was indeed vectorized.

Please note that the /Qvec-report:1 switch is not present in the February Beta drop of VS11, but it will be included in the next public drop, available soon.

The compiler also provides a /Qvec-report:2 switch.  This one tells you which loops were successfully auto-vectorized, and which were not, with a reason code.  Here is another snippet that includes a second loop (on line 5):

  1. int main() {
  2.   const int N = 50;           // array dimensions
  3.   int a[N], b[N], c[N];
  4.   for (int n = 0; n < N; ++n) a[n] = b[n] * c[n];
  5.   for (int n = 0; n < N; ++n) a[n] = a[n-1] + 7;
  6. }

And here is the corresponding report:

Microsoft (R) C/C++ Optimizing Compiler Version 17.00.50520 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

source.cpp

--- Analyzing function: main
c:\source.cpp(4) : loop vectorized
c:\source.cpp(5) : loop not vectorized due to reason ‘1200’

As you can see, the compiler auto-vectorized the loop on line 4 (as before), but failed to auto-vectorize the one on line 5, with a reason code of 1200.  This loop is similar to Example 6 – Backward Dependency that we analyzed in a previous post.  Vectorizing this loop would produce wrong results, and the auto-vectorizer is smart enough to know this.
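To make that dependence concrete, here is a hand-unrolled sketch of a few iterations of the line-5 loop.  This is my own illustration (the function name is mine), not compiler output:

  void unrolled_view(int a[50]) {
      a[1] = a[0] + 7;   // writes a[1]
      a[2] = a[1] + 7;   // reads the a[1] written on the previous line
      a[3] = a[2] + 7;   // reads the a[2] written on the previous line
      // Packing these statements into one vector operation would read the old
      // values of a[1] and a[2], giving answers different from the sequential code.
  }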

Before going on to explain the various reason codes, let’s catch up and explain how to request these results from the Visual Studio IDE.

Running from the IDE

For your project, select the “Release” (rather than “Debug”) configuration. (You can check the project properties to confirm that, under the covers, this sets the /O2 switch, just as we did above from the command-line).

In addition, navigate to “Property Pages”, “Configuration Properties”, “C/C++”, “Command Line”, “Additional Options” and add /Qvec-report:1.  Here’s a screenshot:

[Screenshot: the project’s Property Pages, with /Qvec-report:1 added under Configuration Properties > C/C++ > Command Line > Additional Options]

The build shown in the screenshot is for x64, but you can equally well choose x86.  Now, whenever you build the project, the output will include a report saying which loops were successfully vectorized.  As with the command line, please note that the /Qvec-report:1 switch is not present in the February Beta drop of VS11, but it will be included in the next public drop of VS11, available soon.

Reasons why Vectorization Was Not Possible

Recall that the auto-vectorizer is 100% safe: it will NEVER vectorize a loop if there is the slightest chance the generated code would produce wrong answers – answers different from those implied by the original sequential C++ code.

[NitPick again: what exactly are the answers implied by the sequential execution of a C++ program?  Answer: this is a deep question.  For our tiny examples, we will simply assume the answer is “obvious”.  For the general problem, try a web search for the topic “programming language semantics”]

Ensuring safety requires some pretty deep analysis of the input code.  It turns out that sometimes a loop would actually be safe to vectorize, but the analysis cannot prove it so.  The auto-vectorizer therefore refuses to vectorize that loop.  We say that its judgments are “conservative”.

The warnings from a /Qvec-report:2 run specify one of about 30 reason codes explaining why a given loop was not vectorized.

The reason codes are discovered and emitted from several layers deep within the compiler.  This can sometimes make it difficult to relate the specific issue back to the original C++ code, several layers above.  For example, the report may be produced from a loop in a function whose body has been inlined into its caller – so the original function, at this point in the analysis, no longer exists!  Bear this in mind as you read the explanations below for each reason code.  We will publish a fuller explanation, with examples, as part of the MSDN documentation – this will guide you on tweaking your code so that it vectorizes.

 

Reason Code | Explanation
500 | This is a generic message – it covers several cases: for example, the loop includes multiple exits, or the loop header does not end by incrementing the induction variable
501 | Induction variable is not local; or upper bound is not loop-invariant
502 | Induction variable is stepped in some manner other than a simple +1
503 | Loop includes Exception-Handling or switch statements
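As a rough sketch of my own (the function and variable names are mine, and the mapping from loop shape to reason code is my best guess, not verified compiler output), here are loops that would likely land in the 500 range:

  void sketch_500s(int *a, int n) {
      for (int i = 0; i < n; i += 2)    // induction variable stepped by +2, not +1
          a[i] = 0;                     // likely reported under code 502 (or 1301)

      for (int i = 0; i < n; ++i) {     // a second way out of the loop
          if (a[i] < 0) break;          // likely reported under the generic code 500
          a[i] = 1;
      }
  }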

 

Reason Code | Explanation
1100 | Loop contains control flow – if, ?:
1101 | Loop contains a non-vectorizable conversion operation (may be implicit)
1102 | Loop contains non-arithmetic, or other non-vectorizable operations
1103 | Loop body includes shift operations whose size might vary within the loop
1104 | Loop body includes scalar variables
1105 | Loop includes a non-recognized reduction operation
1106 | Inner loop already vectorized: cannot also vectorize outer loop
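Again as an unverified sketch of my own (names mine), two shapes that would likely fall into the 1100 range:

  void sketch_1100s(int *a, int n) {
      for (int i = 0; i < n; ++i)
          a[i] = (a[i] > 0) ? 1 : -1;   // ?: that is not a min/max idiom: likely code 1100

      int last = 0;
      for (int i = 0; i < n; ++i)
          last = a[i];                  // scalar whose final value is used after the loop:
      a[0] = last;                      // likely code 1104 (see the comment thread below)
  }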

 

Reason Code | Explanation
1200 | Loop contains loop-carried data dependences
1201 | Array base changes during the loop
1202 | Field within a struct is not 32 or 64 bits wide
1203 | Loop body includes non-contiguous accesses into an array

Reason code 1200 says the loop contains loop-carried data dependences that prevent vectorization.  This means that different iterations of the loop interfere with each other in such a way that vectorizing the loop would produce wrong answers.  More precisely, the auto-vectorizer cannot prove to itself that there are no such data dependences.

[NitPick asks: what is this “Data Dependence” thing you keep dragging into the conversation?  Answer: it lies at the heart of vectorization safety, and uses some interesting math – affine transformations and systems of Diophantine equations.  However, no-one commented last time that they wanted more details, so I’ll skip explanations]
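For the other codes in this range, here is a rough sketch of my own (names mine, not verified against the compiler) of accesses that would likely trigger 1201 and 1203:

  void sketch_1200s(int *a, int n) {
      for (int i = 0; i < n; ++i)
          a[2 * i] = 0;                 // strided, non-contiguous access into a: likely code 1203
                                        // (assumes a holds at least 2 * n elements)
      for (int i = 0; i < n; ++i) {
          *a = i;                       // the array base itself moves every iteration:
          ++a;                          // likely code 1201
      }
  }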

 

Reason Code | Explanation
1300 | Loop body contains no (or very little) computation
1301 | Loop stride is not +1
1302 | Loop is a “do-while”
1303 | Too few loop iterations for vectorization to be a win
1304 | Loop includes assignments that are of different size
1305 | Not enough type information
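A rough sketch of my own for two of the codes above (again, not verified compiler output):

  void sketch_1300s(int *a, int n) {
      int i = 0;
      do {                              // a "do-while" rather than a "for": likely code 1302
          a[i] = 0;                     // (assumes n >= 1)
      } while (++i < n);

      for (int j = n - 1; j >= 0; --j)  // loop walks backwards, so the stride is -1:
          a[j] = a[j] + 1;              // likely code 1301
  }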

 

Reason Code | Explanation
1400 | User specified #pragma loop(no_vector)
1401 | /kernel switch specified
1402 | /arch:IA32 switch specified
1403 | /favor:ATOM switch specified and loop includes operations on doubles
1404 | /O1 or /Os switch specified

The 1400s reason codes are straightforward – you specified some option that is just plain incompatible with vectorization.
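For example, code 1400 simply reflects an explicit request in the source.  A minimal sketch (the function name is mine; the pragma is the one listed in the table above):

  void sketch_1400(int *a, int n) {
  #pragma loop(no_vector)               // ask the compiler to skip vectorizing the next loop
      for (int i = 0; i < n; ++i)       // reported under code 1400
          a[i] = a[i] * 2;
  }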

 

Reason Code | Explanation
1500 | Possible aliasing on multi-dimensional arrays
1501 | Possible aliasing on arrays-of-structs
1502 | Possible aliasing and array index is other than n + K
1503 | Possible aliasing and array index has multiple offsets
1504 | Possible aliasing – would require too many runtime checks
1505 | Possible aliasing – but runtime checks are too complex

The 1500s reason codes are all about aliasing – where a location in memory can be accessed by two different names.
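The classic shape is a loop over pointer parameters, where the compiler cannot see whether the two pointers overlap.  A sketch of my own (names mine): for a case this simple the compiler can often insert runtime overlap checks, and the 1500 codes fire when those checks would be too many or too complex.

  void scale(float *dst, const float *src, int n) {
      // dst and src might refer to overlapping memory; unless the compiler can
      // prove otherwise (or insert a cheap runtime check), it must assume aliasing.
      for (int i = 0; i < n; ++i)
          dst[i] = src[i] * 2.0f;
  }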

Finally, note that the reason codes listed above apply to this first release of the auto-vectorizer.  Subsequent releases will likely stop emitting many of these warnings, as we make the compiler ‘smarter’ at recognizing more and more loop patterns.

The topic of aliasing cropped up earlier.  The time seems ripe to explain this term – what it means; why it’s a nuisance; how the auto-vectorizer deals with it. Although the alias analysis performed by a compiler is complex, we can explain the nub of the problem, via examples, in just a few paragraphs.  Let’s aim to do that in the next post.

Comments (23)

  1. Jim Hogg says:

    Just fixed a typo – /arch:ATOM should in fact be /favor:ATOM.  (Thanks to Juan for spotting this)

  2. David says:

    Is there a pragma or similar to have the compiler emit a warning or error if a specific loop wasn't possible to vectorize? This could be very useful if you have some very performance critical code and want to be sure that it vectorizes as part of the build process. I see a #pragma loop(no_vector) mentioned, but it looks like it isn't yet documented.

  3. Jim Hogg says:

    @David:

No, we don't have a way to "monitor" a given loop (although it has been requested; and we have considered it).  The best alternative is to specify the /Qvec-report switch and scan the build logs to check for any regression.

    Documentation for auto-vectorization is still in progress.  This blog gives the most up-to-date info, although we are still tweaking final details (like the exact format of the auto-vectorization warning messages).

  4. David says:

    Is there a pragma and flag to be able to force vectorization to happen, similar to #pragma vector in Intel compiler? Will you publish a table comparing vector #pragmas/flag alternatives to the Intel Compiler?

    wiki.duke.edu/…/Intel+Compiler+Optimizations

    Thanks

  5. Jim Hogg says:

    @David:

No, we are not supporting a way to force vectorization.  As you know, such a flag is suitable only for expert users: if you misuse or misunderstand it, the program will give wrong answers.

We have no plans to summarize differences between flags/switches/options in the Microsoft compiler, versus others.

  6. Bruce Dawson says:

    Does the __restrict keyword help with vectorization, by indicating when two pointers that could alias are, in fact, guaranteed not to?

  7. ysitu says:

    I still don't buy the reasoning for not supporting forced vectorization. Forced parallelization (parallel_for, parallel_for_each) is even more dangerous when put in the wrong hands (timing bugs come to mind), yet VC happily supports it.

    BTW, judging from reason code 1100, VC11 does not vectorize (a ? b : c) except for the min/max idioms, does it?

    What does reason code 1104 mean? Is it that VC11 will vectorize

    for (…) a[i] = b[i];

    but not

    float t;

    for (…) { t = b[i]; a[i] = t; }

    t = 0; // explicit KILL

  8. Jim Hogg says:

    @Bruce:

    Yes, __restrict helps.  In this first release, for auto-vectorization, we benefit by avoiding alias checks.  And it's on the TODO list to make wider use of the __restrict "user guarantee".
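    [For readers following along, a sketch of the kind of signature being discussed; this is an editorial illustration with made-up names, not part of the original comment:]

      void add(float * __restrict dst, const float * __restrict src, int n) {
          // __restrict is the programmer's guarantee that dst and src never overlap,
          // so the compiler can vectorize without emitting runtime aliasing checks.
          for (int i = 0; i < n; ++i)
              dst[i] += src[i];
      }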

  9. Jim Hogg says:

    @ysitu:

    Yes, I agree that forced parallelization is even more dangerous than forced vectorization.  And yes, the PPL library happily supports forced parallelization via the parallel_for and parallel_for_each constructs.  Given that as our start position, I understand the logic that says "support forced vectorization" too.  (But it's rather like saying: "you already handed us a rifle, so give us a gun too" 🙂)  I favor the alternative route: provide deeper analysis for parallel_for/parallel_for_each that detects and warns against unsafe use.  (That's my personal opinion – not an official plan-of-record.)  I'm also optimistic that auto-vectorization can hit more cases than currently.  There's an interesting paper from UIUC+IBM that examines the issue, at polaris.cs.uiuc.edu/…/pact11.pdf

    Code 1100 – (a ? b : c) – yep, we only vectorize those patterns that equate to calculating min or max.

    Code 1104: Yes, you got it.  Four consecutive values of the scalar t are loaded into an XMM register within the loop body.  If used beyond the loop, however, we don't track and extract the "last" value for scalar t from the XMM register that holds it.

  10. Chris says:

    What are the rules for function calls in the loop? For example a[i] = f(a);. Does f have to be an intrinsic function like the ones in the math library or are non-intrinsic functions also allowed?

    What are the rules for the array types? For example let the arrays be: unsigned char a[1000], b[1000];. And the operation: a[i] += b[i];. Would this perform 16 vector additions at a time since the vector registers are 128 bits wide?

  11. Jim Hogg says:

    @Chris:

    Only intrinsic functions are allowed.  This makes sense, if we think about how it would work … sure, we might spawn 4 threads to run 4 invocations of the function, f, in parallel.  But there's no mechanism that lets us execute 4 invocations of any user-written function, f, at the same time on vector registers.

    In terms of element size – so long as the calculations are simple, the auto-vectorizer will work on any size element from char thru double.  But other issues can complicate the picture – for example, integer arithmetic that requires widening to 32 bits, etc.
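    [An editorial illustration of the "intrinsic functions" point, not part of the original comment; it assumes sqrtf is among the math functions the vectorizer recognizes, and the names are made up:]

      #include <math.h>

      void sketch_calls(float *a, const float *b, int n) {
          for (int i = 0; i < n; ++i)
              a[i] = sqrtf(b[i]);   // sqrtf maps onto the compiler's vector math support;
                                    // an arbitrary user-written function in its place would not
      }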

  12. Jim Hogg says:

    @ysitu:

    Just a follow-up on one detail of my previous answer – specifically, my last sentence "If used beyond the loop, however . . . ".  As your comment points out, the "t = 0" statement, immediately following the loop body, kills that instance of t.  Compiler dataflow analysis catches this fact, and auto-vectorization goes ahead.

    In passing, many readers may be puzzled by some of the more detailed questions/answers going on in this blog, wondering "what on earth are they talking about?"  If interested, check out the classic text "Advanced Compiler Design & Implementation" by Muchnick.  It provides details on compiler optimizations.

  13. Chris says:

    I've been experimenting with the auto-vectorizer.

    For some reason it works when the loop counter is short, but not unsigned short. Why doesn't it accept unsigned short?

    loop not vectorized due to reason '1200'

  14. Jim Hogg says:

    @Chris:

    This is a restriction in the first release.  See the previous post, in the first section, that explains the restrictions:

    blogs.msdn.com/…/auto-vectorizer-in-visual-studio-11-rules-for-loop-body.aspx

    The workaround is to stick with int or size_t.
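    [In other words, the workaround looks something like this editorial sketch; the array name and contents are made up:]

      unsigned char buf[1000];

      void fill() {
          // for (unsigned short i = 0; i < 1000; ++i)   // reported as not vectorized
          for (int i = 0; i < 1000; ++i)                 // workaround: use int (or size_t)
              buf[i] = 7;
      }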

  15. Chris says:

    @Jim:

    Alright.

    I've tested the auto-vectorizer in some scenarios now and I must say it's very good. In some cases I actually get the maximum theoretical speedup or more (for example with floats I even exceeded 4x a little bit).

    Looking forward to new posts on the vectorizer. 🙂

  16. GregM says:

    Are there plans to replace the codes with the text version so that we don't need to have this blog post handy?

  17. Bogdan Lytvynovskyi says:

    AFAIK the precision of floating point operations is different in SIMD and non-SIMD instructions. Is it somehow handled by auto-vectorizer?

  18. Jim Hogg says:

    @Bogdan:

    Answer is a little complicated, as follows:

    In VS11, the default floating-point model uses XMM registers.  (Previously, the default was x87.)  So a scalar calculation, such as (1.2f + 3.4f), will use the low 32 bits of some XMM register.  An analogous vector (SIMD) calculation would use all 128 bits of some XMM register to perform 4 additions in parallel.  So results for instructions such as +, -, * and so on, built into the chip, will be the same, whether the calculation is scalar or vector.

    However, if you are calculating math functions, such as sin or cos, we have a scalar library, and a vector (SIMD) library.  Results from either library are close, but not identical.  This issue, by coincidence, was raised a week or two back, within the team.  Currently under investigation.

  19. Gary says:

    Does alignment affect the vectorizer? In x86 apps the default is 8 byte alignment, but from what I've read, SSE2 works better with 16 byte alignment.  Should we be changing our project settings to 16 bytes even for x86 apps?

  20. Jim Hogg says:

    @Gary:

    Sorry for the late response – I was out on vacation last week.  Yes, alignment matters, some.  If, at compile time, we 'know' an array, at runtime, will be 16-byte aligned, then we will generate a vector instruction, such as MOVAPS, that itself assumes the source address is so-aligned.  If the compiler cannot make that determination, then it will emit an instruction, such as MOVUPS, that works whether the source address is 16-byte aligned or not.  (There's a small perf hit, compared with a raw MOVAPS.)

    The default alignment of an array is that of the element type.  So a static array of floats would only be guaranteed to be 4-byte aligned.  Different if allocated on the heap (8 bytes on x86, I think?).  Maybe different if allocated as a local variable on the stack.

    Is it worth going thru your code and sprinkling __declspec(align(#))?  I am doubtful.  First, your code might involve a loop that steps thru your float array starting at offset 1, 2, 3, etc., which invalidates the intended alignment.  Second, on Nehalem architectures forward, the performance hit for unaligned vector instructions is quite small.  Third, the compiler may fail to track alignment and therefore emit unaligned vector instructions anyway (improving this is on the todo list).

    But I'd be interested in any results if you did choose to experiment with this.

    Thanks,

    Jim
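    [For readers who want to experiment with the alignment discussed above, an editorial sketch of the declaration in question; the names and sizes are made up, and 16 matches the width of an XMM register:]

      __declspec(align(16)) float data[1024];   // request 16-byte alignment for the whole array

      void scale_all() {
          for (int i = 0; i < 1024; ++i)
              data[i] *= 2.0f;   // with alignment known at compile time, the compiler can use
                                 // the aligned vector load/store forms (for example, MOVAPS)
      }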

  21. Gary says:

    Thanks.  I'm afraid I have had very unsuccessful results when using auto-vectorization and auto-parallelization with a large (1 million+ lines of code) project.  I'm curious how many instances in, for example, Office or Windows you got.  My project had only 2 instances of vectorization and no instances of parallelization.

  22. Gary says:

    Is it possible that using /Zc:forScope- could be the reason why it's failing so badly to find any candidates?  I noticed that note about the loop counter being local (but it also says "the function" so I thought it would be ok to use forScope- )

  23. Jim Hogg says:

    @Gary:

    I meant for the "local" in message 501 to be local-to-that-function, in contrast to global.  Whether the induction variable is scoped to the for loop, or escapes afterwards, should not impact whether the loop can be vectorized.  

    On the more general point – making the auto-vectorizer hit more loops than it currently does: yes, part of our ongoing work is to make the vectorizer recognize more loop patterns – in effect, to reduce the number of reason codes we emit.  So some of them will simply 'die' as we make the auto-vectorizer smarter.
