If you’ve not read previous posts in this series about auto-vectorization, you may want to begin at the beginning.
This post explains “aliasing”. The topic came up as an aside in a previous post. Now it’s time to explain what it means, why it complicates the life of the auto-vectorizer, and limits the speedup the auto-vectorizer can achieve to something less than the 4X theoretical maximum (for calculations on 32-bit ints or floats).
This post is a short digression from the main path of auto-vectorization. You can choose to skip it, without losing much. Or you can choose to follow NitPick, our seeker-after-truth, and dive in. As usual, we will proceed by example.
Aliasing hits the Auto-Vectorizer
Consider the following function that adds one array into another:
The loop is of the kind we call “embarrassingly parallel”. There is no interference (the technical term is “dependence”), between one iteration and any other. So, we can pretty much execute this loop in any crazy order we choose, and still get the right answer. For example:
- We could run this loop as-written, on a single thread
- We could run this loop backwards, on a single thread. Ie, for (int j = N-1; j >= 0; --j) a[j] += b[j]
- We could run this loop in random order: throw a random number, r say, between 0 and N-1 inclusive, and execute the corresponding iteration, a[r] += b[r] (no duplicates allowed, of course!)
- If we had deep pockets and bought a computer with lots of cores, we could perform each iteration on its own core: so core #0 would calculate: a += b, core #6 would calculate: a += b, and so on. And all of these calculations would proceed in parallel, independent and unaware of each others’ existences.
- We could vectorize this code, executing 4 iterations each time, using this pseudo-code: for (int n = 0; n < N; n += 4) a[n:n+3] += b[n:n+3]
[NitPick asks: what if N in Approach 5, is not evenly divisible by 4? Answer: well first, you probably don’t mean “evenly” divisible, because that would imply, to me, divisible by 4 an even-number-of-times, rather than an odd-number-of-times. You probably meant to say “exactly divisible by 4”. (Has NitPick just been nit-picked?) Yes, in that case, the answer would be wrong. We leave it as an exercise for the reader to add a scalar loop that finishes off the 1, 2 or 3 items left over in the calculation. Let’s just assume, for the moment, that N is indeed exactly divisible by 4, and move on with our lives]
In all of the above approaches, 2 thru 5, we would get the correct result – defined as the result of option 1. Or would we?
The answer is yes, but only if arrays a and b do not overlap. What do I mean? Consider the following example, where the main program calls AddInto, supplying two, overlapping sub-arrays of A as arguments a and b:
This is, of course, an unusual way to use AddInto, but perfectly legal C++ code. A few minutes with pencil and paper should convince you that the correct answer is A = [0,1,3,6,10,5,6,7] when counting up (option 1). A few more minutes should convince you that counting down (option 2) gives A = [0,1,3,5,7,5,6,7] which is wrong! Yet a few more minutes should convince you that the answer produced from vectorized code (option 5) is A = [0,1,3,5,7,5,6,7] and is also wrong! With option 3, you might get lucky and hit the right answer, depending on the sequence of random numbers thrown; but in general, it’s likely to get the wrong answer. Option 4? – likely wrong too.
Here is a picture of what happens for the vectorized case, expressed in pseudo-code (note the difference between A, the array in main – and a, the array in AddInto):
(I said “a few more minutes with pencil and paper” is enough to work out the result of the count-down and the vectorized loops. In fact, I find such paper calculations quite tricky – a tiny lapse in concentration and it all goes wrong. So I wrote a small program that emulates 4-at-a-time vectorized loops. This has saved me from making several silly mistakes in the examples for this blog)
We can expect similar wrong results, unless we check that the arrays passed to a function do not overlap. The auto-vectorizer takes these same precautions: if it can “see” the whole program and work out that the arrays do not overlap, it can go ahead and try to vectorize loops. But if it is compiling a single file and “sees” only the function’s definition, then it will inject extra runtime checks that the arrays do not overlap, before generating vector code. If the checks pass at runtime, it executes the vectorized code. If not, it falls back to executing scalar code.
[NitPick asks: surely no sane person would call a function, such as AddInto, with arrays that overlap? Answer: probably not, but a compiler “must never equate the unlikely with the impossible” – neat phrase – I can’t remember if it’s mine, or I read it somewhere. It must NEVER produce wrong answers. In contrast, whilst such wacko code is legal in C++, it is illegal in some other languages, such as FORTRAN: if you alias data, then the compiler is allowed to give you “wrong” results. This is part of the reason why FORTRAN compilers can sometimes achieve better performance than C++ compilers]
So, with this example, we see an illustration of the problem called “aliasing” – where the same memory location, or locations, can be accessed via two different names. In this case, inside AddInto, a == b, a == b, and so on.
[NitPick again: so do these extra checks for overlap reduce the speedup below the maximum possible 4X mentioned in the opening to this post? Answer: yes, in part. Another part relates to the overhead of incrementing the induction variable, and checking for whether we have finished the loop yet. This latter factor can be reduced by “unrolling” the vectorized loop, also yielding further opportunities for optimization by increasing the basic-block size. Other factors that can severely limit performance are poor use of the hardware caches. We will see this in a later post that examines vectorization of matrix-multiply]
That covers the essence of “aliasing” as it applies to auto-vectorization. However, as you may suspect, aliasing impacts optimization of even scalar code . . .
Aliasing hits Everything
Aliasing – where a program can access the same memory location via two different names – causes headaches for any compiler trying to optimize C++ code. Here is a tiny example:
This function updates the memory locations pointed-to by p, and by q. It then returns *p. So, it looks like this function will always return the answer 1. If we pretend to be a compiler, it might be tempting to optimize this code into:
This is surely valid, right? . . . . . . . . . . . wrong!
Because someone might call the function like this:
and legitimately expect x to end up holding 2, not 1.
This example is trivial, but hopefully provides enough insight that you can extrapolate to real-world examples, and appreciate just how much alias analysis impedes compiler optimizations. The example involves simple pointers to scalar variables. Life becomes more complex with pointers-to-arrays, and pointers-to-pointers.
[NitPick asks: why can C++ not provide a mechanism for the user to promise there is no aliasing going on? Answer: it does. In ISO-C99, there is the restrict keyword; whilst Visual C++ provides __restrict and declspec(restrict) and a somewhat related declspec(noalias). On a wider scale, it also supports the /Oa and /Ow switches. However, these constructs conceal a bunch of subtleties. They are inherently dangerous – like juggling with sharp knives. If another developer does not understand and respect the promise made by their use, he can get answers that are tragically wrong, without even a whimper from the compiler]
Note that Weirong explained aliasing in the context of C++ AMP, in this blog, back in April. You may want to compare the different approaches followed by C++ AMP (a library) and the auto-vectorizer (a compiler).
Aliasing – Summary
Aliasing is the term used where the same memory locations can be accessed by different names. Writing programs that include aliasing is legal in C++. The auto-vectorizer, like other optimizations within the compiler, is painstaking in its diligence to always generate correct code, even in the presence of this annoying language feature!