MSVC code optimizer improvements in Visual Studio 2017 versions 15.5 and 15.3

In this post, we’d like to give you an update on the significant progress the Visual C++ code optimizer made in the past year, focusing mostly on the features released in versions 15.3 and 15.5. Compared to VS2015 Update 3, VS2017 15.5 provides on average an 8.9% increase in runtime speed in the SPEC 2017 benchmark (for detailed numbers see slide 12 from the CppCon presentation or the CppCon session video).

The following sections of this post go through some of the new optimizations and improvements made available with the latest version, hopefully giving an interesting overview of the internal workings of a modern native compiler backend.

General improvements to the SSA Optimizer

The SSA Optimizer is a new framework introduced last year in Visual Studio 2015 Update 3 that operates on Static Single Assignment form. As anticipated, it allowed us to make significant progress in a short time, and a large part of the optimizations described here are implemented inside it. There are several general improvements available in the latest compiler release:

  • The SSA Optimizer now runs twice, before and after the loop optimizer. This allows it to take advantage of new opportunities exposed by loop optimizations and other second-order effects.
  • Expressions using address-taken variables and indirect memory loads are better handled by using an aliased SSA form and value-numbering of loads (identifies memory locations with the same value).
  • An extended collection of patterns that simplify code further and help reduce code size.
Common Sub-Expression Elimination and Partial Redundancy Elimination overhaul

Common Sub-expression Elimination (CSE) is an optimization that removes redundant code by identifying identical expressions and keeping one instance, replacing the others with the precomputed value. It is one of the fundamental optimizations and usually helps both in improving execution speed and reducing code size. The new approach in the SSA Optimizer is based on Global Value-Numbering, with a strong focus on eliminating redundant indirect memory loads, which can be quite expensive, especially when the data is not found in the CPU cache anymore. The following example demonstrates how the source of a load can now be another load, a store or a memset/memcpy to the same memory location. The temporary introduced by CSE is initialized with the value that would have been loaded on each path and can now be enregistered:

Before:

if (condition1) {
    x = *p;
    use(x);
}
else if (condition2) {
    *p = 1;
}
else {
    memset(p, 0, 100);
}

y = *p;
use(y);

After load CSE:

if (condition1) {
    x = *p;
    use(x);
    temp = x;
}
else if (condition2) {
    *p = 1;
    temp = 1;
}
else {
    memset(p, 0, 100);
    temp = 0;
}

y = temp;
use(y);

A special form of load CSE is applied to ternary operators and SSA Phi instructions, as in this example:

Before:

x = *p; use(x);
y = *q; use(y);
a = condition ? p : q;
b = *a;

After CSE:

x = *p; use(x);
y = *q; use(y);
b = condition ? x : y;

After failing to find an available source for *a, loads/stores of the selected values p, q are searched instead, replacing *a with condition ? x : y. A practical example of such a situation is code using std::min/max, as reported here.

Partial Redundancy Elimination (PRE) is a new addition, handling expressions that are redundant only on some paths through a function by inserting the expression on the paths it is missing, making it completely redundant. A simple example of PRE:

Before:

if (condition1) {
    x = a * b;
    use(x);
}

y = a * b;
use(y);

After PRE:

if (condition1) {
    x = a * b;
    use(x);
    temp = x;
}
else {
    temp = a * b;
}

y = temp;
use(y);

After code hoisting:

temp = a * b;

if (condition1) {
    x = temp;
    use(x);
}

y = temp;
use(y);

A more complex example of PRE can be found in a hot function from the SPEC2017 Imagick benchmark. In this case, there are 5 redundant loads and 4 redundant float multiplications that are eliminated, and since images are usually in RGB(A) format, most eliminated expressions were always executed.

Before:

if ((channel & RedChannel) != 0)
    pixel.red+=(*k)*alpha*GetPixelRed(p);
if ((channel & GreenChannel) != 0)
    pixel.green+=(*k)*alpha*GetPixelGreen(p);
if ((channel & BlueChannel) != 0)
    pixel.blue+=(*k)*alpha*GetPixelBlue(p);
if ((channel & OpacityChannel) != 0)
    pixel.opacity+=(*k)*GetPixelOpacity(p);
if (((channel & IndexChannel) != 0) &&
    (image->colorspace == CMYKColorspace))
    pixel.index+=(*k)*alpha*GetPixelIndex(…);
gamma+=(*k)*alpha;

After PRE:

temp1 = *k;
temp2 = temp1 * alpha;

if ((channel & RedChannel) != 0)
    pixel.red+=temp2*GetPixelRed(p);
if ((channel & GreenChannel) != 0)
    pixel.green+=temp2*GetPixelGreen(p);
if ((channel & BlueChannel) != 0)
    pixel.blue+=temp2*GetPixelBlue(p);
if ((channel & OpacityChannel) != 0)
    pixel.opacity+=temp1*GetPixelOpacity(p);
if (((channel & IndexChannel) != 0) &&
    (image->colorspace == CMYKColorspace))
    pixel.index+=temp2*GetPixelIndex(…);
gamma+=temp2;

Inliner improvements

Inlining is one of the most important optimizations, not only eliminating the overhead of function calls, but more importantly, adapting the inlined code to the context of the function it is inlined into, providing more precise information about parameters that allows better optimizations. A significant part of the performance uplift between VS 2015 Update 3 and VS2017 15.5 is due to several improvements to the inliner that make it more aggressive, with a more accurate heuristic for estimating profitability. Some of the changes include more inlining inside nested loops, always inlining of internal/static functions called once and using more contextual information about the actual values of the parameters after inlining.

Very small functions are now always inlined, as long as this doesn’t create an unreasonably large function. A similar improvement was also done for profile-guided optimizations, where very small functions and functions that only forward to other functions are more likely to be inlined, since in general this reduces code size, the inlined code being smaller than the call sequence. The inliner is now also able to handle inlining of functions that return by-value C++ objects that may throw an exception.

New CFG optimization module

The initial release of the SSA Optimizer was targeted mainly at expression and peephole optimizations. Now, besides the new CSE/PRE module, it also includes a module for performing Control-Flow Graph (CFG) optimizations in SSA form. This is split into two parts: one performs the actual optimizations, the other handles cleanup, such as removing useless branches/jumps and unreachable code in a function.

The first implemented optimization is early hoisting and sinking of similar expressions. The algorithm used here is more aggressive than the one in the late compilation stages, relying on Value Numbering and being able to extract instructions even when there is a mismatch at the start/end of the basic block. For example, the similar instructions could be in the middle of the basic block, and the sequence of extracted instructions doesn’t have to be contiguous. This way it can find multiple independent expressions and hoist/sink them. Besides reducing code size, early hoisting/sinking can expose other optimization opportunities, such as replacing a branch by a conditional move expression (CMOV), as shown in the following example:

Before:

if (condition) {
    *p = x;
}
else {
    *p = x + 1;
}

After sinking store:

if (condition) {
    temp = x;
}
else {
    temp = x + 1;
}

*p = temp;

After building CMOV:

temp = condition ? x : x + 1;
*p = temp;

Many more CFG optimizations are planned to be implemented in the new module – there are already three new optimizations in the testing phase that will be released in a future version of the compiler.

Improvements for float optimizations under -fp:fast

There is a significant improvement for the optimizations done under the -fp:fast floating point model in the SSA Optimizer, extending the existing arithmetic simplifications and adding support for handling common functions from the standard library:

  • POW strength reduction, replacing a call to POW by a series of multiplications when the exponent is an exact value such as for pow(x, 16.0). In a micro-benchmark, calling the pow function is 31x slower than the 4 multiplications needed to compute the same value. The replacement expression is generated in a minimal form – for example pow(a, 8.0) is replaced by 3 multiplications computing [(a^2)^2]^2. There are four cases handled: pow(a, N.0), pow(a, N.5), pow(a, -N.0) and pow(a, -N.5).
  • A large collection of simplifications based on identities of the transcendental functions. A few examples:
sqrt(a) * sqrt(b) -> sqrt(a * b)
pow(a, x) * pow(a, y) -> pow(a, x + y)
pow(a, x) * a -> pow(a, x + 1)
exp(a) * exp(b) -> exp(a + b)
sin(a) / cos(a) -> tan(a)
  • Combining calls of sin(x) and cos(x) into a single call to the math library, computing both values in the same amount of time. This is available on x86 and x64, where SSE2 code generation is enabled by default.
  • More arithmetic simplifications focused on eliminating division/multiplication, and improved detection of MIN/MAX/ABS operations from branches plus new identities. A few examples:
a / (1 / b) -> a * b
a / b / c / d -> a / (b * c * d)
abs(a known positive) -> a
max(min(a, b), a) -> a

We strongly encourage using the -fp:fast flag for best performance, unless precision up to the last bit is required. In several benchmark suites there are significant performance wins from optimizing float expressions in a way similar to integers, and from the special handling of common patterns like those exemplified above.

Removing more unnecessary instructions

The SSA Optimizer includes a Bit Estimator component that is able to determine which bits of a value are known to be always one/zero, among other facts (for examples see the previous blog post). This is now augmented with a sophisticated analysis that estimates the bits of a value that are affected by an operation and the bits that are actually required, allowing the removal of unnecessary instructions that don’t affect the final result of an expression. Some examples:

Before:

x = a | 3;  // Sets lowest 2 bits, useless.
y = x >> 4; // Lowest 4 bits not required, shifted out.

After:

y = a >> 4;

Before:

x = a & 0x00FFFFFF; // Clears highest 8 bits, useless.
y = x | 0xFFFF0000; // Highest 16 bits not required, always set.

After:

y = a | 0xFFFF0000;

Such cases appear often in practice; some of the most interesting examples were found in the Windows kernel/drivers. Removing such unnecessary instructions was also one of the most frequent types of optimization opportunities exposed by the Souper superoptimizer.

Loop unrolling improvements

Loop unrolling is an optimization that exposes more instruction-level parallelism by duplicating the loop body multiple times and reducing (or completely eliminating) the overhead of the iteration counter. The complete unrolling of loops in Visual C++ sees a big improvement, now being much less conservative with the unrolling amount thanks to a better heuristic for estimating the benefit and an improved way of computing the constant number of iterations (trip count) of the loop. Complete loop unrolling often allows more subsequent optimization of expressions and store-load forwarding (replacing a load by the value that was stored previously at the same memory location), like in the example below, where the index variable is replaced by a constant, allowing expressions to be constant-folded later:

Before:

for (int i = 0; i < 4; i++) {
  p[i] = i*4 + 2;
}

After loop unrolling:

i = 0;
p[i] = i*4 + 2;
i++;
p[i] = i*4 + 2;
i++;
p[i] = i*4 + 2;
i++;
p[i] = i*4 + 2;

After subsequent optimizations:

p[0] = 2;
p[1] = 6;
p[2] = 10;
p[3] = 14;

Loops that are too large to unroll completely are partially unrolled and still give a performance benefit without bloating code size. Several SPEC2017 benchmarks benefit from the improved loop unrolling, up to a 5% performance win.

Loop if-unswitching improvements

Loop if-unswitching is an optimization that removes a branch from a loop by creating two versions of the loop, each with the code from one side of the branch, and the original branch selecting instead between the two loops. This can be done when the branch condition doesn’t change inside the loop (loop invariant) and it benefits modern CPUs by creating shorter loops, without control flow that can pollute the branch prediction tables. Visual C++ had a simpler version of if-unswitching, which is now improved to handle more general situations, like in the example below, where there is extra code before/after the branch.

Before:

for (int i = 0; i < n; i++) {
    // Code before branch.
    if (invariant_condition) {
        // “then” code.
    }
    else {
        // “else” code.
    }
    // Code after branch.
}

After if-unswitching:

if (invariant_condition) {
    for (int i = 0; i < n; i++) {
        // Code before branch.
        // “then” code.
        // Code after branch.
    }
}
else {
    for (int i = 0; i < n; i++) {
        // Code before branch.
        // “else” code.
        // Code after branch.
    }
}

Sinking of loads near uses

This is an optimization also known as partial dead-code elimination. Its purpose is to move expensive expressions closer to where they are actually used, in the hope that they are never executed if pushed under an if condition or if the function exits earlier. Another handled case is an expression assigned to a variable that is redefined later on some paths, like in the second example below. Currently this is limited to sinking loads; future versions of the compiler will extend it to more general expressions.

Before:

x = *p;

if (condition) {
    return -1;
}

use(x);

After sinking load:

if (condition) {
    return -1;
}

x = *p; // Delay load *p.
use(x);

Before:

x = *p;

if (condition) {
    x = *q;
}

use(x);

After sinking load:

if (condition) {
    x = *q;
}
else {
    x = *p; // Avoid load *p on *q path.
}

use(x);

Vectorizer improvements

More loops, with or without branches, are now vectorized thanks to an improved heuristic for estimating the benefit of vectorization and having more accurate alias information for pointers. The vectorization of code searching the min/max value in an array now also supports the case where the index of the selected value is required, like in the following example:

for (i = 0; i < N; i++) {
    if (values[i] > max_value) {
        max_value = values[i];
        max_value_index = i;     
    }
}
use(max_value, max_value_index);
Improved CMOV generation and handling of std::min/max

The generation of conditional move instructions (CMOV) from branches is improved, especially for float values, which helps in cases where branches are hard to predict. Below is an example from a Geekbench 4 benchmark:

offset = lo + delta;
if (curve[offset] > log_exposure) {
    hi = hi - delta;
} else {
    lo = lo + delta;
}
x64 before:

comiss   xmm0, xmm4
jbe      SHORT $LN4@log_exposu
sub      ecx, r8d
jmp      SHORT $LN5@log_exposu
$LN4@log_exposu:
mov      edx, r9d
$LN5@log_exposu:

x64 now:

sub     eax, ecx
comiss  xmm3, xmm2
cmovbe  eax, r9d
cmovbe  edx, r8d

std::min/max were previously somewhat problematic for the optimizer because they take the values by reference, turning a direct access of a local variable into an indirect access through a pointer. The improvements that eliminate these indirect accesses for integers now also apply to float types. For example, the clamp operation now has optimal code generation:

float clamp(float n, float lower, float upper) {
  return std::max(lower, std::min(n, upper));
}
x64 before:

n$ = 8
upper$ = 24
clamp
comiss   xmm0, xmm2
lea      rax, QWORD PTR upper$[rsp]
lea      rcx, QWORD PTR n$[rsp]
movss    DWORD PTR [rsp+24], xmm2
movss    DWORD PTR [rsp+8], xmm0
cmovbe   rax, rcx
movss    xmm0, DWORD PTR [rax]
comiss   xmm1, xmm0
jb       SHORT $LN10@clipf
movaps   xmm0, xmm1
$LN10@clipf:
ret      0

x64 now:

clamp
minss   xmm0, xmm2
maxss   xmm0, xmm1
ret 0

For integer values:

clamp_int
cmp     r8d, ecx
cmovl   ecx, r8d
cmp     edx, ecx
cmovl   edx, ecx
mov     eax, edx
ret 0
In closing

We are excited to finally release all these new and improved optimizations in the compiler backend and help make your programs faster. Expect to see many more additions in future versions – we are continuously working hard to implement new optimizations, improve existing ones or replace some of the older ones with newer, better approaches, such as the work done in the SSA Optimizer.

Please let us know if you have any feedback or suggestions about cases that could be optimized better. We can be reached via the comments below, via email (visualcpp@microsoft.com) and you can provide feedback and report bugs via Help > Report A Problem in the product, or via Developer Community.