We are excited to announce the preview release of a new, advanced code optimizer for the Visual C++ compiler backend. It provides many improvements for both code size and performance, bringing the optimizer to a new standard of quality expected from a modern native compiler.
This is the first public release and we are encouraging people to try it and provide suggestions and feedback about potential bugs. The official release of the new optimizer is expected in Visual Studio Update 3, while the release available today is unsupported and mostly for testing purposes.
How to try it out
The compiler bits with the new optimizer are very easy to get: just install the latest VisualCppTools package using NuGet. Details about how to do this are available in this blog post. Once installed, compile your applications the usual way – the optimizer is enabled by default on all architectures.
Update 06/10/2016: The new optimizer is now also available as part of Visual Studio Update 3 RC.
Reporting bugs and suggestions
We are hoping to get as much feedback as possible about bugs you have found or suggestions you may have. If you believe you found a bug, you can confirm it’s caused by the new optimizer by using the following undocumented flag to disable it: -d2SSAOptimizer-
- In the Visual Studio IDE, add the flag to the project Property Pages -> C/C++ -> Command Line -> Additional Options text box
- If you compile from command line using cl.exe, add the flag before any /link options
If the bug does not manifest anymore with -d2SSAOptimizer-, please follow the steps below:
- Submit a bug report using the Connect website
- Prefix the title with [SSA Optimizer]
- Attach details such as the compiler version, compile flags, and the source code that reproduces the bug, in the form of pre-processed files or a linkrepro. Bruce Dawson’s blog has a great post about producing high-quality bug reports
- You can also send an email directly to gratilup@microsoft.com
Why a new optimizer?
The main motivation for a new optimizer framework was the desire to have more aggressive optimizations, such as ones that take advantage of more compile-time information and modern compiler developments. The design of some of the older optimization passes made it difficult to implement more advanced transformations and to make improvements at a faster pace. As the new framework was intended to be the basis of many future optimization efforts, a core design objective was to make it easier to implement, test and measure new optimizations.
Some of the main goals of the project:
- Improving the code quality for both scalar and vector code
There are many cases where both performance and code size can be improved, sometimes quite substantially. The framework attempts to solve several deficiencies of the old optimizer:
- The old expression optimizer has a small set of known transformations and a limited view of the function – this prevents discovering all the expressions that could be optimized.
- Many small optimizations based on identifying patterns – known as peephole optimizations – are either missing or implemented only for certain target architectures.
- Vector code – either from intrinsics or generated by the auto-vectorizer – can be optimized better.
The new optimizer takes advantage of the Static Single Assignment (SSA) form, which allows handling more complex expressions that potentially span the entire function. Another advantage of the SSA form is that it makes it possible to write simpler and more efficient algorithms, eliminating the need for more complicated and slower techniques such as data-flow analysis.
Peephole optimizations can now be implemented in a target-independent way, using a pattern matching system that is very fast (based on template meta-programming) and that requires little code to be written. This allowed adding a large number of patterns in a fraction of the time previously needed to identify and implement them.
The same pattern matching mechanism can be used for vector operations, making it now possible to optimize expressions using both integer and float vector operations as easily as expressions with scalar operations. Note that this feature is not yet complete or enabled.
- Designing a framework that allows easy development, with less potential for mistakes
Being able to quickly prototype ideas and move to a reliable implementation is one of the main advantages of the new framework. It includes various helpers for easier manipulation of the SSA form, pattern matching of expressions, building new expressions and doing safety checks in the presence of pointer aliasing and exception handling.
- Performing better static analysis of the code
The new optimizer also adds new static analysis modules, including those that can identify when a value is Boolean (exactly either 0 or 1), when a value is always positive, and when a value cannot be zero. It also has a powerful module that can estimate known one/zero bits of a value, and the ranges a value could fall in. The results are either used as preconditions for certain optimizations, to eliminate some useless operations completely or to transform operations into a form that can be optimized better.
- Strong emphasis on testing and correctness
Given the large scope of the project, ensuring and maintaining correctness was a top priority. This was achieved by using formal verification, testing with randomly-generated programs (fuzz testing) and popular programs and libraries such as Chrome, Firefox, CoreCLR and Chakra. See the Testing approach section below for more details.
Examples of implemented optimizations
The following is an example that illustrates just a few of the many new transformations the new optimizer implements. This sort of code is often found in codecs:
int test(int a) { return a % 2 != 0 ? 4 : 2; }
x64 assembly with old optimizer:

?test@@YAHH@Z PROC
        and    ecx, -2147483647
        jge    SHORT $LN3@test
        dec    ecx
        or     ecx, -2
        inc    ecx
$LN3@test:
        test   ecx, ecx
        mov    eax, 2
        mov    edx, 4
        cmovne eax, edx
        ret    0

x64 assembly with new optimizer:

?test@@YAHH@Z PROC
        and    ecx, 1
        lea    eax, DWORD PTR [rcx*2+2]
        ret    0
The execution time with the old optimizer is approximately 5 cycles in the best case (this assumes out-of-order execution and perfect branch prediction) and at least 10 cycles in the worst case. With the new optimizer, execution time is always 2 cycles. Obviously, there are also important savings in code size.
Very interesting results can be achieved by combining multiple smaller transformations. In this case, there are two patterns applied to produce the final result:
- a % 2 == 0 -> (a & 1) == 0
Since the remainder is compared to zero, the sign of a does not affect the compare result and the remainder can be replaced by AND.
- a<bool> ? C1 : C2 -> C2 + a*(C1-C2)
A ternary question operation selecting between two constants. The first requirement is that the condition value is Boolean, which the static analysis package can determine. The second is that C1-C2 is a power of two, so that a shift or LEA is generated instead of a multiplication.
Let’s see a few more examples of interesting optimizations and patterns that are implemented. Focus was put especially on operations that were previously not optimized very well, such as comparisons, conversions, divisions, question and control-flow dependent expressions (PHI operations in SSA form). Although some examples might seem unlikely to be written like that in the source code, they do appear quite often after inlining and other transformations.
- Improved optimization of arithmetic expressions, including scalar float operations
The SSA form exposes larger expressions, which can span the entire function – this allows discovering more optimization opportunities, especially when combined with expression reassociation. There are also dozens of new patterns added, such as the following ones:
(a / C1) / C2 -> a / (C1 * C2)
(a * C1) / C2 -> a * (C1 / C2) // when C1 is a multiple of C2
a / (x ? C1 : C2) -> a >> (x ? log2(C1) : log2(C2)) // C1 and C2 must be power of two constants
Most new float optimizations are enabled only under -fp:fast, but some of them are valid under the default -fp:precise.
More information about the optimizations allowed under different floating point models is available in the documentation:
Microsoft Visual C++ Floating-Point Optimization
- Optimizing control-flow dependent expressions
I mentioned above that the SSA format simplifies handling larger, more complex expressions. One advantage is that it makes it easier to reason about variables that are either redefined, or defined with different values based on the path taken in the function. As its name implies, SSA solves this by creating a different version of the variable each time it is redefined; if there are points in the function where a variable has more than one possible value, a pseudo-operation known as PHI is inserted, merging all values.
Although building the SSA format is quite complicated, the example below should be simple enough to get a good intuition about SSA and the role of the PHI operations:
Original code:

int test(int a, int b) {
    int x, y, z;
    if (a > 3) {
        x = 4; y = 1;
        z = b & 0xFF00;
    } else {
        x = 9; y = 2;
        z = b << 8;
    }
    int p = (x * y) * 4;
    int q = z & 0xF;
    return p >= 16 && q == 0;
}

After SSA conversion:

int test(int a1, int b1) {
    int x0, y0, z0; // undefined
    if (a1 > 3) {
        x1 = 4; y1 = 1;
        z1 = b1 & 0xFF00;
    } else {
        x2 = 9; y2 = 2;
        z2 = b1 << 8;
    }
    x3 = PHI(x1, x2)
    y3 = PHI(y1, y2)
    z3 = PHI(z1, z2)
    int p1 = (x3 * y3) * 4;
    int q1 = z3 & 0xF;
    return p1 >= 16 && q1 == 0;
}
As can be seen on the right side, each variable is renamed to multiple versions (indicated by the number suffix). After the if-then-else statement, all three variables can have two different values, depending on the runtime result of a > 3, making it necessary to insert PHI operations.
The new optimizer is able to take advantage of the PHI operations and turn the entire function into the equivalent of return 1, all other code being removed by Dead Code Elimination. That’s 1 instruction compared to the 18 that were generated before on x64.
For p1 >= 16 it computes every possible value and compares it with 16, which is the minimum possible value. For q1 == 0 it checks if the low bits are known to be zero in both z1 and z2.
The old expression optimizer is not able to reason about the larger expressions that involve these PHI operations – this causes it to miss many optimization opportunities, like the ones exemplified above. In the new optimizer, every operation and static analysis supports PHI. A few more examples:
(phi 3, 5) + 2 -> phi 5, 7                              // constant-fold by pushing operand inside a PHI
(phi b+3, b+5) - b -> phi 3, 5                          // eliminate operation by pushing operand inside a PHI
phi a+x, b+x -> (phi a, b) + x                          // extract a common operand from a PHI
(phi 1,2) + 3 < (phi 3,4) + 5 -> true                   // fold compare by testing all combinations
(phi 1,2) * (phi 2,3) > (phi 6,7) * (phi 2,3) -> false  // similar to above example
(phi 1,0) * 5 > (phi 1,2) -> undecidable                // 0 * 5 < (phi 1,2)
The following is an interesting case found in Mozilla Firefox. A Boolean expression, spanning an if-then-else statement, is used in a negated form if(!expr). The new algorithm that tries to cancel an inverted Boolean operation by inverting every subexpression did the following transformation, eliminating the inversion:
(phi 0, (x ? 1 : 0)) ^ 1 -> phi 1, (x ? 0 : 1)
- Better conditional move generation
Converting branches to CMOV produces more compact code that usually executes faster. The late CMOV-generation phase is augmented by generating question operations in the new optimizer. In doing so, already-existing transformations can be applied, simplifying things even further. In the following examples, the left-hand side is a newly-detected CMOV pattern, and the right-hand side is the code after a transformation is applied:
a < 0 ? 1 : 0 -> a >> 31                // logical shift
a < 0 ? 4 : 0 -> (a >> 31) & 4          // arithmetic shift
a<bool> != b<bool> ? 1 : 0 -> a ^ b     // a, b must be Boolean values
CMOV performance can sometimes be hard to estimate, especially on modern CPUs with good branch prediction. To help in cases where a branch would be faster, when profile information is available, the CMOV is not generated if the branch is highly predictable (heavily biased as either taken or not-taken).
- Improved optimization of compare operations
Comparisons are the operations with the most improvements. Since reducing the number of branches benefits both code size and performance, the focus was mainly on branch folding (eliminating a branch by proving that it is either taken or not-taken). Besides the usual tests for comparing constants, static analysis is used to estimate value ranges and known one/zero bits, making it possible to handle more complicated cases. Among the dozens of transformations that simplify comparisons, the following one is an example that reduces execution time substantially:
a / 12 == 15 -> a in range [180, 192) -> (a - 180) < 12 // unsigned compare
A division (20+ cycles) is replaced by a simple range check (2 cycles). Even when the “divide by constant” optimization is applied, it is still a few times slower than the range check.
- Bit Estimator
This is a powerful static analysis that can be used to extract more compile-time information about values. Some of the provided features:
- Estimating bits known to be one or zero
- Proving that a value is not zero
- Estimating the minimum and maximum value
- Estimating value ranges
- Improved overflow checks for addition and subtraction
Below is a simple example showing how the one/zero bits can be computed at compile time, even when nothing is known about the initial values (parameter a in the example below):
int test(unsigned char a) {
    short b = a;     // b: 00000000________, a: ________
    b <<= 4;         // b: 0000________0000
    b |= 3;          // b: 0000________0011
    return b != 0;   // -> return true
}
Some of the places where these features are currently used:
- Converting signed instructions to unsigned: produces smaller code for division/remainder with constant, allows folding constants into LEA instructions, etc.
- Folding comparisons and branches: comparisons are folded using both known bit and value range information. For example, given a == b, if a is known to have a bit set at a position where it is definitely not set in b, the two values cannot be equal. This can be applied to other conditions such as less-than by checking the sign bit. When using value ranges, every range of a is compared with every range of b.
- Improved overflow checks: optimizing a + C1 < C2 into a < C2 – C1 is not valid, since a + C1 might overflow, giving a different result. Using the known bits or value ranges, it can be proven that the addition does not overflow. In practice, this usually happens when a is a zero-extension from a smaller type.
- Discovering Boolean and positive values: used as pre-conditions for various optimizations, such as the ones applied on question operations. Another example is eliminating an ABS intrinsic if the value is already positive.
- Removing redundant AND/OR instructions, eliding useless conversions:
a % C -> 0   if C is a power of two and the low bits in a are zero (a is a multiple of C)
a & C -> 0   if all bits that are one in C are known to be zero in a
a | C -> a   if all bits that are one in C are known to be one in a
- Improved Common Subexpression Elimination
Common Subexpression Elimination is an optimization that eliminates redundant operations by replacing them with the result of previous ones that compute the same value – this happens much more often than one may expect. The existing algorithm is augmented with one based on Global Value Numbering, which increases the number of expressions that are found to be equivalent. Although this is a quite simple initial implementation that will be made more powerful, it shows significant improvements for both code size and performance.
Eliminating redundant operations before doing the expression optimization also exposes more opportunities.
For example, (a + b) - c -> a if b is found to be equivalent to c.
- Taking advantage of signed integer overflow being undefined
Historically, Visual C++ did not take advantage of the fact that the C and C++ standards consider the result of overflowing signed operations undefined. Other compilers are very aggressive in this regard, which motivated the decision to implement some patterns which take advantage of undefined integer overflow behavior. We implemented the ones we thought were safe and didn’t impose any unnecessary security risks in generated code.
A new undocumented compiler flag has been added to disable these optimizations, in case an application that is not standard-conformant fails: -d2UndefIntOverflow-. Due to security concerns, we have seen cases where these patterns should not be optimized, even though the C and C++ standards allow us to, by treating the potential addition overflow as undefined:
a + Constant > a  -> true   // Constant > 0
a + Constant <= a -> false  // Constant > 0
These two tests (and the similar ones with subtraction) are frequently used to check for overflow in places such as file readers and memory allocators. While the use is non-conformant with the standard and a well-known issue, enabling these transformations could potentially break the security of those applications.
Impact on code size
For most applications code size is reduced, but it can also increase due to interactions with other optimizations. For example, a smaller function is more likely to be inlined into multiple places, resulting in a size increase overall.
Below are some code size results from compiling several large applications on x64:
Application | Old optimizer (bytes) | New optimizer (bytes) | Reduction |
Windows | 1,112,545,269 | 1,112,096,059 | 438 KB |
SQL Server | 64,078,336 | 64,032,256 | 46 KB |
Chakra | 5,963,621 | 5,952,997 | 10 KB |
The following table lists the number of instructions, split by category, for the Windows Kernel built for x64 with link-time code generation and profile information. It can be seen that the number of more expensive instructions, such as branches, divisions and multiplications, is reduced. The increase in CMOV and SETcc is a result of more branches being converted to conditional code.
Instruction type | Old optimizer | New optimizer | Difference |
CONVERSION | 28075 | 27301 | -774 |
LEA | 87658 | 87395 | -263 |
SHIFT | 15266 | 15194 | -72 |
SETcc | 2222 | 2345 | +123 |
JUMP | 19797 | 19791 | -6 |
BRANCH | 143795 | 142591 | -1204 |
MUL | 2115 | 1990 | -125 |
DIV | 541 | 530 | -11 |
CMOV | 4192 | 5913 | +1721 |
Impact on compiler throughput
For all these improvements, compile time remains mostly the same, with about +/- 2% difference, depending on the application being compiled. For example, Google Chrome shows a compile time slowdown of 1.7%, while compiling the Windows Kernel shows a 2.6% speed-up. The speed-up can be explained by having less code go through the old, slower optimization passes.
Testing approach
Based on previous experience and the scope of the project, it was clear from the start that extensive testing needed to take a central role to ensure correctness. Several testing approaches were used, some to prevent mistakes in the first place, others to catch implementation problems:
- Preventing implementation bugs by formally verifying the patterns
Most patterns are quite simple, such as x & 0 -> 0. But there are also patterns that require validation that is not always obvious, leaving room for mistakes. The most common validation bugs are:
- Failing to check for input preconditions, such as requiring positive numbers, powers of two, numbers with the N top bits 0, etc
- Failing to differentiate between signed and unsigned operations. This is especially dangerous for instructions such as CMP, DIV/REM and SHR.
Alive, a formal verification tool by Nuno Lopes from Microsoft Research, was used to ensure the patterns and preconditions are correct before implementing them. It uses a language similar to LLVM IR and the Z3 theorem prover to verify whether an input pattern is equivalent to the output pattern – if not, it prints a counterexample. Alive has already been used by the LLVM community with great success to discover many bugs. More details about Alive can be found on John Regehr’s blog: ALIVe: Automatic LLVM InstCombine Verifier.
- Covering and testing as many patterns as possible using random tests
Csmith is a random C program generator that has been used to discover a large number of bugs in various compilers. More than 15 million programs generated using Csmith have been tested, revealing several bugs in the new optimizer, plus bugs in other optimizer components. Very helpful in dealing with the huge failing tests was C-Reduce: it was able to reduce 200KB tests to tests of 2-3KB in size, making it much easier to spot the place with the bug.
- Testing every three-instruction expression
Opt-fuzz, a tool by John Regehr from the University of Utah, is able to generate every small integer expression with N instructions and a limited number of possible constants as LLVM IR. The Clang/C2 project made it possible to test all 250+ million tests generated for three-instruction expressions, which revealed several subtle bugs.
- Using instrumentation and runtime checks
Complex components, such as the Bit Estimator and Value Numbering, were tested by instrumenting the compiled code with calls to a runtime library that verifies if the compile-time, static analysis results are actually valid. For example, in the case of the Bit Estimator, it would verify that the bits that were estimated to be always zero are zero at runtime. In the case of Value Numbering, it would ensure that two instructions that were assigned the same value number have the same value at runtime.
- Testing with popular open-source projects
Exposing the compiler to more real-world code proved to be an effective way to find more bugs. This includes building and testing Google Chrome, Mozilla Firefox, CoreCLR and Chakra.
Future improvements
As I mentioned at the start of the blog post, the framework is designed to be the place where many of the future optimizer features will be implemented. Below are some of the optimizations that are very likely to be part of the next major Visual Studio release – it does not include any of the longer-term projects that are planned:
- Complete and enable the optimization of vector operations
- Better optimization of Boolean expressions in C++ code
- Removal of operations with no effect on the expression result
- Merging similar branches
- Several Bit Estimator improvements
Closing remarks
Please try building and testing your applications with the new optimizer and report any problems that you might find. We are looking forward to your suggestions and opinions in the comments section. Let us know if you have examples of cases that could be optimized better and are not yet handled.
We are glad to finally be able to share this exciting new work with you! This marks the start of many optimizer improvements that will be added in the future releases of the compiler – we will keep you posted.
Thanks,
Gratian Lup
Visual C++ Optimizer team
1. Regarding the bit estimator, does this also work to inform it:
__assume(x & 0xFFFF == 0);
2. Regarding vectorized code, can I do something like:
void vadd(T *a, const T *b, size_t n) {
    // alert the compiler that "n" is a positive multiple of 4
    __assume(n > 0 && n % 4 == 0);
    // alert the compiler that "a" and "b" point to 16 byte aligned memory
    __assume((uintptr_t)a & 0xF == 0 && (uintptr_t)b & 0xF == 0);
    …
}
If not, then what are the sanctioned ways to do the above so the vectorizer doesn’t have to worry about bad alignment?
Hi,
1. __assume is not used by the bit estimator, but a planned improvement is to use the alignment info from variables to prove the bottom N bits are zero. Alignment info is also propagated between functions with LTCG, so if all callers use a pointer argument with the same alignment, the alignment is known for the parameter.
2. The vectorizer uses runtime checks – one example is for alignment, as mentioned above. The time spent in the checks is tiny compared to the actual vector loops, so I’m not sure it’s worth adding a non-standard extension in this case.
I also think that taking conditions inside __assume would help some people.
Therefore I have related questions:
1. Would bit/range estimators work in the code like this:
if (x & 0xFFFF != 0)
__assume(false);
It seems that __assume(false) should always work, and supposedly the new optimizer can derive information about bits from the fact that the branch is marked as dead?
2. If the code above works well, then it seems a bit strange that __assume(x & 0xFFFF == 0) does not work.
BTW, this is similar to the way __builtin_unreachable() is used in GCC.
Do you have any before/after performance results? For instance, how much did building the compiler with these optimizations turned on speed up the compiler’s throughput?
Performance depends on the application – it does improve several benchmarks by a double-digit percentage.
We did not test if there is a throughput improvement of the compiler itself, it’s an interesting idea. I think a significant improvement is unlikely though, the slow parts, in the backend at least, are more of an algorithmic nature. We do plan to replace some of the old and slower optimizations with new ones implemented in this framework.
a < 0 ? 4 : 0 -> (a >> 31) & 4
Is this supposed to be (a >> 29) & 4?
That would be another way to write it.
In my example, doing an arithmetic right shift on the sign bit produces a bitmask that is either all-ones or all-zeros. An AND with the bitmask then produces the expected result.
I am surprised that after all the heavy lifting, the measured differences between the old and the new optimizer seem quite marginal. But maybe only the comparison methodology is at fault: a saved BRANCH/DIV/MUL inside an inner loop is great; in a process startup routine (where it would be executed just once), it is immaterial.
Of course, optimizing away some instructions in the right place can give significant improvements in certain benchmarks.
If you refer to the code size wins: they are a bit hard to measure because inlining behavior changes substantially – more inlining is done (especially in C++ code with STL), and the code tends to increase if a function is inlined into multiple places. It still manages to save some code size, even though this wasn’t a priority.
The instruction type statistics are also influenced by inlining. Another thing to consider is that I showed results for the kernel, which is likely more carefully written than the average code.
I posted this on HN[1] as well but would you consider grouping “-d2SSAOptimizerUndefinedIntOverflow-” (that is disabling the optimizations) when the /sdl flag is passed to the compiler? I think this would make sense since if you’re using /sdl you would really rather it just generate safer, more secure code, and furthermore if you’re using /sdl you probably don’t care about the likely trivial performance gain from it being enabled. If other possible unsafe optimizations (e.g. overflow, aliasing, etc) are being added then you may want to consider grouping them under /sdl as well.
I realize that it’s been designed to try and only use this optimization in safe areas but really regardless of the guarantee here I’m not sure that it can always be trusted to do the right thing. I also wondered about making this option more than a hidden one but I’ll accept simply grouping it under /sdl.
[1] https://news.ycombinator.com/item?id=11640922
If you care about security, you should ensure your code does not leverage undefined behavior either way.
Perhaps in a world where I can write all the code for every dependency that my already obviously perfect code can use then maybe that would work.
Meanwhile in the real world where people are humans, make mistakes and compile code they didn’t write themselves I’m not sure how I could possibly ensure that your assumption is always true. I would rather the compiler not optimize out a check that it may deem unnecessary for a near trivial performance gain.
I don’t have an issue with the optimization, it’s when I’m explicitly compiling for security where I would rather it not be enabled.
Code with undefined behavior is undefined. One cannot make any assumptions on it. It may or may not work. One cannot blame compiler for generating “wrong” assembly if C++ code is “wrong”.
Of course people make mistakes; that’s why there are static analyzers, sanitizers and other tools to help detect any issues with code.
Bottom line is that it is a lot easier to fix the source code rather than validating that it works with every compiler that you might use or that newer versions will not break the code.
question about “Better conditional move generation section”
a < 0 ? 1 : 0 -> a >> 31 // logical shift

int main()
{
    int32_t a = -6; // a is a signed integer
    cout << (a >> 31) << endl;
    return 0;
}
it returns -1 not 1
why?
peter
Because for a signed value, a shift-arithmetic-right (sar) is done. In an sar, the “empty” spaces are filled to preserve the sign.
Your code is doing an arithmetic shift. To do a logical shift, like in the pattern, you have to first cast to unsigned int.
Are there any changes in regards to how many levels of calls can be inlined? Or in other words: is the new architecture able to inline better?
Not sure about number of levels, but it does change the inlining behavior quite a lot: by becoming smaller, functions are more likely to get inlined. C++ code using STL seems to benefit the most from it.
While testing for code generation bugs introduced by the new optimizer, I noticed that the compiler does not respect the /Zc:throwingNew option anymore. First I thought it was due to the new optimizer, but after a few more tests, it seems like the version in Update 2 already has this behavior.
In short, compiler inserts a check for null pointer after each call to operator new and if it is zero, calls into __imp__invalid_parameter_noinfo_noreturn function.
This looks like meaningless code which also affects performance.
https://connect.microsoft.com/VisualStudio/feedback/details/2674432/incorrect-code-generation-with-zc-throwingnew-option
Hey, I just tried a small example and as far as I can tell throwingNew is still being honored. I don’t repro the problem you describe. If you could attach a fully self contained example to the bug you linked that would be helpful.
Here is my example:
#include <stdio.h>

struct Y {
    __declspec(noinline) Y() { printf("Y"); }
};

Y* foo() {
    return new Y();
}
The null check in foo() is omitted correctly when compiling with /Zc:throwingNew /EHsc /c, and is present without /Zc:throwingNew.
A check for a value returned by operator new is inserted in the following code:
// #define _CRT_SECURE_INVALID_PARAMETER(expr)
#include <vector>

void main()
{
    std::vector<int> v(100);
}
Unless the first line is uncommented. So it actually looks like /Zc:throwingNew does not play nicely with secure CRT features. Or the optimizer is not smart enough to eliminate parameter checks after operator new.
How else (besides the NuGet package) can I get my hands on the optimizer?
NuGet does not seem to work for me. I always get the following errors:
[Package source] An error occurred while sending the request.
The remote name could not be resolved: ‘packagesource’
[Package source 2] An error occurred while sending the request.
The remote name could not be resolved: ‘packagesource2’
[Package source 3] An error occurred while sending the request.
The remote name could not be resolved: ‘packagesource3’
[Package source 4] An error occurred while sending the request.
The remote name could not be resolved: ‘packagesource4’
Could you actually give more details on how you are running nuget?
The nuget executable requires at least .NET 4.5 installed, do you have that?
Hi Derran
I just installed it and typed VisualCppTools in the search box. When I run it, it shows me these errors. I have never used NuGet before, and it has never worked for any other searches either.
Thanks
Adam Zielinski
@ Adam,
Package source should be set to "http://vcppdogfooding.azurewebsites.net/nuget/".
Another way to get the toolset using Nuget CLI is as below.
"nuget install VisualCppTools -source http://vcppdogfooding.azurewebsites.net/nuget/ -Prerelease"
Please let us know if that works for you.
Thanks Iyyappa
After adding the new link, NuGet works. It looks like NuGet did not come with any useful package sources by default.
Thanks again
Adam Zielinski
Ran the new optimizer through some code that does spectral analysis. Executes 2% faster.
Does this mean that we’ll have a lot of updates in the next month for windows and the other apps? A new one with each release of the compiler?
It's unlikely to happen that soon. Windows (and other teams) have a strict schedule for when they take new compilers, because a lot of testing is involved to ensure everything works as expected.
Will this optimizer be integrated completely with the "regular" compiler in future releases of Visual Studio, or will it stay as a separate package that one has to download?
Other than that, great work!
The optimizer is already integrated and enabled by default; it's just that, for now, it is available as a NuGet package – this allows us to externally release compiler versions that are very close to the latest development version. It will be released officially as part of VS 2015 Update 3.
I have noticed a couple of sequences that MSVC’s lower level 64-bit codegen doesn’t seem to do.
The first is that a bitwise OR of a 16-bit value with a constant will never emit an or r16, imm16 instruction; instead it emits a mov r32, imm32; or r16, r16 (where the second r16 in the or is the lower half of the r32 in the mov).
The second is that doing a bitwise AND of a 64-bit value with 0xFFFFFFFFFFFF0000 to clear the lower 16 bits generates and r64, imm32, where imm32 is sign-extended to 64 bits. GCC 4.9 generates a much shorter sequence that achieves the same result: xor r16, r16.
The last that I've noticed so far is that MSVC seems to be over-aggressive when zero-extending a 32-bit value to 64 bits, and will emit the zero-extension even when the upper 32 bits of the value are about to be destroyed because it's being shifted left 32 bits. In the particular case where I noticed it, this meant it was generating a mov eax, eax, which would be correct if the top 32 bits weren't clobbered by the instruction right after it.
uint32_t b;
uint64_t a = ((uint64_t)b << 32) | b;
Do you have examples for the first and last cases? I don’t see a difference between VC++ and what GCC and LLVM generate.
I tested using these examples: https://godbolt.org/g/hX8hYb
The second case must be handled long after the optimizer runs, when the code is closer to the target architecture and the sub-registers are exposed. I’m going to let the team working on that part know about this issue.
Whoops, sorry I failed to actually check back after saying this.
I actually managed to trigger both the first and third cases in the same function, which I use for case-insensitive hashing of identifiers (in the context of a compiler for a C-like language):
```
size_t CaselessIdentifierHasher::doIdentifierHash(const char* s, size_t len) {
    const char* cStr = s;
    // We know the string is null-terminated, so we can align to 2.
    size_t lenLeft = ((len + 1) & 0xFFFFFFFFFFFFFFFEULL);
    size_t iterCount = lenLeft >> 2;
    uint32_t val = 0x84222325U;
    size_t i = iterCount;
    while (i)
        val = _mm_crc32_u32(val, ((uint32_t*)cStr)[--i] | 0x20202020);
    if (lenLeft & 2)
        val = _mm_crc32_u16(val, *(uint16_t*)(cStr + (iterCount * 4)) | (uint16_t)0x2020);
    return ((size_t)val << 32) | val;
}

; 145 : size_t iterCount = lenLeft >> 2;
; 146 : uint32_t val = 0x84222325U;
0000b b8 25 23 22 84 mov eax, -2078137563 ; 84222325H
00010 4d 8b ca mov r9, r10
00013 49 c1 e9 02 shr r9, 2
; 147 : size_t i = iterCount;
00017 49 8b d1 mov rdx, r9
; 148 : while (i)
0001a 4d 85 c9 test r9, r9
0001d 74 19 je SHORT $LN3@doIdentifi
0001f 90 npad 1
$LL2@doIdentifi:
; 149 : val = _mm_crc32_u32(val, ((uint32_t*)cStr)[--i] | 0x20202020);
00020 41 8b 4c 90 fc mov ecx, DWORD PTR [r8+rdx*4-4]
00025 48 ff ca dec rdx
00028 81 c9 20 20 20
20 or ecx, 538976288 ; 20202020H
0002e f2 0f 38 f1 c1 crc32 eax, ecx
00033 48 85 d2 test rdx, rdx
00036 75 e8 jne SHORT $LL2@doIdentifi
$LN3@doIdentifi:
; 150 : if (lenLeft & 2)
00038 41 f6 c2 02 test r10b, 2
0003c 74 13 je SHORT $LN4@doIdentifi
; 151 : val = _mm_crc32_u16(val, *(uint16_t*)(cStr + (iterCount * 4)) | (uint16_t)0x2020);
0003e 43 0f b7 0c 88 movzx ecx, WORD PTR [r8+r9*4]
00043 ba 20 20 00 00 mov edx, 8224 ; 00002020H
00048 66 0b ca or cx, dx
0004b 66 f2 0f 38 f1
c1 crc32 eax, cx
$LN4@doIdentifi:
; 152 : return ((size_t)val << 32) | val;
00051 8b c8 mov ecx, eax
00053 8b c0 mov eax, eax
00055 48 c1 e0 20 shl rax, 32 ; 00000020H
00059 48 0b c1 or rax, rcx
; 153 : }
0005c c3 ret 0
```
For issue 3, see the code attributed to "; 152 : return ((size_t)val << 32) | val;"
For issue 1, see the block immediately above that, in particular:
00043 ba 20 20 00 00 mov edx, 8224 ; 00002020H
00048 66 0b ca or cx, dx
There's also the issue of the dec for i being done with a separate test on i later, when the dec could instead be moved down to where the test is, where it would actually get fused into a single micro-op; but I haven't been able to find a way to structure that loop such that any compiler would generate that :(
On a secondary side note, `size_t a = 0x84222325U; a = _mm_crc32_u32(a, a);` generates:
00000 b8 25 23 22 84 mov eax, -2078137563 ; 84222325H
00005 f2 0f 38 f1 c0 crc32 eax, eax
0000a 8b c0 mov eax, eax
This is because the compiler doesn't seem to be aware that the upper 32 bits are already zeroed by the crc32, probably due to it being a builtin.
Well, that isn’t helpful… The site seems to have mangled my code and dropped an entire part of my response :( Rather than try and deal with whatever is causing that, here it is in a gist instead: https://gist.github.com/Orvid/16d13a3044224ff9d6fdcc7d04a24bf6
This is great news!!! You mention that the optimizer is enabled by default on all architectures. Do you have any results to share when targeting ARM?
Not that much focus was put on ARM yet, but the performance improvements more or less translate to ARM. Code size does seem to improve a bit more on ARM than x86/x64.
1) I echo the request for benchmark performance results. If you're going to claim better optimization, then back it up with some numbers, please.
2) It looks like this will do a better job of removing dead code. Any chance we can get some output somewhere that this happened? It's all too common that I find a bug in some code through inspection, spend a significant amount of time trying to reproduce it, and find that I can't because it's dead code. Gosh… wish the compiler or linker had told me!
To answer your questions:
1) There are multiple benchmarks with double-digit percent performance improvements – some are part of SPEC2k and SPEC2k6, some are from other popular applications and benchmarks. The improvements you might get depend a lot on the application and where the optimizations kick in – if it's in a hot loop, you're very likely to see an improvement; if not, you won't see much. On average, code size does tend to decrease. The performance improvement can also depend on how an application is compiled – we saw different numbers when comparing -O2, -O2 with LTCG, and -O2 with LTCG and profile info for the same application.
2) Reporting code that gets deleted during optimization is much too noisy and of little use. For example, after inlining it’s quite common to delete large parts of the inlined function (think branch/switch on a parameter that was replaced by a constant).
And if the entire function is deleted everywhere in the program? Don’t you think that would be useful information to the developer? That’s one more function I don’t have to maintain, test, or worry about.
BTW: I know that information is available by the linker if /verbose is used. However, it’s very hard to dig through and only works on release builds.
Let me put it another way… how do you know an optimization did anything if you don't test it? Do you just look at the machine code and decide it's better? That seems flawed. If you do test it for performance changes before and after the optimization, where are those numbers? Maybe the problem is that you don't have a micro-benchmark test framework. They can be difficult to produce, but you only have to do it once.
So I’m not buying the excuses. Either each optimization made it better, or it didn’t. If you don’t know…that’s a huge problem. If you do know, share it please.
First, nobody stated they did not test it (not in his reply, nor in the post). Second, most of those optimizations are provable (especially if the number of ops goes down), and the rest can be shown through various estimates, based on known properties (like multiplication always being faster than division, or the recommendations in Intel's and AMD's optimization guides).
Also looking at final assembler output can tell a lot about changes. (even can help spot some kinds of bugs)
Also, your entire post is addressing problems not in evidence and seems not to address the post at all (including the alleged excuses).
OK… got it. By inspection, then. I just wish you'd said up front that there are no numbers to give.
Guess I have more to say…. Do you know what vapor-ware is? It’s when someone says they have a piece of software that does something but they are unable or unwilling to demonstrate that it actually does what they say.
So, I applaud the change you are making and _hope_ that it results in faster code but until it is _demonstrated_ to do so…it’s just words (vapor).
And I AM addressing the post directly. It's just words until you can show (with numbers) that your change makes a demonstrable difference. You've done that with the size of the executable, but not with the speed.
@DavidLarson:
If you don’t trust them now when they say that they believe they are making improvements based on their private experiments, then why would you suddenly start trusting them when they post benchmark numbers about those improvements? Wouldn’t those numbers just be more unverified words from your perspective?
Also, it’s not even formally released yet. How are they supposed to demonstrate it? And what would constitute a valid demonstration to you? You clearly don’t trust them already. Do you need a live video to ensure they aren’t lying? How do you know the video isn’t spoofed?
If you want verification, and you don’t trust them, then you should do the verification yourself, or seek it from another party that you do trust. I don’t think it’s nice to harass them about their veracity when they’re just trying to let us know generally what’s going on, what’s coming down the line, and asking us for help in checking it out.
“I just wish you’d said up front that there are not numbers to give.”
David, you replied directly to their numbers, twice:
“There are multiple benchmarks with double-digit percent performance improvements – some are part of SPEC2k and SPEC2k6, some are from other popular applications and benchmarks. ”
“double-digit percent” means >= 10%
What specifically are you looking for that these numbers aren’t enough?
There are a couple of things to remember, especially with the whole “vapourware” comments.
First, the optimizer is available in the daily compiler builds, so it exists. This contradicts the possibility of it being vapourware.
Secondly, benchmarks at this point would be worthless beyond showing that the optimizer does something. Until the optimizer code is complete and working as intended, benchmarks won't be representative of the final code. Since they stated that they are targeting Update 3, the most useful version of the compiler to test would be the Update 3 compiler.
We'll see… hopefully, when Update 3 is ready, you'll post actual numbers that demonstrate the utility of the optimizations.
You missed one key point – the new compiler is available for testing to everyone. Why don’t you test it and judge for yourself, on your own software? :)
Because when I go buy a car I avoid the sales people that say, “We don’t publish performance numbers, go find out for yourself!” or “Going from zero to 60 is an unrealistic scenario so we’re not going to doing any benchmarks”.
I honestly don’t understand the push-back. 1) claims better optimization 2) Is it measurable? 3) stop bothering us!
Weird.
[Sorry, this thing won’t let me reply to your further-nested comment.]
+100. Reminds me of a certain circus that goes ‘why dontcha pass the bill if you wanna find out what’s in it.’
Gratian,
You mentioned in the HN thread that the -d2SSAOptimizerUndefinedIntOverflow- flag would likely be renamed before Update 3, has that already happened?
cl.exe -d2SSAOptimizerUndefinedIntOverflow- bob.cc
Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24118 for x86
Copyright (C) Microsoft Corporation. All rights reserved.
bob.cc
fatal error C1007: unrecognized flag '-SSAOptimizerUndefinedIntOverflow-' in 'p2'
Hi Michael,
Yes, now it is -d2UndefIntOverflow-
I’m going to update the blog post.
Can we dump the intermediate representation code in this release?
Since the new C2 now supports SSA form, I'm not sure whether we can do that or not.
No, the IR cannot be dumped in the release build that is part of the NuGet package.
It’s possible to view the final assembly using the /FA and /FAs flags though:
https://msdn.microsoft.com/en-us/library/367y26c6.aspx
I seem to be getting the following internal compiler error on a single file project. Compiler 24116, X64 Release.
Builds fine without warnings or errors in Debug.
1>—— Rebuild All started: Project: ip_binary_search, Configuration: Release x64 ——
1> Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24116 for x64
1> Copyright (C) Microsoft Corporation. All rights reserved.
1>
1> cl /c /I"C:\Program Files (x86)\Visual Leak Detector\include" /Zi /W3 /WX- /Ox /Oi /Os /Oy- /GL /D NDEBUG /D _CONSOLE /D _UNICODE /D UNICODE /Gm- /EHsc /MD /GS- /Gy /arch:AVX /fp:precise /Zc:wchar_t /Zc:forScope /Zc:inline /Fo"x64\Release\\" /Fd"x64\Release\vc140.pdb" /Gd /TP /analyze- /errorReport:prompt main.cpp
1>
1> main.cpp
1> Generating code
1>c:\users\richardnutman\documents\visual studio 2015\projects\ip_binary_search\ip_binary_search\main.cpp : fatal error C1001: An internal error has occurred in the compiler.
1> (compiler file 'f:\dd\vctools\compiler\utc\src\p2\main.c', line 249)
1> To work around this problem, try simplifying or changing the program near the locations listed above.
1> Please choose the Technical Support command on the Visual C++
1> Help menu, or open the Technical Support help file for more information
1>
1>c:\users\richardnutman\documents\visual studio 2015\projects\ip_binary_search\ip_binary_search\main.cpp : fatal error C1001: An internal error has occurred in the compiler.
1> (compiler file 'f:\dd\vctools\compiler\utc\src\p2\main.c', line 249)
1> To work around this problem, try simplifying or changing the program near the locations listed above.
1> Please choose the Technical Support command on the Visual C++
1> Help menu, or open the Technical Support help file for more information
1>
1>
1>LINK : fatal error LNK1000: Internal error during IMAGE::BuildImage
1>
1> Version 14.00.24116.0
1>
1> ExceptionCode = C0000005
1> ExceptionFlags = 00000000
1> ExceptionAddress = 035F5BCE (034A0000) "C:\Users\richardnutman\Documents\Visual Studio 2015\Projects\ip_binary_search\packages\VisualCppTools.14.0.24116-Pre\lib\native\bin\x86_amd64\c2.dll"
1> NumberParameters = 00000002
1> ExceptionInformation[ 0] = 00000000
1> ExceptionInformation[ 1] = 00000024
1>
1> CONTEXT:
1> Eax = 00000000 Esp = 0034E7A0
1> Ebx = 04A85DFC Ebp = 0034E7F8
1> Ecx = 0061EFB8 Esi = 00000000
1> Edx = 00000002 Edi = FFFFE369
1> Eip = 035F5BCE EFlags = 00010246
1> SegCs = 00000023 SegDs = 0000002B
1> SegSs = 0000002B SegEs = 0000002B
1> SegFs = 00000053 SegGs = 0000002B
1> Dr0 = 00000000 Dr3 = 00000000
1> Dr1 = 00000000 Dr6 = 00000000
1> Dr2 = 00000000 Dr7 = 00000000
========== Rebuild All: 0 succeeded, 1 failed, 0 skipped ==========
Nevermind, seems 24124 has this fixed.
No offense, but this NuGet thing is really a pain and has really odd behavior.
Yes, I read those "help" and blog links, but none of it seems to work.
Should this work as described with Windows 8.1 or must it be Windows 10?
After messing with the thing for some time, I figured out the easiest way was to open up a project in VS2015 and use "Manage NuGet Packages…". The result of this is that it installs (all 1.23 GB of it) into the root of my C++ project folder, "\packages\VisualCppTools.14.0.24127-Pre".
Is this how it should be installed?
Is there a way to make it the global compiler choice, at least temporarily?
NuGet seems to be more for downloading and associating a library with a project, or all projects in a solution.
One way to make the compiler from the package the default – and also available from the command line – is to copy the package content to the Program Files directory. It’s quite an ugly solution, but it seems to work well:
– The default binaries are located in C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin. Make a backup of the entire folder.
– Copy all files from packages\VisualCppTools.14.0.24127-Pre\lib\native\bin over, overwriting everything.
– You should be able now to compile any project using the new compiler.
– When you are done, restore the files you backed up.
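The same backup/overwrite/restore flow, sketched as a script using throwaway temp directories in place of the real paths (on an actual machine, the source would be the package's lib\native\bin and the destination would be VC\bin):

```shell
# Stand-ins for the real directories, so the sketch runs anywhere.
VCBIN=$(mktemp -d)    # stands in for ...\VC\bin
PKGBIN=$(mktemp -d)   # stands in for packages\...\lib\native\bin
BACKUP=$(mktemp -d)

echo "original cl" > "$VCBIN/cl.exe"
echo "dogfood cl" > "$PKGBIN/cl.exe"

cp -R "$VCBIN/." "$BACKUP/"   # 1. back up the original binaries
cp -R "$PKGBIN/." "$VCBIN/"   # 2. overwrite with the package binaries
cat "$VCBIN/cl.exe"           # the new compiler is now the default
cp -R "$BACKUP/." "$VCBIN/"   # 3. restore the backup when done
cat "$VCBIN/cl.exe"           # back to the original
```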
Is this in VS.2015 Update 3? It doesn’t seem to be in the release notes.
Update 06/10/2016: The new optimizer is now also available as part of Visual Studio Update 3 RC.
And yes, it's in Update 3.
I have been running some tests with Update 3 and the performance of some of my x64 C/C++ benchmarks nearly doubled or tripled over the older compiler. Great work!!
However, it seems the optimizer is very sensitive to certain compiler switches. For example, if I turn off the buffer security check with /GS-, it seems to disable the new optimizations. If I select /O3 over /O2, I actually see the performance benefit go away as well, surprisingly. The best performance comes from just leaving everything at the default for a new Release build.
Hi,
Glad to hear about the improvements you’re seeing!
What exactly are you comparing? Update 3 optimized vs. Update 2 (or earlier) optimized? Note that there is no -O3 option for VC++: -O2 is optimize for speed, -O1 is optimize for size. /GS- should not affect things very much, but it might change the code enough for the optimizer to miss some cases.
Thanks,
Gratian
What are the chances of getting something like GCC’s __builtin_expect (aka likely/unlikely)? That’s still my most missed compiler/optimiser feature, and it’s a very easy way to tell the compiler which blocks are less likely to execute (and thus better hidden behind a branch than the other).
Prior arguments against implementing it seem to be that “people would get this wrong, use POGO instead”, but POGO seems a lot more cumbersome and error-prone to me — it’s typically very obvious to a human which code paths are less likely (especially error-handling paths; where it's not obvious, the hints can be omitted), and it seems much more sensible to bake this knowledge into the code itself as it’s written rather than relying on a profile run to replicate it every time.
Gavin,
The argument for not implementing this feature is that it is non-standard. MSVC is pushing to implement standards features, not extend the language in ways that are incompatible with other compilers. (We’ve done too much of that in our past.)
There is a standards proposal to introduce such an attribute. When it is standardized, we will implement it: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0627r0.pdf.
Andrew
Is the new code optimizer available in Visual Studio 2017?
Yes. The compiler toolset that ships in VS 2017 (v141) builds on the compiler toolset that shipped in VS 2015 Update 3 (v140.)
This new framework will ship in all of the MSVC compiler toolsets for the foreseeable future.