RyuJIT is the just-in-time compiler used by .NET Core on x64 and now x86 and by the .NET Framework on x64 to compile MSIL bytecode to native machine code when a managed assembly executes. I’d like to point out some of the past year’s improvements that have gone into RyuJIT, and how they make the generated code faster.
What follows is by no means a comprehensive list of RyuJIT optimization improvements, but rather a few hand-picked examples that should make for a fun read and point to some of the issues and pull requests on GitHub that highlight the great community interactions and contributions that have helped shape this work. Be sure to also check out Stephen Toub’s recent post about performance improvements in the runtime and base class libraries, if you haven’t already.
This post will be comparing the performance of RyuJIT in .NET Framework 4.6.2 to its performance in .NET Core 2.0 and .NET Framework 4.7.1. Note that .NET Framework 4.7.1 has not yet shipped and I am using an early private build of the product. The same RyuJIT compiler sources are shared between .NET Core and .NET Framework, so the compiler changes discussed here are present in both .NET Core 2.0 and .NET Framework 4.7.1 builds.
NOTE: Code examples included in this post use manual Stopwatch invocations, with arbitrarily fixed iteration counts and no statistical analysis, as a zero-dependency way to corroborate known large performance deltas. The timings quoted below were collected on the same machine, with compared runs executed back-to-back, but even so it would be ill-advised to extrapolate quantitative results; they serve only to confirm that the optimizations improve the performance of the targeted code sequences rather than degrade it. Active performance work, of course, demands real benchmarking, which comes with a whole host of subtle issues that it is well worth taking a dependency to manage properly. Andrey Akinshin recently wrote a great blog post discussing this, using the code snippets from Stephen’s post as examples. He will publish a follow-on post to this one with additional benchmarks soon. Thanks Andrey!
Devirtualization
The machine code sequence that the just-in-time compiler emits for a virtual call necessarily involves some degree of indirection, so that the correct target override method can be determined when the machine code executes. Compared to a direct call, this indirection imposes nontrivial overhead. RyuJIT can now identify that certain virtual call sites will always have one particular target override, and replace those virtual calls with direct ones. This avoids the overhead of the virtual indirection and, better still, allows inlining the callee method into the callsite, eliminating call overhead entirely and giving optimizations better insight into the effects of the code. This can happen when the target object has sealed type, or when its allocation site is immediately apparent and thus its exact type is known. This optimization was introduced to RyuJIT in dotnet/coreclr #9230; was subsequently improved by dotnet/coreclr #10192, dotnet/coreclr #10432, and dotnet/coreclr #10471; and has plenty more room for improvement.
The PRs for the changes include some statistics (e.g. 7.3% of virtual calls in System.Private.CoreLib get devirtualized) and real-world examples (e.g. this diff in ConcurrentStack.GetEnumerator()). To see the code diff at that link you may have to scroll past the quoted output from jit-diff, which is a tool we use for assessing compiler change impact. It reports any code size increase as a “regression”, though in this case the code size increases are likely from enabling inlines, which is actually an improvement. Here’s a minimal example to illustrate the optimization in action:
using System;
using System.Diagnostics; // for Stopwatch
using System.Runtime.CompilerServices; // for MethodImpl

public abstract class Operation // abstract unary integer operation
{
    public abstract int Operate(int input);
    public int OperateTwice(int input) => Operate(Operate(input)); // two virtual calls to Operate
}

public sealed class Increment : Operation // concrete, sealed operation: increment by fixed amount
{
    public readonly int Amount;
    public Increment(int amount = 1) { Amount = amount; }
    public override int Operate(int input) => input + Amount;
}

class Test // driver class
{
    int input;  // input for test method PostDoubleIncrement
    int output; // output for ^

    [MethodImpl(MethodImplOptions.NoInlining)]
    int PostDoubleIncrement(Increment inc) // parameter type is sealed Increment class
    {
        output = inc.OperateTwice(input); // inlining OperateTwice brings in two virtual calls to Operate
        return input; // returns input unchanged, but virtual calls obscure the unchanged-ness
    }

    public static int Main(string[] args)
    {
        var inc = new Increment();
        var test = new Test() { input = 12 };
        while (true)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 100000000; i++)
            {
                test.PostDoubleIncrement(inc);
                test.input = test.output;
            }
            Console.WriteLine(sw.Elapsed);
        }
    }
}
Method Operation.OperateTwice takes an instance parameter of abstract type Operation, and makes two virtual calls to its Operate method.
When run with the version of the RyuJIT compiler included in .NET Framework 4.6.2, OperateTwice is inlined into Test.PostDoubleIncrement, leaving PostDoubleIncrement with two virtual calls:
; Assembly listing for method Test:PostDoubleIncrement(ref):int:this
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 this     [V00,T01] ( 5, 5 )  ref     ->  rdi  this
;  V01 arg1     [V01,T00] ( 6, 6 )  ref     ->  rsi
;  V02 tmp0     [V02,T02] ( 2, 4 )  int     ->  rdx
;  V03 tmp1     [V03,T03] ( 2, 4 )  int     ->  rdx
;  V04 OutArgs  [V04    ] ( 1, 1 )  lclBlk (32) [rsp+0x00]
;
; Lcl frame size = 40
G_M20159_IG01:
  57           push rdi  ; set up frame
  56           push rsi
  4883EC28     sub rsp, 40
  488BF9       mov rdi, rcx
  488BF2       mov rsi, rdx
G_M20159_IG02:
  8B5708       mov edx, dword ptr [rdi+8]  ; load this.input
  488BCE       mov rcx, rsi  ; virtual call sequence (put target object in argument register)
  488B06       mov rax, qword ptr [rsi]  ; virtual call sequence (load object's type info)
  488B4040     mov rax, qword ptr [rax+64]  ; virtual call sequence (load vtable)
  FF5020       call qword ptr [rax+32]Operation:Operate(int):int:this  ; virtual call (indirect via vtable slot)
  8BD0         mov edx, eax
  488BCE       mov rcx, rsi  ; 2nd virtual call sequence (put target object in argument register)
  488B06       mov rax, qword ptr [rsi]  ; 2nd virtual call sequence (load object's type info)
  488B4040     mov rax, qword ptr [rax+64]  ; 2nd virtual call sequence (load vtable)
  FF5020       call qword ptr [rax+32]Operation:Operate(int):int:this  ; 2nd virtual call (indirect via vtable slot)
  89470C       mov dword ptr [rdi+12], eax
  8B4708       mov eax, dword ptr [rdi+8]  ; (re-)load this.input, into return register
G_M20159_IG03:
  4883C428     add rsp, 40  ; tear down frame
  5E           pop rsi
  5F           pop rdi
  C3           ret  ; return
; Total bytes of code 56, prolog size 6 for method Test:PostDoubleIncrement(ref):int:this
; ============================================================
When run with the version of RyuJIT included in .NET Core 2.0 and .NET Framework 4.7.1, OperateTwice is again inlined into Test.PostDoubleIncrement, but the JIT can now recognize that the instance argument to the two virtual calls pulled in by that inlining is PostDoubleIncrement’s parameter inc, which is of sealed type Increment. This allows it to rewrite the virtual calls as direct calls to the Increment.Operate override of Operation.Operate, and even inline those calls into PostDoubleIncrement, which in turn exposes the fact that this code sequence doesn’t modify instance field input, allowing the redundant load of it for the return value to be eliminated:
; Assembly listing for method Test:PostDoubleIncrement(ref):int:this
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 this     [V00,T01] ( 7,  7 )  ref     ->  rcx  this class-hnd
;  V01 arg1     [V01,T05] ( 4,  4 )  ref     ->  rdx  class-hnd
;  V02 tmp0     [V02,T02] ( 4,  8 )  int     ->  r8
;  V03 tmp1     [V03,T00] ( 5, 10 )  int     ->  r8
;# V04 OutArgs  [V04    ] ( 1,  1 )  lclBlk ( 0) [rsp+0x00]
;  V05 cse0     [V05,T03] ( 7,  7 )  int     ->  rdx
;  V06 cse1     [V06,T04] ( 6,  6 )  int     ->  rax
;
; Lcl frame size = 0
G_M42303_IG01:
G_M42303_IG02:
  8B4108       mov eax, dword ptr [rcx+8]  ; load this.input directly into return register
  448BC0       mov r8d, eax
  8B5208       mov edx, dword ptr [rdx+8]  ; load increment amount
  4403C2       add r8d, edx  ; increment once
  4403C2       add r8d, edx  ; increment again (no reloads)
  4489410C     mov dword ptr [rcx+12], r8d  ; store result in this.output
G_M42303_IG03:
  C3           ret  ; return -- input is already in return register
; Total bytes of code 20, prolog size 0 for method Test:PostDoubleIncrement(ref):int:this
; ============================================================
The optimized version does of course run faster; here’s what I see running locally on .NET Framework 4.6.2:
00:00:00.7389248
00:00:00.7390185
00:00:00.7343929
00:00:00.7355264
00:00:00.7350114
and here’s what I see running locally on .NET Core 2.0:
00:00:00.4671669
00:00:00.4676545
00:00:00.4683338
00:00:00.4674685
00:00:00.4673269
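The example above exercises the sealed-type case. The other case described earlier, where the allocation site is immediately apparent, might look like the sketch below. The types Shape and Triangle and the DevirtDemo class are hypothetical names made up for illustration, and whether a given RyuJIT build actually devirtualizes this particular call is an assumption that would need to be confirmed by inspecting the generated code.

```csharp
using System;

public abstract class Shape               // hypothetical type, not from the post
{
    public abstract int Sides();
}

public class Triangle : Shape             // deliberately NOT sealed
{
    public override int Sides() => 3;
}

public static class DevirtDemo            // hypothetical driver class
{
    public static int SidesOfNewTriangle()
    {
        // The static type of `s` is the abstract Shape, but the allocation is
        // visible right here, so the exact run-time type (Triangle) is knowable
        // even though Triangle is not sealed.
        Shape s = new Triangle();
        return s.Sides();                 // candidate for devirtualization (assumption)
    }

    public static void Main()
    {
        Console.WriteLine(SidesOfNewTriangle()); // prints 3
    }
}
```

Regardless of whether the call is devirtualized, the program's observable behavior is the same; only the generated machine code differs.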
Enhanced Range Inference
One key goal of JIT compiler optimizations is to reduce the cost of run-time safety checks by eliding the code for them when it can prove they will succeed; this necessarily falls to the JIT since the checks are explicitly dictated by MSIL semantics. RyuJIT’s optimizer accordingly focuses some of its analysis on array index expressions, to see whether it can prove they are in-bounds. @mikedn’s change dotnet/coreclr #9773 extended this analysis to recognize the common idiom of using an unsigned comparison to check upper and lower bounds in one check ((uint)i < (uint)a.len implies both i >= 0 and i < a.len for signed i). The PR notes how this trimmed the machine code generated for List.Add from 68 bytes to 48 bytes, and here’s a minimal illustrative example:
using System;
using System.Diagnostics; // for Stopwatch

class Test
{
    int[] data;

    void Set(int index, int datum)
    {
        if ((uint)index >= (uint)data.Length) // validate arguments
        {
            RaiseArgRangeException();
        }
        data[index] = datum; // array access implies bounds check, but preceding code proves it will succeed
    }

    void RaiseArgRangeException() { throw new ArgumentOutOfRangeException(); }

    static Test Instance = new Test() { data = new int[128] };

    public static int Main(string[] args)
    {
        var test = Instance;
        while (true)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 300000000; i++)
            {
                test.Set(i & 127, i);
            }
            Console.WriteLine(sw.Elapsed);
        }
    }
}
Method Set validates its index argument, and then stores to the array. The IL generated by the C# compiler for this method looks like so:
.method private hidebysig instance void Set(int32 index,
                                            int32 datum) cil managed
{
  // Code size 27 (0x1b)
  .maxstack 8
  IL_0000: ldarg.1
  IL_0001: ldarg.0
  IL_0002: ldfld int32[] Test::data
  IL_0007: ldlen
  IL_0008: conv.i4
  IL_0009: blt.un.s IL_0011 // Compare-and-branch from source "if ((uint) index >= (uint)data.Length)"
  IL_000b: ldarg.0
  IL_000c: call instance void Test::RaiseArgRangeException()
  IL_0011: ldarg.0
  IL_0012: ldfld int32[] Test::data
  IL_0017: ldarg.1
  IL_0018: ldarg.2
  IL_0019: stelem.i4 // Store array element operation carries implied bounds check
  IL_001a: ret
} // end of method Test::Set
The IL has a blt.un instruction for the argument validation, and a subsequent stelem instruction for the store that carries an implied bounds check of its own. When run with the version of the RyuJIT compiler included in .NET Framework 4.6.2, machine instructions are generated for each of these checks; here’s what the machine code for the inner loop from the Main method (into which Set gets inlined) looks like:
G_M45588_IG05:
  448BC8       mov r9d, eax
  4183E17F     and r9d, 127
  453BC1       cmp r8d, r9d  ; compare from blt.un IL opcode (from source compare)
  7642         jbe SHORT G_M45588_IG08  ; branch from ^
G_M45588_IG06:
  4C8BD2       mov r10, rdx
  453BC8       cmp r9d, r8d  ; compare for bounds check implied by stelem IL opcode
  7343         jae SHORT G_M45588_IG09  ; branch for ^
  4D63C9       movsxd r9, r9d
  4389448A10   mov dword ptr [r10+4*r9+16], eax  ; store for stelem IL opcode
  FFC0         inc eax
  3D00A3E111   cmp eax, 0x11E1A300
  7CDB         jl SHORT G_M45588_IG05
When run with the version of RyuJIT included in .NET Core 2.0 and .NET Framework 4.7.1, on the other hand, the compiler recognizes that the explicit check for the argument validation ensures that the subsequent check for the stelem instruction will always succeed, and omits the redundant check, producing this machine code:
G_M32275_IG05:
  448BC1       mov r8d, ecx
  4183E07F     and r8d, 127
  413BD0       cmp edx, r8d  ; compare from blt.un IL opcode (from source compare)
  7641         jbe SHORT G_M32275_IG08  ; branch from ^
G_M32275_IG06:
  ; No bounds check here -- optimizer proved it would succeed and elided it
  4C8BC8       mov r9, rax
  4D63C0       movsxd r8, r8d
  43894C8110   mov dword ptr [r9+4*r8+16], ecx  ; store for stelem IL opcode
  FFC1         inc ecx
  81F900A3E111 cmp ecx, 0x11E1A300
  7CDF         jl SHORT G_M32275_IG05
Importantly, this brings the machine code in line with what one might expect from looking at the source code — a check for argument validation, followed by a store to the backing array. Also, executing the code reports a speedup as expected — running on .NET Framework 4.6.2 gives me output like this:
00:00:00.4313988
00:00:00.4313209
00:00:00.4320729
00:00:00.4319180
00:00:00.4316375
and running on .NET Core 2.0 gives me output like this:
00:00:00.3235982
00:00:00.3237021
00:00:00.3250067
00:00:00.3235947
00:00:00.3236944
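The whole optimization rests on the claim that a single unsigned comparison is equivalent to the pair of signed comparisons, provided the length is non-negative (as array lengths always are). The sketch below spot-checks that equivalence on a few boundary values; the RangeCheckDemo class and its method names are hypothetical helpers written for this illustration, not part of any library.

```csharp
using System;

public static class RangeCheckDemo        // hypothetical helper class
{
    // The "obvious" form: two signed comparisons.
    public static bool InRangeTwoChecks(int i, int length) => i >= 0 && i < length;

    // The idiom: one unsigned comparison. A negative i reinterpreted as uint
    // becomes a value of at least 2^31, which exceeds any non-negative length.
    public static bool InRangeOneCheck(int i, int length) => (uint)i < (uint)length;

    // Compares the two forms on boundary values; assumes 0 <= length < int.MaxValue.
    public static bool Agree(int length)
    {
        int[] samples = { int.MinValue, -1, 0, 1, length - 1, length, length + 1, int.MaxValue };
        foreach (int i in samples)
        {
            if (InRangeTwoChecks(i, length) != InRangeOneCheck(i, length)) return false;
        }
        return true;
    }

    public static void Main()
    {
        Console.WriteLine(Agree(128));    // prints True
    }
}
```

Because the two forms agree, a successful unsigned check in the source gives the JIT everything it needs to know about the index before the array access.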
Finally Cloning
When it comes to exception handling and performance, one key goal is to minimize the cost that exception handling constructs impose on the non-exception path: if no exceptions are actually raised when the program runs, then (as much as possible) it should run as fast as it would if it didn’t have exception handlers at all. This poses a challenge for finally clauses, which execute on both exception and non-exception paths. In order to correctly support the exception path, the code of the finally must be bracketed by some set-up/tear-down code that facilitates being called from the runtime code that handles exception dispatch. Let’s look at an example:
using System;
using System.Diagnostics; // for Stopwatch
using System.Runtime.CompilerServices; // for MethodImpl

class Test
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Update(ref int left, ref int right)
    {
        try
        {
            left = checked(left + 1); // increment `left` (just to have something to do inside `try`), with checked arithmetic so exception path exists
        }
        finally
        {
            right = right + 1; // increment `right` (just to have something to do inside `finally`)
        }
    }

    public static int Main(string[] args)
    {
        while (true)
        {
            int left = 0, right = 0;
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 100000000; i++)
            {
                Update(ref left, ref right);
            }
            Console.WriteLine(sw.Elapsed);
        }
    }
}
Method Update has a finally clause that increments ref parameter right. The actual increment boils down to a single machine instruction (add dword ptr [rax], 1), but interaction with the runtime’s exception dispatch mechanism requires 5 extra instructions prior and 3 instructions after. The exception dispatch code invokes the finally handler by calling it, and with the version of RyuJIT included with .NET Framework 4.6.2, the non-exception path of method Update similarly uses a call instruction to transfer control to the finally code. Here’s what the machine code for method Update looks like with that version of the compiler:
; Assembly listing for method Test:Update(byref,byref)
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rbp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 arg0     [V00,T00] ( 4, 4 )  byref   ->  rcx
;  V01 arg1     [V01,T01] ( 4, 4 )  byref   ->  [rbp+0x18]  do-not-enreg[H]
;  V02 OutArgs  [V02    ] ( 1, 1 )  lclBlk (32) [rsp+0x00]
;  V03 PSPSym   [V03    ] ( 1, 1 )  long    ->  [rbp-0x10]  do-not-enreg[X] addr-exposed
;
; Lcl frame size = 48
G_M62690_IG01:
  55           push rbp  ; set up call frame
  4883EC30     sub rsp, 48
  488D6C2430   lea rbp, [rsp+30H]
  488965F0     mov qword ptr [rbp-10H], rsp
  48895518     mov bword ptr [rbp+18H], rdx  ; store pointer to `right` in frame
G_M62690_IG02:
  8B01         mov eax, dword ptr [rcx]
  83C001       add eax, 1  ; compute `left + 1`
  7004         jo SHORT G_M62690_IG03  ; raise exception on overflow
  8901         mov dword ptr [rcx], eax  ; store new value to `left`
  EB06         jmp SHORT G_M62690_IG04
G_M62690_IG03:
  E8CEF3AB5F   call CORINFO_HELP_OVERFLOW
  CC           int3
G_M62690_IG04:
  488BCC       mov rcx, rsp
  E807000000   call G_M62690_IG07  ; non-exception path: call finally handler
G_M62690_IG05:
  90           nop
G_M62690_IG06:
  488D6500     lea rsp, [rbp]  ; tear down call frame
  5D           pop rbp
  C3           ret  ; return
;; Code above is all of method `Update` except for the `finally` clause
;; Code below is all the `finally` clause of method `Update`
G_M62690_IG07:
  55           push rbp  ; set up a new frame for the `finally` handler
  4883EC30     sub rsp, 48
  488B6920     mov rbp, qword ptr [rcx+32]  ; find the original frame for the `Update` method
  48896C2420   mov qword ptr [rsp+20H], rbp
  488D6D30     lea rbp, [rbp+30H]
G_M62690_IG08:
  488B4518     mov rax, bword ptr [rbp+18H]  ; load pointer to `right` from parent frame
  830001       add dword ptr [rax], 1  ; increment `right`
G_M62690_IG09:
  4883C430     add rsp, 48  ; tear down handler frame
  5D           pop rbp
  C3           ret  ; return (to either exception dispatch code or `Update`)
; Total bytes of code 81, prolog size 18 for method Test:Update(byref,byref)
; ============================================================
Thanks to change dotnet/coreclr #8551, the version of RyuJIT included with .NET Core 2.0 makes a separate copy of the finally handler body, which executes on the non-exception path, and only on the non-exception path, and therefore doesn’t need any of the code that interacts with the exception dispatch code. The result (for this simple finally) is a simple inc dword ptr [rax] in lieu of the call to the finally handler code. Here’s what the machine code for method Update looks like on .NET Core 2.0:
; Assembly listing for method Test:Update(byref,byref)
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rbp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 arg0     [V00,T01] ( 4, 4 )  byref   ->  rcx
;  V01 arg1     [V01,T00] ( 6, 4 )  byref   ->  [rbp+0x18]  do-not-enreg[H]
;  V02 OutArgs  [V02    ] ( 1, 1 )  lclBlk (32) [rsp+0x00]
;  V03 PSPSym   [V03    ] ( 1, 1 )  long    ->  [rbp-0x10]  do-not-enreg[X] addr-exposed
;
; Lcl frame size = 48
G_M16352_IG01:
  55           push rbp  ; set up call frame
  4883EC30     sub rsp, 48
  488D6C2430   lea rbp, [rsp+30H]
  488965F0     mov qword ptr [rbp-10H], rsp
  48895518     mov bword ptr [rbp+18H], rdx  ; store pointer to `right` in frame
G_M16352_IG02:
  8B01         mov eax, dword ptr [rcx]
  83C001       add eax, 1  ; compute `left + 1`
  7004         jo SHORT G_M16352_IG03  ; raise exception on overflow
  8901         mov dword ptr [rcx], eax  ; store new value to `left`
  EB06         jmp SHORT G_M16352_IG04
G_M16352_IG03:
  E80E0F395F   call CORINFO_HELP_OVERFLOW
  CC           int3
G_M16352_IG04:
  488B4518     mov rax, bword ptr [rbp+18H]  ; load pointer to `right` (not really necessary; future optimizations should clean this up)
  FF00         inc dword ptr [rax]  ; non-exception path: increment `right`
G_M16352_IG05:
  488D6500     lea rsp, [rbp]  ; tear down call frame
  5D           pop rbp
  C3           ret  ; return
;; Code above is all of method `Update` except for the `finally` clause
;; Code below is all the `finally` clause of method `Update`
G_M16352_IG06:
  55           push rbp  ; set up a new frame for the `finally` handler
  4883EC30     sub rsp, 48
  488B6920     mov rbp, qword ptr [rcx+32]  ; find the original frame for the `Update` method
  48896C2420   mov qword ptr [rsp+20H], rbp
  488D6D30     lea rbp, [rbp+30H]
G_M16352_IG07:
  488B4518     mov rax, bword ptr [rbp+18H]  ; load pointer to `right` from parent frame
  FF00         inc dword ptr [rax]  ; increment `right`
G_M16352_IG08:
  4883C430     add rsp, 48  ; tear down handler frame
  5D           pop rbp
  C3           ret  ; return (to exception dispatch code)
; Total bytes of code 77, prolog size 18 for method Test:Update(byref,byref)
; ============================================================
(Note: As mentioned above, .NET Core and .NET Framework share RyuJIT compiler sources. In the case of this particular optimization, however, since the Thread.Abort mechanism that exists in .NET Framework but not .NET Core requires the optimization to perform extra work that’s not yet implemented, the compiler includes a check that disables this optimization when running on .NET Framework.)
It’s worth noting that, in terms of C# source code, this optimization applies not just to finally statements, but also to other constructs which are implemented using MSIL finally clauses, such as using statements and foreach statements involving enumerators that implement IDisposable.
As usual, the PR reports some stats (e.g. 3,000 affected methods in framework libraries). The example above gives me output like this running on .NET Framework 4.6.2:
00:00:00.8864647
00:00:00.8871649
00:00:00.8858654
00:00:00.8844547
00:00:00.8863496
and output like this running on .NET Core 2.0:
00:00:00.3945198
00:00:00.3943679
00:00:00.3954488
00:00:00.3944719
00:00:00.3948235
00:00:00.3942550
00:00:00.3943774
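To make the earlier point about using statements concrete, here is a rough sketch of how a using statement corresponds to an explicit try/finally, following the expansion described in the C# language specification for reference-type resources. The Resource type and the UsingDemo class are hypothetical names introduced only for this illustration. Both methods contain a finally clause at the MSIL level, so both are the kind of code that finally cloning targets.

```csharp
using System;

public sealed class Resource : IDisposable   // hypothetical type, for illustration only
{
    public static int DisposeCount;          // counts Dispose calls so the effect is observable
    public void Dispose() { DisposeCount++; }
}

public static class UsingDemo                // hypothetical driver class
{
    public static int Work;                  // something to do inside the protected region

    // What one writes in C#.
    public static void WithUsing()
    {
        using (var r = new Resource())
        {
            Work++;
        }
    }

    // Roughly what the C# compiler produces for the method above.
    public static void WithTryFinally()
    {
        var r = new Resource();
        try
        {
            Work++;
        }
        finally
        {
            if (r != null) r.Dispose();
        }
    }

    public static void Main()
    {
        WithUsing();
        WithTryFinally();
        Console.WriteLine(Resource.DisposeCount); // prints 2
    }
}
```

On the non-exception path, the cloned finally body lets the Dispose call run inline after the try region instead of going through the handler's call sequence.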
Shift Count Mask Removal
Generating machine code for bit-shift operations is surprisingly nuanced. Many software languages and hardware ISAs (and compiler intermediate representations) include bit-shift instructions, but in the case that the nominal shift amount is greater than or equal to the number of bits in the shifted value’s type, they have differing conventions (as do different programmers’ expectations): some interpret the shift amount modulo the number of bits, some produce the value zero (all bits “shifted out”), and some leave the result unspecified. MSIL’s shl and shr.un instructions’ results are undefined in these cases (perhaps to allow the JIT compiler to simply lower these to corresponding target machine instructions regardless of the target’s convention). C#’s << and >> operators, on the other hand, always have a defined result, interpreting the shift amount modulo the number of bits. To ensure the correct semantics, therefore, the C# compiler must emit explicit MSIL instructions to perform the modulus/bit-mask operation on the shift amount before feeding it to shl/shr. Since the x86, x64, and ARM64 ISAs’ shift instructions likewise interpret the shift amount modulo the number of bits, on these targets a single hardware instruction can be used for the whole mask+shift sequence emitted by the C# compiler. @mikedn’s change dotnet/coreclr #11594 taught RyuJIT to recognize these sequences and shrink them down appropriately. The PR reports stats showing this firing in 20 different framework assemblies, and, as always, a minimal illustrative example follows:
using System;
using System.Diagnostics; // for Stopwatch

class Test
{
    static int LeftShift(int bits, int amount) => bits << amount;

    public static int Main(string[] args)
    {
        while (true)
        {
            int a = 1, b = 1, c = 1, d = 1;
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 200000000; ++i)
            {
                // Shift several times per iteration so the loop branching
                // overhead doesn't hide the cost of the shifts.
                a = LeftShift(a, d);
                b = LeftShift(b, a);
                c = LeftShift(c, b);
                d = LeftShift(d, c);
            }
            _ = a + b + c + d;
            Console.WriteLine(sw.Elapsed);
        }
    }

    volatile static int _;
}
Method LeftShift uses only C#’s << operator, and its IL includes the modulo operation (as ldc 31 + and) to ensure those semantics:
.method private hidebysig static int32 LeftShift(int32 bits,
                                                 int32 amount) cil managed
{
  // Code size 7 (0x7)
  .maxstack 8
  IL_0000: ldarg.0
  IL_0001: ldarg.1
  IL_0002: ldc.i4.s 31 // load bitmask
  IL_0004: and // mask shift amount
  IL_0005: shl // perform shift
  IL_0006: ret
} // end of method Test::LeftShift
Running this with the version of the RyuJIT compiler included with .NET Framework 4.6.2, these masking operations are mechanically translated to hardware and instructions; here’s what the inner loop of method Main (into which LeftShift gets inlined) looks like:
G_M22508_IG04:
  8BCD         mov ecx, ebp
  83E11F       and ecx, 31
  D3E6         shl esi, cl
  8BCE         mov ecx, esi
  83E11F       and ecx, 31
  D3E7         shl edi, cl
  8BCF         mov ecx, edi
  83E11F       and ecx, 31
  D3E3         shl ebx, cl
  8BCB         mov ecx, ebx
  83E11F       and ecx, 31
  D3E5         shl ebp, cl
  FFC0         inc eax
  3D00C2EB0B   cmp eax, 0xBEBC200
  7CDB         jl SHORT G_M22508_IG04
Running this with the RyuJIT compiler built from the current master branch (this particular change was merged after forking the .NET Core 2.0 release branch, and so will be included in versions after 2.0), on the other hand, the redundant masking is eliminated:
G_M65390_IG04:
  8BCD         mov ecx, ebp
  D3E6         shl esi, cl
  8BCE         mov ecx, esi
  D3E7         shl edi, cl
  8BCF         mov ecx, edi
  D3E3         shl ebx, cl
  8BCB         mov ecx, ebx
  D3E5         shl ebp, cl
  FFC0         inc eax
  3D00C2EB0B   cmp eax, 0xBEBC200
  7CE7         jl SHORT G_M65390_IG04
Running this example code on .NET Framework 4.6.2, I see output like this:
00:00:00.8666592
00:00:00.8644551
00:00:00.8623416
00:00:00.8625029
00:00:00.8621675
Running it on .NET Core built from current master branch, I see output like this:
00:00:00.5767756
00:00:00.5747216
00:00:00.5753256
00:00:00.5747212
00:00:00.5751126
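The C# semantics that make the explicit mask necessary in the first place are easy to observe directly. The short sketch below shows that a shift count is taken modulo 32 for int operands and modulo 64 for long operands; the ShiftDemo class and its method names are hypothetical, introduced just for this illustration.

```csharp
using System;

public static class ShiftDemo             // hypothetical helper class
{
    // For int operands, C# uses only the low 5 bits of the count (count & 31).
    public static int ShiftLeft(int bits, int amount) => bits << amount;

    // For long operands, C# uses only the low 6 bits of the count (count & 63).
    public static long ShiftLeft(long bits, int amount) => bits << amount;

    public static void Main()
    {
        Console.WriteLine(ShiftLeft(1, 32));   // 32 & 31 == 0, so prints 1
        Console.WriteLine(ShiftLeft(1, 33));   // 33 & 31 == 1, so prints 2
        Console.WriteLine(ShiftLeft(1L, 65));  // 65 & 63 == 1, so prints 2
    }
}
```

Since the x86 and x64 shift instructions apply the same masking to the count in the cl register, the explicit and in the IL does no additional work on those targets, which is why it can be dropped.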
Conclusion
There’s a lot of work going on in the JIT. I hope this small sampling has provided a fun read, and invite anyone interested to join the community pushing this work forward; there’s some documentation on RyuJIT available, and active work is typically labelled codegen and/or optimization. Performance is a constant focus in JIT work, and we’ve got some exciting improvements in the pipeline (like tiered jitting), so stay tuned, and let us know what would really help light up your scenarios!