Performance Improvements in RyuJIT in .NET Core and .NET Framework

RyuJIT is the just-in-time compiler used by .NET Core on x64 and now x86 and by the .NET Framework on x64 to compile MSIL bytecode to native machine code when a managed assembly executes. I’d like to point out some of the past year’s improvements that have gone into RyuJIT, and how they make the generated code faster.

What follows is by no means a comprehensive list of RyuJIT optimization improvements. It is a few hand-picked examples that should make for a fun read, with pointers to some of the GitHub issues and pull requests that highlight the great community interactions and contributions that have helped shape this work. Be sure to also check out Stephen Toub’s recent post about performance improvements in the runtime and base class libraries, if you haven’t already.

This post will be comparing the performance of RyuJIT in .NET Framework 4.6.2 to its performance in .NET Core 2.0 and .NET Framework 4.7.1. Note that .NET Framework 4.7.1 has not yet shipped and I am using an early private build of the product. The same RyuJIT compiler sources are shared between .NET Core and .NET Framework, so the compiler changes discussed here are present in both .NET Core 2.0 and .NET Framework 4.7.1 builds.

NOTE: Code examples included in this post use manual Stopwatch invocations, with arbitrarily fixed iteration counts and no statistical analysis, as a zero-dependency way to corroborate known large performance deltas. The timings quoted below were collected on the same machine, with compared runs executed back-to-back, but even so it would be ill-advised to extrapolate quantitative results; they serve only to confirm that the optimizations improve the performance of the targeted code sequences rather than degrade it. Active performance work, of course, demands real benchmarking, which comes with a whole host of subtle issues that it is well worth taking a dependency to manage properly. Andrey Akinshin recently wrote a great blog post discussing this, using the code snippets from Stephen’s post as examples. He will publish a follow-on post to this one with additional benchmarks soon. Thanks Andrey!

Devirtualization

The machine code sequence that the just-in-time compiler emits for a virtual call necessarily involves some degree of indirection, so that the correct target override method can be determined when the machine code executes. Compared to a direct call, this indirection imposes nontrivial overhead. RyuJIT can now identify that certain virtual call sites will always have one particular target override, and replace those virtual calls with direct ones. This avoids the overhead of the virtual indirection and, better still, allows inlining the callee method into the call site, eliminating call overhead entirely and giving optimizations better insight into the effects of the code. This can happen when the target object has a sealed type, or when its allocation site is immediately apparent and thus its exact type is known.

This optimization was introduced to RyuJIT in dotnet/coreclr #9230; was subsequently improved by dotnet/coreclr #10192, dotnet/coreclr #10432, and dotnet/coreclr #10471; and has plenty more room for improvement. The PRs for the changes include some statistics (e.g. 7.3% of virtual calls in System.Private.CoreLib get devirtualized) and real-world examples (e.g. a diff in ConcurrentStack<T>.GetEnumerator()). To see the code diff in that example you may have to scroll past the quoted output from jit-diff, which is a tool we use for assessing compiler change impact. It reports any code size increase as a “regression”, though in this case the code size increases are likely from enabling inlines, which is actually an improvement.

Here’s a minimal example to illustrate the optimization in action:

	using System;
	using System.Diagnostics; // for Stopwatch
	using System.Runtime.CompilerServices; // for MethodImpl

	public abstract class Operation // abstract unary integer operation
	{
	    public abstract int Operate(int input);

	    public int OperateTwice(int input) => Operate(Operate(input)); // two virtual calls to Operate
	}

	public sealed class Increment : Operation // concrete, sealed operation: increment by fixed amount
	{
	    public readonly int Amount;
	    public Increment(int amount = 1) { Amount = amount; }

	    public override int Operate(int input) => input + Amount;
	}

	class Test // driver class
	{
	    int input; // input for test method PostDoubleIncrement
	    int output; // output for ^

	    [MethodImpl(MethodImplOptions.NoInlining)]
	    int PostDoubleIncrement(Increment inc) // parameter type is sealed Increment class
	    {
	        output = inc.OperateTwice(input); // inlining OperateTwice brings in two virtual calls to Operate
	        return input; // returns input unchanged, but virtual calls obscure the unchanged-ness
	    }

	    public static int Main(string[] args)
	    {
	        var inc = new Increment();
	        var test = new Test() { input = 12 };

	        while (true)
	        {
	            var sw = Stopwatch.StartNew();

	            for (int i = 0; i < 100000000; i++)
	            {
	                test.PostDoubleIncrement(inc);
	                test.input = test.output;
	            }

	            Console.WriteLine(sw.Elapsed);
	        }
	    }
	}

Method Operation.OperateTwice takes an instance parameter of abstract type Operation, and makes two virtual calls to its Operate method. When run with the version of the RyuJIT compiler included in .NET Framework 4.6.2, OperateTwice is inlined into Test.PostDoubleIncrement, leaving PostDoubleIncrement with two virtual calls:

	; Assembly listing for method Test:PostDoubleIncrement(ref):int:this
	; Emitting BLENDED_CODE for X64 CPU with SSE2
	; optimized code
	; rsp based frame
	; partially interruptible
	; Final local variable assignments
	;
	; V00 this [V00,T01] ( 5, 5 ) ref -> rdi this
	; V01 arg1 [V01,T00] ( 6, 6 ) ref -> rsi
	; V02 tmp0 [V02,T02] ( 2, 4 ) int -> rdx
	; V03 tmp1 [V03,T03] ( 2, 4 ) int -> rdx
	; V04 OutArgs [V04 ] ( 1, 1 ) lclBlk (32) [rsp+0x00]
	;
	; Lcl frame size = 40

	G_M20159_IG01:
	57 push rdi ; set up frame
	56 push rsi
	4883EC28 sub rsp, 40
	488BF9 mov rdi, rcx
	488BF2 mov rsi, rdx

	G_M20159_IG02:
	8B5708 mov edx, dword ptr [rdi+8] ; load this.input
	488BCE mov rcx, rsi ; virtual call sequence (put target object in argument register)
	488B06 mov rax, qword ptr [rsi] ; virtual call sequence (load object's type info)
	488B4040 mov rax, qword ptr [rax+64] ; virtual call sequence (load vtable)
	FF5020 call qword ptr [rax+32]Operation:Operate(int):int:this ; virtual call (indirect via vtable slot)
	8BD0 mov edx, eax
	488BCE mov rcx, rsi ; 2nd virtual call sequence (put target object in argument register)
	488B06 mov rax, qword ptr [rsi] ; 2nd virtual call sequence (load object's type info)
	488B4040 mov rax, qword ptr [rax+64] ; 2nd virtual call sequence (load vtable)
	FF5020 call qword ptr [rax+32]Operation:Operate(int):int:this ; 2nd virtual call (indirect via vtable slot)
	89470C mov dword ptr [rdi+12], eax
	8B4708 mov eax, dword ptr [rdi+8] ; (re-)load this.input, into return register

	G_M20159_IG03:
	4883C428 add rsp, 40 ; tear down frame
	5E pop rsi
	5F pop rdi
	C3 ret ; return

	; Total bytes of code 56, prolog size 6 for method Test:PostDoubleIncrement(ref):int:this
	; ============================================================

When run with the version of RyuJIT included in .NET Core 2.0 and .NET Framework 4.7.1, OperateTwice is again inlined into Test.PostDoubleIncrement, but the JIT can now recognize that the instance argument to the two virtual calls pulled in by that inlining is PostDoubleIncrement’s parameter inc, which is of sealed type Increment. This allows it to rewrite the virtual calls as direct calls to the Increment.Operate override of Operation.Operate, and even to inline those calls into PostDoubleIncrement. That in turn exposes the fact that this code sequence doesn’t modify instance field input, allowing the redundant load of it for the return value to be eliminated:

	; Assembly listing for method Test:PostDoubleIncrement(ref):int:this
	; Emitting BLENDED_CODE for X64 CPU with SSE2
	; optimized code
	; rsp based frame
	; partially interruptible
	; Final local variable assignments
	;
	; V00 this [V00,T01] ( 7, 7 ) ref -> rcx this class-hnd
	; V01 arg1 [V01,T05] ( 4, 4 ) ref -> rdx class-hnd
	; V02 tmp0 [V02,T02] ( 4, 8 ) int -> r8
	; V03 tmp1 [V03,T00] ( 5, 10 ) int -> r8
	;# V04 OutArgs [V04 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00]
	; V05 cse0 [V05,T03] ( 7, 7 ) int -> rdx
	; V06 cse1 [V06,T04] ( 6, 6 ) int -> rax
	;
	; Lcl frame size = 0

	G_M42303_IG01:

	G_M42303_IG02:
	8B4108 mov eax, dword ptr [rcx+8] ; load this.input directly into return register
	448BC0 mov r8d, eax
	8B5208 mov edx, dword ptr [rdx+8] ; load increment amount
	4403C2 add r8d, edx ; increment once
	4403C2 add r8d, edx ; increment again (no reloads)
	4489410C mov dword ptr [rcx+12], r8d ; store result in this.output

	G_M42303_IG03:
	C3 ret ; return -- input is already in return register

	; Total bytes of code 20, prolog size 0 for method Test:PostDoubleIncrement(ref):int:this
	; ============================================================

The optimized version does of course run faster; here’s what I see running locally on .NET Framework 4.6.2:

00:00:00.7389248
00:00:00.7390185
00:00:00.7343929
00:00:00.7355264
00:00:00.7350114

and here’s what I see running locally on .NET Core 2.0:

00:00:00.4671669
00:00:00.4676545
00:00:00.4683338
00:00:00.4674685
00:00:00.4673269
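
The example above exercises the sealed-type case. The other case mentioned earlier, where the allocation site is immediately apparent, might look like the sketch below. The names here (Shape, Square, AreaOfNewSquare, AreaOfUnknownShape) are hypothetical, and whether a particular RyuJIT build actually devirtualizes this call site is an assumption that can only be confirmed by inspecting the generated code.

```csharp
using System;

public abstract class Shape
{
    public abstract int Area();
}

// Deliberately NOT sealed: the declared type alone does not rule out further overrides.
public class Square : Shape
{
    private readonly int side;
    public Square(int side) { this.side = side; }
    public override int Area() => side * side;
}

public static class AllocationSiteDemo
{
    // The object is allocated right here, so its exact type (Square) is
    // visible at the call site even though the local is typed as Shape.
    public static int AreaOfNewSquare(int side)
    {
        Shape s = new Square(side);
        return s.Area(); // candidate for devirtualization (assumption, not verified)
    }

    // Viewed in isolation, the exact type of `s` is unknown here,
    // so this call would be expected to stay virtual.
    public static int AreaOfUnknownShape(Shape s) => s.Area();

    public static void Main()
    {
        Console.WriteLine(AreaOfNewSquare(3));                // prints 9
        Console.WriteLine(AreaOfUnknownShape(new Square(4))); // prints 16
    }
}
```

Both methods return the same results either way; the only thing that could differ is the machine code the JIT emits for the call to Area.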

Enhanced Range Inference

One key goal of JIT compiler optimizations is to reduce the cost of run-time safety checks by eliding the code for them when the compiler can prove they will succeed; this necessarily falls to the JIT since the checks are explicitly dictated by MSIL semantics. RyuJIT’s optimizer accordingly focuses some of its analysis on array index expressions, to see whether it can prove they are in-bounds. @mikedn’s change dotnet/coreclr #9773 extended this analysis to recognize the common idiom of using an unsigned comparison to check upper and lower bounds in one check (since a.len is never negative, (uint)i < (uint)a.len implies both i >= 0 and i < a.len for signed i). The PR notes how this trimmed the machine code generated for List<T>.Add from 68 bytes to 48 bytes, and here’s a minimal illustrative example:

	using System;
	using System.Diagnostics; // for Stopwatch

	class Test
	{
	    int[] data;

	    void Set(int index, int datum)
	    {
	        if ((uint)index >= (uint)data.Length) // validate arguments
	        {
	            RaiseArgRangeException();
	        }

	        data[index] = datum; // array access implies bounds check, but preceding code proves it will succeed
	    }

	    void RaiseArgRangeException() { throw new ArgumentOutOfRangeException(); }

	    static Test Instance = new Test() { data = new int[128] };

	    public static int Main(string[] args)
	    {
	        var test = Instance;

	        while (true)
	        {
	            var sw = Stopwatch.StartNew();

	            for (int i = 0; i < 300000000; i++)
	            {
	                test.Set(i & 127, i);
	            }

	            Console.WriteLine(sw.Elapsed);
	        }
	    }
	}

Method Set validates its index argument, and then stores to the array. The IL generated by the C# compiler for this method looks like so:

	.method private hidebysig instance void Set(int32 index,
	int32 datum) cil managed
	{
	// Code size 27 (0x1b)
	.maxstack 8
	IL_0000: ldarg.1
	IL_0001: ldarg.0
	IL_0002: ldfld int32[] Test::data
	IL_0007: ldlen
	IL_0008: conv.i4
	IL_0009: blt.un.s IL_0011 // Compare-and-branch from source "if ((uint) index >= (uint)data.Length)"
	IL_000b: ldarg.0
	IL_000c: call instance void Test::RaiseArgRangeException()
	IL_0011: ldarg.0
	IL_0012: ldfld int32[] Test::data
	IL_0017: ldarg.1
	IL_0018: ldarg.2
	IL_0019: stelem.i4 // Store array element operation carries implied bounds check
	IL_001a: ret
	} // end of method Test::Set

The IL has a blt.un instruction for the argument validation, and a subsequent stelem instruction for the store that carries an implied bounds check of its own. When run with the version of the RyuJIT compiler included in .NET Framework 4.6.2, machine instructions are generated for each of these checks; here’s what the machine code for the inner loop from the Main method (into which Set gets inlined) looks like:

	G_M45588_IG05:
	448BC8 mov r9d, eax
	4183E17F and r9d, 127
	453BC1 cmp r8d, r9d ; compare from blt.un IL opcode (from source compare)
	7642 jbe SHORT G_M45588_IG08 ; branch from ^

	G_M45588_IG06:
	4C8BD2 mov r10, rdx
	453BC8 cmp r9d, r8d ; compare for bounds check implied by stelem IL opcode
	7343 jae SHORT G_M45588_IG09 ; branch for ^
	4D63C9 movsxd r9, r9d
	4389448A10 mov dword ptr [r10+4*r9+16], eax ; store for stelem IL opcode
	FFC0 inc eax
	3D00A3E111 cmp eax, 0x11E1A300
	7CDB jl SHORT G_M45588_IG05

When run with the version of RyuJIT included in .NET Core 2.0 and .NET Framework 4.7.1, on the other hand, the compiler recognizes that the explicit check for the argument validation ensures that the subsequent check for the stelem instruction will always succeed, and omits the redundant check, producing this machine code:

	G_M32275_IG05:
	448BC1 mov r8d, ecx
	4183E07F and r8d, 127
	413BD0 cmp edx, r8d ; compare from blt.un IL opcode (from source compare)
	7641 jbe SHORT G_M32275_IG08 ; branch from ^

	G_M32275_IG06:
	; No bounds check here -- optimizer proved it would succeed and elided it
	4C8BC8 mov r9, rax
	4D63C0 movsxd r8, r8d
	43894C8110 mov dword ptr [r9+4*r8+16], ecx ; store for stelem IL opcode
	FFC1 inc ecx
	81F900A3E111 cmp ecx, 0x11E1A300
	7CDF jl SHORT G_M32275_IG05

Importantly, this brings the machine code in line with what one might expect from looking at the source code — a check for argument validation, followed by a store to the backing array. Also, executing the code reports a speedup as expected — running on .NET Framework 4.6.2 gives me output like this:

00:00:00.4313988
00:00:00.4313209
00:00:00.4320729
00:00:00.4319180
00:00:00.4316375

and running on .NET Core 2.0 gives me output like this:

00:00:00.3235982
00:00:00.3237021
00:00:00.3250067
00:00:00.3235947
00:00:00.3236944
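
To see why the single unsigned comparison is a valid stand-in for the pair of signed checks, here is a small sketch that compares the two forms over a handful of boundary values. The helper names (InRangeTwoChecks, InRangeOneCheck) are made up for illustration and are not part of any library.

```csharp
using System;

public static class RangeCheckDemo
{
    // Two explicit signed comparisons.
    public static bool InRangeTwoChecks(int i, int length) => i >= 0 && i < length;

    // One unsigned comparison: a negative i reinterpreted as uint is at least
    // 2^31, which is larger than any valid (non-negative) array length.
    public static bool InRangeOneCheck(int i, int length) => (uint)i < (uint)length;

    public static void Main()
    {
        int[] lengths = { 0, 1, 128, int.MaxValue };
        int[] indices = { int.MinValue, -1, 0, 1, 127, 128, int.MaxValue };

        foreach (int length in lengths)
        {
            foreach (int i in indices)
            {
                if (InRangeTwoChecks(i, length) != InRangeOneCheck(i, length))
                {
                    throw new Exception("mismatch at i=" + i + ", length=" + length);
                }
            }
        }

        Console.WriteLine("both forms agree on all sampled inputs");
    }
}
```

The equivalence relies on length being non-negative, which always holds for array lengths.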

Finally Cloning

When it comes to exception handling and performance, one key goal is to minimize the cost that exception handling constructs impose on the non-exception path — if no exceptions are actually raised when the program runs, then (as much as possible) it should run as fast as it would if it didn’t have exception handlers at all. This poses a challenge for finally clauses, which execute on both exception and non-exception paths. In order to correctly support the exception path, the code of the finally must be bracketed by some set-up/tear-down code that facilitates being called from the runtime code that handles exception dispatch. Let’s look at an example:

	using System;
	using System.Diagnostics; // for Stopwatch
	using System.Runtime.CompilerServices; // for MethodImpl

	class Test
	{
	    [MethodImpl(MethodImplOptions.NoInlining)]
	    static void Update(ref int left, ref int right)
	    {
	        try
	        {
	            left = checked(left + 1); // increment `left` (just to have something to do inside `try`), with checked arithmetic so exception path exists
	        }
	        finally
	        {
	            right = right + 1; // increment `right` (just to have something to do inside `finally`)
	        }
	    }

	    public static int Main(string[] args)
	    {
	        while (true)
	        {
	            int left = 0, right = 0;
	            var sw = Stopwatch.StartNew();

	            for (int i = 0; i < 100000000; i++)
	            {
	                Update(ref left, ref right);
	            }

	            Console.WriteLine(sw.Elapsed);
	        }
	    }
	}

Method Update has a finally clause that increments ref parameter right. The actual increment boils down to a single machine instruction (add dword ptr [rax], 1), but interaction with the runtime’s exception dispatch mechanism requires 5 extra instructions prior and 3 instructions after. The exception dispatch code invokes the finally handler by calling it, and with the version of RyuJIT included with .NET Framework 4.6.2, the non-exception path of method Update similarly uses a call instruction to transfer control to the finally code. Here’s what the machine code for method Update looks like with that version of the compiler:

	; Assembly listing for method Test:Update(byref,byref)
	; Emitting BLENDED_CODE for X64 CPU with SSE2
	; optimized code
	; rbp based frame
	; fully interruptible
	; Final local variable assignments
	;
	; V00 arg0 [V00,T00] ( 4, 4 ) byref -> rcx
	; V01 arg1 [V01,T01] ( 4, 4 ) byref -> [rbp+0x18] do-not-enreg[H]
	; V02 OutArgs [V02 ] ( 1, 1 ) lclBlk (32) [rsp+0x00]
	; V03 PSPSym [V03 ] ( 1, 1 ) long -> [rbp-0x10] do-not-enreg[X] addr-exposed
	;
	; Lcl frame size = 48

	G_M62690_IG01:
	55 push rbp ; set up call frame
	4883EC30 sub rsp, 48
	488D6C2430 lea rbp, [rsp+30H]
	488965F0 mov qword ptr [rbp-10H], rsp
	48895518 mov bword ptr [rbp+18H], rdx ; store pointer to `right` in frame

	G_M62690_IG02:
	8B01 mov eax, dword ptr [rcx]
	83C001 add eax, 1 ; compute `left + 1`
	7004 jo SHORT G_M62690_IG03 ; raise exception on overflow
	8901 mov dword ptr [rcx], eax ; store new value to `left`
	EB06 jmp SHORT G_M62690_IG04

	G_M62690_IG03:
	E8CEF3AB5F call CORINFO_HELP_OVERFLOW
	CC int3

	G_M62690_IG04:
	488BCC mov rcx, rsp
	E807000000 call G_M62690_IG07 ; non-exception path: call finally handler

	G_M62690_IG05:
	90 nop

	G_M62690_IG06:
	488D6500 lea rsp, [rbp] ; tear down call frame
	5D pop rbp
	C3 ret ; return

	;; Code above is all of method `Update` except for the `finally` clause

	;; Code below is all the `finally` clause of method `Update`
	G_M62690_IG07:
	55 push rbp ; set up a new frame for the `finally` handler
	4883EC30 sub rsp, 48
	488B6920 mov rbp, qword ptr [rcx+32] ; find the original frame for the `Update` method
	48896C2420 mov qword ptr [rsp+20H], rbp
	488D6D30 lea rbp, [rbp+30H]

	G_M62690_IG08:
	488B4518 mov rax, bword ptr [rbp+18H] ; load pointer to `right` from parent frame
	830001 add dword ptr [rax], 1 ; increment `right`

	G_M62690_IG09:
	4883C430 add rsp, 48 ; tear down handler frame
	5D pop rbp
	C3 ret ; return (to either exception dispatch code or `Update`)

	; Total bytes of code 81, prolog size 18 for method Test:Update(byref,byref)
	; ============================================================

Thanks to change dotnet/coreclr #8551, the version of RyuJIT included with .NET Core 2.0 makes a separate copy of the finally handler body, which executes on the non-exception path, and only on the non-exception path, and therefore doesn’t need any of the code that interacts with the exception dispatch code. The result (for this simple finally) is a simple inc dword ptr [rax] in lieu of the call to the finally handler code. Here’s what the machine code for method Update looks like on .NET Core 2.0:

	; Assembly listing for method Test:Update(byref,byref)
	; Emitting BLENDED_CODE for X64 CPU with SSE2
	; optimized code
	; rbp based frame
	; fully interruptible
	; Final local variable assignments
	;
	; V00 arg0 [V00,T01] ( 4, 4 ) byref -> rcx
	; V01 arg1 [V01,T00] ( 6, 4 ) byref -> [rbp+0x18] do-not-enreg[H]
	; V02 OutArgs [V02 ] ( 1, 1 ) lclBlk (32) [rsp+0x00]
	; V03 PSPSym [V03 ] ( 1, 1 ) long -> [rbp-0x10] do-not-enreg[X] addr-exposed
	;
	; Lcl frame size = 48

	G_M16352_IG01:
	55 push rbp ; set up call frame
	4883EC30 sub rsp, 48
	488D6C2430 lea rbp, [rsp+30H]
	488965F0 mov qword ptr [rbp-10H], rsp
	48895518 mov bword ptr [rbp+18H], rdx ; store pointer to `right` in frame

	G_M16352_IG02:
	8B01 mov eax, dword ptr [rcx]
	83C001 add eax, 1 ; compute `left + 1`
	7004 jo SHORT G_M16352_IG03 ; raise exception on overflow
	8901 mov dword ptr [rcx], eax ; store new value to `left`
	EB06 jmp SHORT G_M16352_IG04

	G_M16352_IG03:
	E80E0F395F call CORINFO_HELP_OVERFLOW
	CC int3

	G_M16352_IG04:
	488B4518 mov rax, bword ptr [rbp+18H] ; load pointer to `right` (not really necessary; future optimizations should clean this up)
	FF00 inc dword ptr [rax] ; non-exception path: increment `right`

	G_M16352_IG05:
	488D6500 lea rsp, [rbp] ; tear down call frame
	5D pop rbp
	C3 ret ; return

	;; Code above is all of method `Update` except for the `finally` clause

	;; Code below is all the `finally` clause of method `Update`
	G_M16352_IG06:
	55 push rbp ; set up a new frame for the `finally` handler
	4883EC30 sub rsp, 48
	488B6920 mov rbp, qword ptr [rcx+32] ; find the original frame for the `Update` method
	48896C2420 mov qword ptr [rsp+20H], rbp
	488D6D30 lea rbp, [rbp+30H]

	G_M16352_IG07:
	488B4518 mov rax, bword ptr [rbp+18H] ; load pointer to `right` from parent frame
	FF00 inc dword ptr [rax] ; increment `right`

	G_M16352_IG08:
	4883C430 add rsp, 48 ; tear down handler frame
	5D pop rbp
	C3 ret ; return (to exception dispatch code)

	; Total bytes of code 77, prolog size 18 for method Test:Update(byref,byref)
	; ============================================================

(Note: As mentioned above, .NET Core and .NET Framework share RyuJIT compiler sources. In the case of this particular optimization, however, since the Thread.Abort mechanism that exists in .NET Framework but not .NET Core requires the optimization to perform extra work that’s not yet implemented, the compiler includes a check that disables this optimization when running on .NET Framework.)

It’s worth noting that, in terms of C# source code, this optimization applies not just to finally statements, but also to other constructs which are implemented using MSIL finally clauses, such as using statements and foreach statements involving enumerators that implement IDisposable.
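
As a rough illustration of why using statements are covered, the sketch below places a using statement next to an approximation of the try/finally form the C# compiler lowers it to. Resource, UsingDemo, and the method names are hypothetical, and the hand-written expansion is a simplification rather than the compiler's exact output.

```csharp
using System;

public sealed class Resource : IDisposable
{
    public static int TouchCount;
    public static int DisposeCount;

    public void Touch() { TouchCount++; }
    public void Dispose() { DisposeCount++; }
}

public static class UsingDemo
{
    // Form 1: a `using` statement.
    public static void WithUsing()
    {
        using (var r = new Resource())
        {
            r.Touch();
        }
    }

    // Form 2: approximately what form 1 is lowered to. The `finally` clause
    // here is what the finally cloning optimization would apply to.
    public static void WithTryFinally()
    {
        var r = new Resource();
        try
        {
            r.Touch();
        }
        finally
        {
            if (r != null) r.Dispose();
        }
    }

    public static void Main()
    {
        WithUsing();
        WithTryFinally();
        Console.WriteLine(Resource.DisposeCount); // prints 2
    }
}
```

Both forms dispose the resource exactly once on the non-exception path, which is the path that finally cloning targets.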

As usual, the PR reports some stats (e.g. 3,000 affected methods in framework libraries). The example above gives me output like this running on .NET Framework 4.6.2:

00:00:00.8864647
00:00:00.8871649
00:00:00.8858654
00:00:00.8844547
00:00:00.8863496

and output like this running on .NET Core 2.0:

00:00:00.3945198
00:00:00.3943679
00:00:00.3954488
00:00:00.3944719
00:00:00.3948235
00:00:00.3942550
00:00:00.3943774

Shift Count Mask Removal

Generating machine code for bit-shift operations is surprisingly nuanced. Many software languages and hardware ISAs (and compiler intermediate representations) include bit-shift instructions, but in the case that the nominal shift amount is greater than or equal to the number of bits in the shifted value’s type, they have differing conventions (as do different programmers’ expectations): some interpret the shift amount modulo the number of bits, some produce the value zero (all bits “shifted out”), and some leave the result unspecified. MSIL’s shl and shr.un instructions’ results are undefined in these cases (perhaps to allow the JIT compiler to simply lower these to corresponding target machine instructions regardless of the target’s convention). C#’s << and >> operators, on the other hand, always have a defined result, interpreting the shift amount modulo the number of bits. To ensure the correct semantics, therefore, the C# compiler must emit explicit MSIL instructions to perform the modulus/bit-mask operation on the shift amount before feeding it to shl/shr.

Since the x86, x64, and ARM64 ISAs’ shift instructions likewise interpret the shift amount modulo the number of bits, on these targets a single hardware instruction can be used for the whole mask+shift sequence emitted by the C# compiler. @mikedn’s change dotnet/coreclr #11594 taught RyuJIT to recognize these sequences and shrink them down appropriately. The PR reports stats showing this firing in 20 different framework assemblies, and, as always, a minimal illustrative example follows:

	using System;
	using System.Diagnostics; // for Stopwatch

	class Test
	{
	    static int LeftShift(int bits, int amount) => bits << amount;

	    public static int Main(string[] args)
	    {
	        while (true)
	        {
	            int a = 1, b = 1, c = 1, d = 1;
	            var sw = Stopwatch.StartNew();

	            for (int i = 0; i < 200000000; ++i)
	            {
	                // Shift several times per iteration so the loop branching
	                // overhead doesn't hide the cost of the shifts.
	                a = LeftShift(a, d);
	                b = LeftShift(b, a);
	                c = LeftShift(c, b);
	                d = LeftShift(d, c);
	            }
	            _ = a + b + c + d; // store to a volatile field so the shift results are observably used

	            Console.WriteLine(sw.Elapsed);
	        }
	    }
	    volatile static int _;
	}

Method LeftShift uses only C#’s << operator, and its IL includes the modulo operation (as ldc 31 + and) to ensure those semantics:

	.method private hidebysig static int32 LeftShift(int32 bits,
	int32 amount) cil managed
	{
	// Code size 7 (0x7)
	.maxstack 8
	IL_0000: ldarg.0
	IL_0001: ldarg.1
	IL_0002: ldc.i4.s 31 // load bitmask
	IL_0004: and // mask shift amount
	IL_0005: shl // perform shift
	IL_0006: ret
	} // end of method Test::LeftShift

Running this with the version of the RyuJIT compiler included with .NET Framework 4.6.2, these masking operations are mechanically translated to hardware and instructions (one bitwise AND per shift); here’s what the inner loop of method Main (into which LeftShift gets inlined) looks like:

	G_M22508_IG04:
	8BCD mov ecx, ebp
	83E11F and ecx, 31
	D3E6 shl esi, cl
	8BCE mov ecx, esi
	83E11F and ecx, 31
	D3E7 shl edi, cl
	8BCF mov ecx, edi
	83E11F and ecx, 31
	D3E3 shl ebx, cl
	8BCB mov ecx, ebx
	83E11F and ecx, 31
	D3E5 shl ebp, cl
	FFC0 inc eax
	3D00C2EB0B cmp eax, 0xBEBC200
	7CDB jl SHORT G_M22508_IG04

With the RyuJIT compiler built from the current master branch, on the other hand, the redundant masking is eliminated (this particular change was merged after forking the .NET Core 2.0 release branch, and so will be included in versions after 2.0):

	G_M65390_IG04:
	8BCD mov ecx, ebp
	D3E6 shl esi, cl
	8BCE mov ecx, esi
	D3E7 shl edi, cl
	8BCF mov ecx, edi
	D3E3 shl ebx, cl
	8BCB mov ecx, ebx
	D3E5 shl ebp, cl
	FFC0 inc eax
	3D00C2EB0B cmp eax, 0xBEBC200
	7CE7 jl SHORT G_M65390_IG04

Running this example code on .NET Framework 4.6.2, I see output like this:

00:00:00.8666592
00:00:00.8644551
00:00:00.8623416
00:00:00.8625029
00:00:00.8621675

Running it on .NET Core built from current master branch, I see output like this:

00:00:00.5767756
00:00:00.5747216
00:00:00.5753256
00:00:00.5747212
00:00:00.5751126
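
The C# shift semantics that make the mask necessary in the first place can be checked directly. The sketch below uses hypothetical helper names (ShiftLeft32, ShiftLeft64) and takes the shift count as a parameter, so the results reflect the run-time behavior of << instead of compile-time constant folding.

```csharp
using System;

public static class ShiftSemanticsDemo
{
    public static int ShiftLeft32(int bits, int amount) => bits << amount;
    public static long ShiftLeft64(long bits, int amount) => bits << amount;

    public static void Main()
    {
        // int shifts use only the low 5 bits of the count (count & 31).
        Console.WriteLine(ShiftLeft32(1, 33) == ShiftLeft32(1, 1));   // prints True
        Console.WriteLine(ShiftLeft32(1, 32));                        // prints 1

        // long shifts use the low 6 bits of the count (count & 63).
        Console.WriteLine(ShiftLeft64(1L, 65) == ShiftLeft64(1L, 1)); // prints True

        // Negative counts are masked as well: -1 & 31 == 31.
        Console.WriteLine(ShiftLeft32(1, -1) == int.MinValue);        // prints True
    }
}
```

Because these results are fixed by the language, removing the explicit mask is only safe on targets whose shift instruction applies the same mask itself, which is the situation described above for x86, x64, and ARM64.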

Conclusion

There’s a lot of work going on in the JIT. I hope this small sampling has provided a fun read, and invite anyone interested to join the community pushing this work forward; there’s some documentation on RyuJIT available, and active work is typically labelled codegen and/or optimization. Performance is a constant focus in JIT work, and we’ve got some exciting improvements in the pipeline (like tiered jitting), so stay tuned, and let us know what would really help light up your scenarios!