Update (2017/06/12): Added BenchmarkDotNet blog post link.
There are many exciting aspects to .NET Core (open source, cross platform, x-copy deployable, etc.) that have been covered in posts on this blog before. To me, though, one of the most exciting aspects of .NET Core is performance. There’s been a lot of discussion about the significant advancements that have been made in ASP.NET Core performance, its status as a top contender on various TechEmpower benchmarks, and the continual advancements being made in pushing it further. However, there’s been much less discussion about some equally exciting improvements throughout the runtime and the base class libraries.
There are way too many improvements to mention. After all, as an open source project that’s very accepting of contributions, Microsoft and community developers from around the world have found places where performance is important to them and submitted pull requests to improve things. I’d like to thank all the community developers for their .NET Core contributions, some of which are specifically called out in this post. We expect that many of these improvements will be brought to the .NET Framework over the next few releases, too. For this post, I’ll provide a tour through just a small smattering of the performance improvements you’ll find in .NET Core, and in particular in .NET Core 2.0, focusing on a few examples from a variety of the core libraries.
NOTE: This blog post contains lots of example code and timings. As with any such timings, take them with a grain of salt: these were taken on one machine in one configuration (all 64-bit processes), and so you may see different results on different systems. However, I ran each test on .NET Framework 4.7 and .NET Core 2.0 on the same machine in the same configuration at approximately the same time, providing a consistent environment for each comparison. Further, normally such testing is best done with a tool like BenchmarkDotNet; I’ve not done so for this post simply to make it easy for you to copy-and-paste the samples out into a console app and try them.
Editor: See the excellent follow-up by Andrey Akinshin, in which he measures performance improvements in .NET Core with BenchmarkDotNet.
Collections
Collections are the bedrock of any application, and there are a multitude of collections available in the .NET libraries. Not every operation on every collection has been made faster, but many have. Some of these improvements are due to eliminating overheads, such as streamlining operations to enable better inlining, reducing instruction count, and so on. For example, consider this small example with a Queue<T>:
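The sample code isn't reproduced in this excerpt; a minimal sketch of the kind of micro-benchmark being described might look like the following (the iteration and trial counts are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class Program
{
    static void Main()
    {
        var q = new Queue<int>();
        for (int trial = 0; trial < 5; trial++)
        {
            var sw = Stopwatch.StartNew();
            // Repeated enqueue/dequeue keeps the queue tiny and isolates
            // the per-operation overheads the PR targeted.
            for (int i = 0; i < 100_000_000; i++)
            {
                q.Enqueue(i);
                q.Dequeue();
            }
            Console.WriteLine(sw.Elapsed);
        }
    }
}
```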
PR dotnet/corefx #2515 from OmariO removed a relatively expensive modulus operation from Enqueue and Dequeue that had dominated the costs of those operations. On my machine, this code on .NET 4.7 produces output like this:
00:00:00.9392595
00:00:00.9390453
00:00:00.9455784
00:00:00.9508294
00:00:01.0107745
whereas with .NET Core 2.0 it produces output like this:
00:00:00.5514887
00:00:00.5662477
00:00:00.5627481
00:00:00.5685286
00:00:00.5262378
As this is “wall clock” time elapsed, smaller values are better, and this shows an ~2x increase in throughput!
In other cases, operations have been made faster by changing the algorithmic complexity of an operation. It’s often best when writing software to first write a simple implementation, one that’s easily maintained and easily proven correct. However, such implementations often don’t exhibit the best possible performance, and it’s often not until a specific scenario comes along that there’s a push to improve them. For example, SortedSet<T>‘s constructor was originally written in a relatively simple way that didn’t scale well, due to (I assume accidentally) employing an O(N^2) algorithm for handling duplicates. The algorithm was fixed in .NET Core in PR dotnet/corefx #1955. The following short program exemplifies the difference the fix made:
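The program itself isn't shown in this excerpt; a sketch of the shape it likely took (the 400K element count comes from the text, while the duplicate pattern here is an assumption):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main()
    {
        // 400K values with many duplicates, to stress the constructor's
        // duplicate-handling path.
        IEnumerable<int> values = Enumerable.Range(0, 400_000).Select(i => i / 2);
        var sw = Stopwatch.StartNew();
        var set = new SortedSet<int>(values);
        Console.WriteLine(sw.Elapsed);
    }
}
```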
On my system, on .NET Framework this code takes ~7.7 seconds to execute. On .NET Core 2.0, that is reduced to ~0.013s, for an ~600x improvement (at least with 400K elements… as the fix changed the algorithmic complexity, the larger the set, the more the times will diverge).
Or consider this example on SortedSet<T>:
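The example isn't included above; a hedged sketch of a benchmark matching the description (set size and loop counts are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main()
    {
        var set = new SortedSet<int>(Enumerable.Range(0, 100_000));
        long total = 0;
        for (int trial = 0; trial < 5; trial++)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 10_000; i++)
            {
                // Only the leftmost/rightmost nodes of the underlying tree
                // are actually needed to answer Min and Max.
                total += set.Min + set.Max;
            }
            Console.WriteLine(sw.Elapsed);
        }
        GC.KeepAlive(total); // keep the loop's result observable
    }
}
```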
The implementation of Min and Max in .NET 4.7 walks the whole tree underlying the SortedSet<T>, but that’s unnecessary for finding just the min or the max, as the implementation can traverse down to just the relevant node. PR dotnet/corefx #11968 fixes the .NET Core implementation to do just that. On .NET 4.7, this example produces results like:
00:00:01.1427246
00:00:01.1295220
00:00:01.1350696
00:00:01.1502784
00:00:01.1677880
whereas on .NET Core 2.0, we get results like:
00:00:00.0861391
00:00:00.0861183
00:00:00.0866616
00:00:00.0848434
00:00:00.0860198
showing a sizeable decrease in time and increase in throughput.
Even a core workhorse like List<T> has found room for improvement. Due to JIT improvements and PRs like dotnet/coreclr #9539 from benaadams, core operations like List<T>.Add have gotten faster. Consider this small example:
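The sample isn't reproduced here; given the later description of "100 million such adds and removes from a list," a sketch along these lines fits:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class Program
{
    static void Main()
    {
        var list = new List<int>();
        for (int trial = 0; trial < 5; trial++)
        {
            var sw = Stopwatch.StartNew();
            // The list never grows past one element, so this isolates the
            // cost of Add and RemoveAt themselves.
            for (int i = 0; i < 100_000_000; i++)
            {
                list.Add(i);
                list.RemoveAt(0);
            }
            Console.WriteLine(sw.Elapsed);
        }
    }
}
```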
On .NET 4.7, I get results like:
00:00:00.4434135
00:00:00.4394329
00:00:00.4496867
00:00:00.4496383
00:00:00.4515505
and with .NET Core 2.0, I see:
00:00:00.3213094
00:00:00.3211772
00:00:00.3179631
00:00:00.3198449
00:00:00.3164009
To be sure, the fact that we can do 100 million such adds and removes from a list like this in just 0.3 seconds highlights that the operation wasn’t slow to begin with. But over the execution of an app, lists are often added to a lot, and the savings add up.
These kinds of collections improvements expand beyond just the System.Collections.Generic namespace; System.Collections.Concurrent has had many improvements as well. In fact, both ConcurrentQueue<T> and ConcurrentBag<T> were essentially completely rewritten for .NET Core 2.0, in PRs dotnet/corefx #14254 and dotnet/corefx #14126, respectively. Let’s look at a basic example, using ConcurrentQueue<T> but without any concurrency, essentially the same example as earlier with Queue<T> but with ConcurrentQueue<T> instead:
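The code isn't shown above; a sketch mirroring the earlier Queue<T> benchmark, with ConcurrentQueue<T> substituted in:

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;

class Program
{
    static void Main()
    {
        var q = new ConcurrentQueue<int>();
        for (int trial = 0; trial < 5; trial++)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 100_000_000; i++)
            {
                q.Enqueue(i);
                q.TryDequeue(out int item); // single-threaded, so always succeeds
            }
            Console.WriteLine(sw.Elapsed);
        }
    }
}
```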
On my machine on .NET 4.7, this yields output like the following:
00:00:02.6485174
00:00:02.6144919
00:00:02.6699958
00:00:02.6441047
00:00:02.6255135
Obviously the ConcurrentQueue<T> example on .NET 4.7 is slower than the Queue<T> version on .NET 4.7, as ConcurrentQueue<T> needs to employ synchronization to ensure it can be used safely concurrently. But the more interesting comparison is what happens when we run the same code on .NET Core 2.0:
00:00:01.7700190
00:00:01.8324078
00:00:01.7552966
00:00:01.7518632
00:00:01.7560811
This shows that the throughput using ConcurrentQueue<T> without any concurrency improves when switching to .NET Core 2.0 by ~30%. But there are even more interesting aspects. The changes in the implementation improved serialized throughput, but even more so reduced the synchronization between producers and consumers using the queue, which can have a more demonstrable impact on throughput. Consider the following code instead:
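The producer/consumer code isn't included in this excerpt; a hedged sketch of what's described (one spinning consumer, one producer, with an illustrative item count):

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        const int Items = 100_000_000;
        for (int trial = 0; trial < 5; trial++)
        {
            var q = new ConcurrentQueue<int>();
            var sw = Stopwatch.StartNew();
            // Consumer spins, dequeuing until it has seen every produced item.
            Task consumer = Task.Run(() =>
            {
                int received = 0;
                while (received < Items)
                {
                    if (q.TryDequeue(out int item)) received++;
                }
            });
            for (int i = 0; i < Items; i++) q.Enqueue(i);
            consumer.Wait();
            Console.WriteLine(sw.Elapsed);
        }
    }
}
```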
This example spawns a consumer that sits in a tight loop dequeuing any elements it can find, until it has consumed everything the producer adds. On .NET 4.7, this outputs results on my machine like the following:
00:00:06.1366044
00:00:05.7169339
00:00:06.3870274
00:00:05.5487718
00:00:06.6069291
whereas with .NET Core 2.0, I see results like the following:
00:00:01.2052460
00:00:01.5269184
00:00:01.4638793
00:00:01.4963922
00:00:01.4927520
That’s an ~3.5x throughput increase. But better CPU efficiency isn’t the only impact of the rewrite; memory allocation is also substantially decreased. Consider a small variation to the original test, this time looking at the number of GC collections instead of the wall-clock time:
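That variation isn't shown here; a sketch of the original single-threaded test with GC.CollectionCount bracketing the loop instead of a Stopwatch:

```csharp
using System;
using System.Collections.Concurrent;

class Program
{
    static void Main()
    {
        var q = new ConcurrentQueue<int>();
        for (int trial = 0; trial < 5; trial++)
        {
            int g0 = GC.CollectionCount(0), g1 = GC.CollectionCount(1), g2 = GC.CollectionCount(2);
            for (int i = 0; i < 100_000_000; i++)
            {
                q.Enqueue(i);
                q.TryDequeue(out int item);
            }
            // Report how many collections of each generation the loop triggered.
            Console.WriteLine($"Gen0={GC.CollectionCount(0) - g0} Gen1={GC.CollectionCount(1) - g1} Gen2={GC.CollectionCount(2) - g2}");
        }
    }
}
```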
On .NET 4.7, I get output like the following:
Gen0=162 Gen1=80 Gen2=0
Gen0=162 Gen1=81 Gen2=0
Gen0=162 Gen1=81 Gen2=0
Gen0=162 Gen1=81 Gen2=0
Gen0=162 Gen1=81 Gen2=0
whereas with .NET Core 2.0, I get output like the following:
Gen0=0 Gen1=0 Gen2=0
Gen0=0 Gen1=0 Gen2=0
Gen0=0 Gen1=0 Gen2=0
Gen0=0 Gen1=0 Gen2=0
Gen0=0 Gen1=0 Gen2=0
That’s not a typo: 0 collections. The implementation in .NET 4.7 employs a linked list of fixed-size arrays that are thrown away once the fixed number of elements are added to each; this helps to simplify the implementation, but results in lots of garbage being generated for the segments. In .NET Core 2.0, the new implementation still employs a linked list of segments, but these segments increase in size as new segments are added, and more importantly, utilize circular buffers, such that new segments only need be added if the previous segment is entirely full (though other operations on the collection, such as enumeration, can also cause the current segments to be frozen and force new segments to be created in the future). Such reductions in allocation can have a sizeable impact on the overall performance of an application.
Similar improvements surface with ConcurrentBag<T>. ConcurrentBag<T> maintains thread-local work-stealing queues, such that every thread that adds to the bag has its own queue. In .NET 4.7, these queues are implemented as linked lists of one node per element, which means that any addition to the bag incurs an allocation. In .NET Core 2.0, these queues are now arrays, which means that other than the amortized costs involved in growing the arrays, additions are allocation-free. This can be seen in the following repro:
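The repro isn't reproduced in this excerpt; a sketch of a single-threaded add/take loop, which exercises exactly the thread-local queue described (counts are illustrative):

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;

class Program
{
    static void Main()
    {
        var bag = new ConcurrentBag<int>();
        for (int trial = 0; trial < 5; trial++)
        {
            int g0 = GC.CollectionCount(0), g1 = GC.CollectionCount(1), g2 = GC.CollectionCount(2);
            var sw = Stopwatch.StartNew();
            // Adding and taking from a single thread exercises that
            // thread's work-stealing queue.
            for (int i = 0; i < 100_000_000; i++)
            {
                bag.Add(i);
                bag.TryTake(out int item);
            }
            Console.WriteLine($"Elapsed={sw.Elapsed} Gen0={GC.CollectionCount(0) - g0} Gen1={GC.CollectionCount(1) - g1} Gen2={GC.CollectionCount(2) - g2}");
        }
    }
}
```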
On .NET 4.7, this yields the following output on my machine:
Elapsed=00:00:06.5672723 Gen0=953 Gen1=0 Gen2=0
Elapsed=00:00:06.4829793 Gen0=954 Gen1=1 Gen2=0
Elapsed=00:00:06.9008532 Gen0=954 Gen1=0 Gen2=0
Elapsed=00:00:06.6485667 Gen0=953 Gen1=1 Gen2=0
Elapsed=00:00:06.4671746 Gen0=954 Gen1=1 Gen2=0
whereas with .NET Core 2.0 I get:
Elapsed=00:00:04.3377355 Gen0=0 Gen1=0 Gen2=0
Elapsed=00:00:04.2892791 Gen0=0 Gen1=0 Gen2=0
Elapsed=00:00:04.3101593 Gen0=0 Gen1=0 Gen2=0
Elapsed=00:00:04.2652497 Gen0=0 Gen1=0 Gen2=0
Elapsed=00:00:04.2808077 Gen0=0 Gen1=0 Gen2=0
That’s an ~30% improvement in throughput, and a huge (complete) reduction in allocations and resulting garbage collections.
LINQ
In application code, collections often go hand-in-hand with Language Integrated Query (LINQ), which has seen even more improvements. Many of the operators in LINQ have been entirely rewritten for .NET Core in order to reduce the number and size of allocations, reduce algorithmic complexity, and generally eliminate unnecessary work.
For example, the Enumerable.Concat method is used to create a single IEnumerable<T> that first yields all of the elements of one enumerable and then all the elements of a second. Its implementation in .NET 4.7 is simple and easy to understand, reflecting exactly this statement of behavior:
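The implementation isn't shown in this excerpt, but it's essentially the canonical two-loop iterator, paraphrased here (argument validation omitted):

```csharp
using System.Collections.Generic;

static class EnumerableSketch
{
    // A paraphrase of the straightforward approach: yield everything from
    // the first sequence, then everything from the second.
    public static IEnumerable<TSource> Concat<TSource>(
        IEnumerable<TSource> first, IEnumerable<TSource> second)
    {
        foreach (TSource element in first) yield return element;
        foreach (TSource element in second) yield return element;
    }
}
```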
This is about as good as you can expect when the two sequences are simple enumerables like those produced by an iterator in C#. But what happens when application code chains many Concat calls together, feeding the result of each Concat into the next?
Every time we yield out of an iterator, we return out of the enumerator’s MoveNext method. That means if you yield an element from enumerating another iterator, you’re returning out of two MoveNext methods, and moving to the next element requires calling back into both of those MoveNext methods. The more enumerators you need to call into, the longer the operation takes, especially since every one of those operations involves multiple interface calls (MoveNext and Current). That means that the cost of concatenating multiple enumerables grows quadratically rather than linearly with the number of enumerables involved. PR dotnet/corefx #6131 fixed that, and the difference is obvious in the following example, which concatenates 10K enumerables of 10 elements each:
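The example itself isn't reproduced here; a sketch that builds the described chain of 10K concatenated 10-element enumerables and times a full enumeration:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main()
    {
        // Chain 10,000 Concat calls, each appending 10 more elements.
        IEnumerable<int> source = Enumerable.Empty<int>();
        for (int i = 0; i < 10_000; i++)
        {
            source = source.Concat(Enumerable.Range(0, 10));
        }

        var sw = Stopwatch.StartNew();
        foreach (int item in source) { } // enumerate everything
        Console.WriteLine(sw.Elapsed);
    }
}
```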
On my machine on .NET 4.7, this takes ~4.12 seconds. On my machine on .NET Core 2.0, this takes only ~0.14 seconds, for an ~30x improvement.
Other operators have been improved substantially by eliminating overheads involved when various operators are used together. For example, a multitude of PRs from JonHanna have gone into optimizing various such cases and into making it easier to add more cases in the future. Consider this example:
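The example isn't shown above; based on the description that follows, a sketch might be:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main()
    {
        // The numbers 10,000,000 down to 0.
        IEnumerable<int> source = Enumerable.Range(0, 10_000_001).Reverse();
        for (int trial = 0; trial < 5; trial++)
        {
            var sw = Stopwatch.StartNew();
            // Sort ascending, skip the first four, grab the fifth element.
            int fifth = source.OrderBy(i => i).Skip(4).First(); // 4
            Console.WriteLine(sw.Elapsed);
        }
    }
}
```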
Here we create an enumerable of the numbers 10,000,000 down to 0, and then time how long it takes to sort them ascending, skip the first 4 elements of the sorted result, and grab the fifth one (which will be 4, as the sequence starts at 0). On my machine on .NET 4.7, I get output like:
00:00:01.3879042
00:00:01.3438509
00:00:01.4141820
00:00:01.4248908
00:00:01.3548279
whereas with .NET Core 2.0, I get output like:
00:00:00.1776617
00:00:00.1787467
00:00:00.1754809
00:00:00.1765863
00:00:00.1735489
That’s a sizeable improvement (~8x), in this case due primarily (though not exclusively) to PR dotnet/corefx #2401, which avoids most of the costs of the sort.
Similarly, PR dotnet/corefx #3429 from justinvp added optimizations around the common ToList method, providing optimized paths for when the source had a known length, and plumbing that through operators like Select. The impact of this is evident in a simple test like the following:
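The test isn't included in this excerpt; a sketch of a source with a known length flowing through Select into ToList (the element count is illustrative):

```csharp
using System;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main()
    {
        for (int trial = 0; trial < 5; trial++)
        {
            var sw = Stopwatch.StartNew();
            // Range has a known length, which the optimized path can plumb
            // through Select so ToList can presize its backing array.
            var list = Enumerable.Range(0, 10_000_000).Select(i => i * 2).ToList();
            Console.WriteLine(sw.Elapsed);
        }
    }
}
```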
On .NET 4.7, this provides results like:
00:00:00.1308687
00:00:00.1228546
00:00:00.1268445
00:00:00.1247647
00:00:00.1503511
whereas on .NET Core 2.0, I get results like the following:
00:00:00.0386857
00:00:00.0337234
00:00:00.0346344
00:00:00.0345419
00:00:00.0355355
showing an ~4x increase in throughput.
In other cases, the performance wins have come from streamlining the implementation to avoid overheads, such as reducing allocations, avoiding delegate allocations, avoiding interface calls, minimizing field reads and writes, avoiding copies, and so on. For example, jamesqo contributed PR dotnet/corefx #11208, which substantially reduced overheads involved in Enumerable.ToArray, in particular by better managing how the internal buffer(s) used grow to accommodate the unknown amount of data being aggregated. To see this, consider this simple example:
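The example isn't reproduced here; a sketch in which the source length is hidden from ToArray, forcing the internal buffer growth the PR improved (counts are illustrative):

```csharp
using System;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main()
    {
        for (int trial = 0; trial < 5; trial++)
        {
            int g0 = GC.CollectionCount(0), g1 = GC.CollectionCount(1), g2 = GC.CollectionCount(2);
            var sw = Stopwatch.StartNew();
            // The Where hides the length, so ToArray must grow buffers as it goes.
            int[] array = Enumerable.Range(0, 10_000_000).Where(i => true).ToArray();
            Console.WriteLine($"Elapsed={sw.Elapsed} Gen0={GC.CollectionCount(0) - g0} Gen1={GC.CollectionCount(1) - g1} Gen2={GC.CollectionCount(2) - g2}");
        }
    }
}
```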
On .NET 4.7, I get results like:
Elapsed=00:00:01.0548794 Gen0=2 Gen1=2 Gen2=2
Elapsed=00:00:01.1147146 Gen0=2 Gen1=2 Gen2=2
Elapsed=00:00:01.0709146 Gen0=2 Gen1=2 Gen2=2
Elapsed=00:00:01.0706030 Gen0=2 Gen1=2 Gen2=2
Elapsed=00:00:01.0620943 Gen0=2 Gen1=2 Gen2=2
and on .NET Core 2.0, results like:
Elapsed=00:00:00.1716550 Gen0=1 Gen1=1 Gen2=1
Elapsed=00:00:00.1720829 Gen0=1 Gen1=1 Gen2=1
Elapsed=00:00:00.1717145 Gen0=1 Gen1=1 Gen2=1
Elapsed=00:00:00.1713335 Gen0=1 Gen1=1 Gen2=1
Elapsed=00:00:00.1705285 Gen0=1 Gen1=1 Gen2=1
so for this example ~6x faster with half the garbage collections.
There are over a hundred operators in LINQ, and while I’ve only mentioned a few, many of them have been subject to these kinds of improvements.
Compression
The examples shown thus far, of collections and LINQ, have been about manipulating data in memory. There are of course many other forms of data manipulation, including transformations that are heavily CPU-bound in nature. Investments have also been made in improving such operations.
One key example is compression, such as with DeflateStream, and several impactful performance changes have gone in here. For example, in .NET 4.7, zlib (a native compression library) is used for compressing data, but a relatively unoptimized managed implementation is used for decompressing data; PR dotnet/corefx #2906 added .NET Core support for using zlib for decompression as well. And PR dotnet/corefx #5674 from bjjones enabled using a more optimized version of zlib produced by Intel. These combine to a fairly dramatic effect. Consider this example, which just creates a large array of (fairly compressible) data:
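The example isn't shown in this excerpt; a sketch that round-trips a large compressible buffer through DeflateStream (the payload size and pattern are assumptions):

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;

class Program
{
    static void Main()
    {
        // A large, fairly compressible payload.
        byte[] data = new byte[100 * 1024 * 1024];
        for (int i = 0; i < data.Length; i++) data[i] = (byte)(i % 64);

        var sw = Stopwatch.StartNew();

        var compressed = new MemoryStream();
        using (var ds = new DeflateStream(compressed, CompressionMode.Compress, leaveOpen: true))
        {
            ds.Write(data, 0, data.Length);
        }

        compressed.Position = 0;
        using (var ds = new DeflateStream(compressed, CompressionMode.Decompress))
        {
            ds.CopyTo(Stream.Null); // decompress, discarding the output
        }

        Console.WriteLine(sw.Elapsed);
    }
}
```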
On .NET 4.7, for this one compression/decompression operation I get results like:
00:00:00.7977190
whereas with .NET Core 2.0, I get results like:
00:00:00.1926701
Cryptography
Another common source of compute in a .NET application is the use of cryptographic operations. Improvements can be seen here as well. For example, in .NET 4.7, SHA256.Create returns a SHA256 type implemented in managed code, and while managed code can be made to run very fast, for very compute-bound computations it’s still hard to compete with the raw throughput and compiler optimizations available to code written in C/C++. In contrast, for .NET Core 2.0, SHA256.Create returns an implementation based on the underlying operating system, e.g. using CNG on Windows or OpenSSL on Unix. The impact can be seen in this simple example that hashes a 100MB byte array:
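The example isn't reproduced here; a sketch hashing a 100MB buffer, per the description:

```csharp
using System;
using System.Diagnostics;
using System.Security.Cryptography;

class Program
{
    static void Main()
    {
        byte[] data = new byte[100 * 1024 * 1024]; // 100MB of arbitrary data
        new Random(42).NextBytes(data);

        using (SHA256 sha = SHA256.Create())
        {
            var sw = Stopwatch.StartNew();
            sha.ComputeHash(data);
            Console.WriteLine(sw.Elapsed);
        }
    }
}
```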
On .NET 4.7, I get:
00:00:00.7576808
whereas with .NET Core 2.0, I get:
00:00:00.4032290
Another nice improvement requiring zero code changes.
Math
Mathematical operations are also a large source of computation, especially when dealing with large numbers. Through PRs like dotnet/corefx #2182, axelheer made some substantial improvements to various operations on BigInteger. Consider the following example:
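The example isn't shown above; one of the operations those PRs improved was modular exponentiation, so a hedged sketch (the operand sizes are illustrative assumptions) might be:

```csharp
using System;
using System.Diagnostics;
using System.Numerics;

class Program
{
    static void Main()
    {
        var rand = new Random(42);
        BigInteger value = RandomPositive(rand, 2048);
        BigInteger exponent = RandomPositive(rand, 2048);
        BigInteger modulus = RandomPositive(rand, 2048);

        var sw = Stopwatch.StartNew();
        BigInteger.ModPow(value, exponent, modulus);
        Console.WriteLine(sw.Elapsed);
    }

    static BigInteger RandomPositive(Random rand, int byteCount)
    {
        var bytes = new byte[byteCount + 1];
        rand.NextBytes(bytes);
        bytes[byteCount] = 0; // zero most-significant byte => non-negative value
        return new BigInteger(bytes);
    }
}
```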
On my machine on .NET 4.7, this outputs results like:
00:00:05.6024158
The same code on .NET Core 2.0 instead outputs results like:
00:00:01.2707089
This is another great example of a developer caring a lot about a particular area of .NET and helping to make it better for their own needs and for everyone else that might be using it.
Even some math operations on core integral types have been improved. For example, consider:
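The snippet isn't included here; a sketch exercising Math.DivRem in a tight loop (the operands and iteration count are illustrative):

```csharp
using System;
using System.Diagnostics;

class Program
{
    static void Main()
    {
        long total = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 1; i <= 100_000_000; i++)
        {
            // DivRem returns the quotient and outputs the remainder.
            total += Math.DivRem(123456789, i, out int remainder) + remainder;
        }
        Console.WriteLine(sw.Elapsed);
        GC.KeepAlive(total); // keep the accumulated result observable
    }
}
```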
PR dotnet/coreclr #8125 replaced DivRem with a faster implementation, such that on .NET 4.7 I get results like:
00:00:01.4143100
and on .NET Core 2.0 I get results like:
00:00:00.7469733
for an ~2x improvement in throughput.
Serialization
Binary serialization is another area of .NET that can be fairly CPU/data/memory intensive. BinaryFormatter is a component that was initially left out of .NET Core, but it reappears in .NET Core 2.0 in support of existing code that needs it (in general, other forms of serialization are recommended for new code). The component is an almost identical port of the code from .NET 4.7, with the exception of tactical fixes that have been made to it since, in particular around performance. For example, PR dotnet/corefx #17949 is a one-line fix that increases the maximum size that a particular array is allowed to grow to, but that one change can have a substantial impact on throughput, by allowing an O(N) algorithm to operate for much longer than it previously would have before switching to an O(N^2) algorithm. This is evident in the following code example:
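The code isn't reproduced in this excerpt; a sketch that serializes a large object graph and times deserializing it (the element type and count are assumptions):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

class Program
{
    static void Main()
    {
        // A large object graph with many distinct objects.
        var items = new List<string>();
        for (int i = 0; i < 1_000_000; i++) items.Add("item" + i);

        var formatter = new BinaryFormatter();
        var stream = new MemoryStream();
        formatter.Serialize(stream, items);
        stream.Position = 0;

        var sw = Stopwatch.StartNew();
        formatter.Deserialize(stream);
        Console.WriteLine(sw.Elapsed.TotalSeconds);
    }
}
```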
On .NET 4.7, this code outputs results like:
76.677144
whereas on .NET Core 2.0, it outputs results like:
6.4044694
showing an ~12x throughput improvement for this case. In other words, it’s able to deal with much larger serialized inputs more efficiently.
Text Processing
Another very common form of computation in .NET applications is the processing of text, and a large number of improvements have gone in here, at various levels of the stack.
Consider Regex. This type is commonly used to validate and parse data from input text. Here’s an example that uses Regex.IsMatch to repeatedly match phone numbers:
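The example isn't shown above; a sketch using a hypothetical phone-number pattern and input (both are stand-ins, not the originals):

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Hypothetical US-style phone-number pattern for illustration.
        const string Pattern = @"^\d{3}-\d{3}-\d{4}$";
        int g0 = GC.CollectionCount(0);
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 1_000_000; i++)
        {
            Regex.IsMatch("555-867-5309", Pattern);
        }
        Console.WriteLine($"Elapsed={sw.Elapsed} Gen0={GC.CollectionCount(0) - g0}");
    }
}
```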
On my machine on .NET 4.7, I get results like:
Elapsed=00:00:05.4367262 Gen0=820 Gen1=0 Gen2=0
whereas with .NET Core 2.0, I get results like:
Elapsed=00:00:04.0231373 Gen0=248
That’s an ~25% improvement in throughput and an ~70% reduction in allocation / garbage collections, due to a small change in PR dotnet/corefx #231 that made a fix to how some data is cached.
Another example of text processing is in various forms of encoding and decoding, such as URL decoding via WebUtility.UrlDecode. It’s often the case in decoding methods like this one that the input doesn’t actually need any decoding, but the input is still passed through the decoder in case it does. Thanks to PR dotnet/corefx #7671 from hughbe, this case has been optimized. So, for example, with this program:
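The program isn't included here; a sketch that decodes an input needing no actual decoding, to hit the optimized pass-through path (the input string is an assumption):

```csharp
using System;
using System.Diagnostics;
using System.Net;

class Program
{
    static void Main()
    {
        // No percent-escapes or '+' characters, so no decoding is needed.
        const string Input = "https://blogs.msdn.microsoft.com/dotnet";
        int g0 = GC.CollectionCount(0);
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 10_000_000; i++)
        {
            WebUtility.UrlDecode(Input);
        }
        Console.WriteLine($"Elapsed={sw.Elapsed} Gen0={GC.CollectionCount(0) - g0}");
    }
}
```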
on .NET 4.7, I see the following output:
Elapsed=00:00:01.6742583 Gen0=648
whereas on .NET Core 2.0, I see this output:
Elapsed=00:00:01.2255288 Gen0=133
Other forms of encoding and decoding have also been improved. For example, dotnet/coreclr #10124 optimized the loops involved in using some of the built-in Encoding-derived types. So, for example, this code that repeatedly encodes an ASCII input string as UTF8:
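The code isn't reproduced in this excerpt; a sketch that repeatedly UTF8-encodes an ASCII-only string (string length and iteration count are illustrative):

```csharp
using System;
using System.Diagnostics;
using System.Text;

class Program
{
    static void Main()
    {
        string input = new string('a', 100); // ASCII-only input
        for (int trial = 0; trial < 5; trial++)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 10_000_000; i++)
            {
                Encoding.UTF8.GetBytes(input);
            }
            Console.WriteLine(sw.Elapsed);
        }
    }
}
```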
on .NET 4.7 produces output for me like:
00:00:02.4028829
00:00:02.3743152
00:00:02.3401392
00:00:02.4024785
00:00:02.3550876
and on .NET Core 2.0 produces output for me like:
00:00:01.6133550
00:00:01.5915718
00:00:01.5759625
00:00:01.6070851
00:00:01.6070767
These kinds of improvements extend as well to general Parse and ToString methods in .NET for converting between strings and other representations. For example, it’s fairly common to use enums to represent various kinds of state, and to use Enum.Parse to parse a string into a corresponding Enum. PR dotnet/coreclr #2933 helped to improve this. Consider the following code:
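The code isn't shown above; a sketch with a hypothetical enum (the enum definition, value names, and counts are all assumptions for illustration):

```csharp
using System;
using System.Diagnostics;

enum Colors { Red, Orange, Yellow, Green, Blue } // hypothetical enum

class Program
{
    static void Main()
    {
        for (int trial = 0; trial < 5; trial++)
        {
            int g0 = GC.CollectionCount(0);
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 2_000_000; i++)
            {
                var color = (Colors)Enum.Parse(typeof(Colors), "Blue");
            }
            Console.WriteLine($"Elapsed={sw.Elapsed} Gen0={GC.CollectionCount(0) - g0}");
        }
    }
}
```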
On .NET 4.7, I get results like:
Elapsed=00:00:00.9529354 Gen0=293
Elapsed=00:00:00.9422960 Gen0=294
Elapsed=00:00:00.9419024 Gen0=294
Elapsed=00:00:00.9417014 Gen0=294
Elapsed=00:00:00.9514724 Gen0=293
and on .NET Core 2.0, I get results like:
Elapsed=00:00:00.6448327 Gen0=11
Elapsed=00:00:00.6438907 Gen0=11
Elapsed=00:00:00.6285656 Gen0=12
Elapsed=00:00:00.6286561 Gen0=11
Elapsed=00:00:00.6294286 Gen0=12
so not only an ~33% improvement in throughput, but also an ~25x reduction in allocations and associated garbage collections.
Or consider PRs dotnet/coreclr #7836 and dotnet/coreclr #7891, which improved DateTime.ToString with formats “o” (the round-trip date/time pattern) and “r” (the RFC1123 pattern), respectively. The result is that given code like this:
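The code isn't included here; a sketch formatting a DateTime with both of the named formats (the iteration count is illustrative):

```csharp
using System;
using System.Diagnostics;

class Program
{
    static void Main()
    {
        DateTime dt = DateTime.Now;
        for (int trial = 0; trial < 5; trial++)
        {
            int g0 = GC.CollectionCount(0);
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 2_000_000; i++)
            {
                dt.ToString("o"); // round-trip date/time pattern
                dt.ToString("r"); // RFC1123 pattern
            }
            Console.WriteLine($"Elapsed={sw.Elapsed} Gen0={GC.CollectionCount(0) - g0}");
        }
    }
}
```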
on .NET 4.7 I see output like:
Elapsed=00:00:03.7552059 Gen0=949
Elapsed=00:00:03.6992357 Gen0=950
Elapsed=00:00:03.5459498 Gen0=950
Elapsed=00:00:03.5565029 Gen0=950
Elapsed=00:00:03.5388134 Gen0=950
and on .NET Core 2.0 output like:
Elapsed=00:00:01.3588804 Gen0=87
Elapsed=00:00:01.3932658 Gen0=88
Elapsed=00:00:01.3607030 Gen0=88
Elapsed=00:00:01.3675958 Gen0=87
Elapsed=00:00:01.3546522 Gen0=88
That’s an almost 3x increase in throughput and a whopping ~90% reduction in allocations / garbage collections.
Of course, there’s lots of custom text processing done in .NET applications, beyond using built in types like Regex/Encoding and built-in operations like Parse and ToString, often built directly on top of string, and lots of improvements have gone into operations on String itself.
For example, String.IndexOf is very commonly used to find characters in strings. IndexOf was improved in dotnet/coreclr #5327 by bbowyersmyth, who’s submitted a bunch of performance improvements for String. So this example:
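The example isn't reproduced in this excerpt; a sketch in which the sought character sits at the end of the string, forcing a full scan (string length and counts are assumptions):

```csharp
using System;
using System.Diagnostics;

class Program
{
    static void Main()
    {
        // Target character at the very end forces IndexOf to scan everything.
        string input = new string('a', 1000) + "b";
        long total = 0;
        for (int trial = 0; trial < 5; trial++)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 1_000_000; i++)
            {
                total += input.IndexOf('b');
            }
            Console.WriteLine(sw.Elapsed);
        }
        GC.KeepAlive(total);
    }
}
```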
on .NET 4.7 produces results for me like this:
00:00:05.9718129
00:00:05.9199793
00:00:06.0203108
00:00:05.9458049
00:00:05.9622262
whereas on .NET Core 2.0 it produces results for me like this:
00:00:03.1283763
00:00:03.0925150
00:00:02.9778923
00:00:03.0782851
for an ~2x improvement in throughput.
Or consider comparing strings. Here’s an example that uses String.StartsWith and ordinal comparisons:
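The example isn't shown above; a sketch of a StartsWith loop using an ordinal comparison (the strings and iteration count are illustrative):

```csharp
using System;
using System.Diagnostics;

class Program
{
    static void Main()
    {
        string s = "abcdefghijklmnopqrstuvwxyz";
        bool result = false;
        for (int trial = 0; trial < 5; trial++)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 100_000_000; i++)
            {
                result ^= s.StartsWith("abcdefghij", StringComparison.Ordinal);
            }
            Console.WriteLine(sw.Elapsed);
        }
        GC.KeepAlive(result);
    }
}
```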
Thanks to dotnet/coreclr #2825, on .NET 4.7 I get results like:
00:00:01.3097317
00:00:01.3072381
00:00:01.3045015
00:00:01.3068244
00:00:01.3210207
and on .NET Core 2.0 results like:
00:00:00.6239002
00:00:00.6150021
00:00:00.6147173
00:00:00.6129136
00:00:00.6099822
It’s quite fun looking through all of the changes that have gone into String, seeing their impact, and thinking about the additional possibilities for more improvements.
File System
Thus far I’ve been focusing on various improvements around manipulating data in memory. But lots of the changes that have gone into .NET Core have been about I/O.
Let’s start with files. Here’s an example of asynchronously reading all of the data from one file and writing it to another (using FileStreams configured to use async I/O):
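The example isn't reproduced in this excerpt; a sketch of an async file copy with FileStreams opened for async I/O (file size, buffer size, and repeat count are assumptions):

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;

class Program
{
    static void Main() => RunAsync().GetAwaiter().GetResult();

    static async Task RunAsync()
    {
        string source = Path.GetTempFileName(), destination = Path.GetTempFileName();
        File.WriteAllBytes(source, new byte[100 * 1024 * 1024]);

        int g0 = GC.CollectionCount(0), g1 = GC.CollectionCount(1), g2 = GC.CollectionCount(2);
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 10; i++)
        {
            // useAsync: true opens the streams for overlapped (async) I/O.
            using (var src = new FileStream(source, FileMode.Open, FileAccess.Read, FileShare.Read, 0x1000, useAsync: true))
            using (var dst = new FileStream(destination, FileMode.Create, FileAccess.Write, FileShare.None, 0x1000, useAsync: true))
            {
                await src.CopyToAsync(dst);
            }
        }
        Console.WriteLine($"Elapsed={sw.Elapsed} Gen0={GC.CollectionCount(0) - g0} Gen1={GC.CollectionCount(1) - g1} Gen2={GC.CollectionCount(2) - g2}");

        File.Delete(source);
        File.Delete(destination);
    }
}
```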
A bunch of PRs have gone into reducing the overheads involved in FileStream, such as dotnet/corefx #11569 which adds a specialized CopyToAsync implementation, and dotnet/corefx #2929 which improves how asynchronous writes are handled, and so when running this on .NET 4.7 I get results like:
Elapsed=00:00:09.4070345 Gen0=14 Gen1=7 Gen2=1
and on .NET Core 2.0, results like:
Elapsed=00:00:06.4286604 Gen0=4 Gen1=1 Gen2=1
Networking
Networking is a big area of focus now, and likely will be even more so moving forward. A good amount of effort is being applied to optimizing and tuning the lower-levels of the networking stack, so that higher-level components can be built efficiently.
One such change that has a big impact is PR dotnet/corefx #15141. SocketAsyncEventArgs is at the center of a bunch of asynchronous operations on Socket, and it supports a synchronous completion model whereby asynchronous operations that actually complete synchronously can avoid costs associated with asynchronous completions. However, the implementation in .NET 4.7 only ever synchronously completes operations that fail; the aforementioned PR fixed the implementation to allow for synchronous completions of all async operations on sockets. The impact of this is very obvious in code like the following:
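The code isn't included in this excerpt; a sketch of the described setup, assuming the Task-returning SendAsync/ReceiveAsync extension methods (SocketTaskExtensions) are available on both runtimes being compared:

```csharp
using System;
using System.Diagnostics;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        using (var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp))
        using (var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp))
        {
            // Create a connected socket pair over loopback.
            listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
            listener.Listen(1);
            client.Connect(listener.LocalEndPoint);
            using (Socket server = listener.Accept())
            {
                RunAsync(client, server).GetAwaiter().GetResult();
            }
        }
    }

    static async Task RunAsync(Socket sender, Socket receiver)
    {
        var buffer = new ArraySegment<byte>(new byte[1]);
        int g0 = GC.CollectionCount(0), g1 = GC.CollectionCount(1), g2 = GC.CollectionCount(2);
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 1_000_000; i++)
        {
            // With data immediately available, most of these can complete synchronously.
            await sender.SendAsync(buffer, SocketFlags.None);
            await receiver.ReceiveAsync(buffer, SocketFlags.None);
        }
        Console.WriteLine($"Elapsed={sw.Elapsed} Gen0={GC.CollectionCount(0) - g0} Gen1={GC.CollectionCount(1) - g1} Gen2={GC.CollectionCount(2) - g2}");
    }
}
```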
This program creates two connected sockets, and then writes 1,000,000 times to one socket and receives on the other, in both cases using asynchronous methods but where the vast majority (if not all) of the operations will complete synchronously. On .NET 4.7 I see results like:
Elapsed=00:00:20.5272910 Gen0=42 Gen1=2 Gen2=0
whereas on .NET Core 2.0 with most of these operations able to complete synchronously, I see results instead like:
Elapsed=00:00:05.6197060 Gen0=0 Gen1=0 Gen2=0
Not only do such improvements accrue to components using sockets directly, but also to using sockets indirectly via higher-level components, and other PRs have resulted in additional performance increases in higher-level components, such as NetworkStream. For example, PR dotnet/corefx #16502 re-implemented Socket’s Task-based SendAsync and ReceiveAsync operations on top of SocketAsyncEventArgs and then allowed those to be used from NetworkStream.Read/WriteAsync, and PR dotnet/corefx #12664 added a specialized CopyToAsync override to support more efficiently reading the data from a NetworkStream and copying it out to some other stream. Those changes have a very measurable impact on NetworkStream throughput and allocations. Consider this example:
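The example isn't reproduced here; a sketch matching the description that follows: two connected sockets wrapped in NetworkStreams, a million 1K writes on one side, and a CopyToAsync draining the other:

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        using (var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp))
        using (var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp))
        {
            listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
            listener.Listen(1);
            client.Connect(listener.LocalEndPoint);
            using (Socket server = listener.Accept())
            {
                RunAsync(client, server).GetAwaiter().GetResult();
            }
        }
    }

    static async Task RunAsync(Socket clientSocket, Socket serverSocket)
    {
        using (var writes = new NetworkStream(clientSocket))
        using (var reads = new NetworkStream(serverSocket))
        {
            byte[] buffer = new byte[1024];
            int g0 = GC.CollectionCount(0), g1 = GC.CollectionCount(1), g2 = GC.CollectionCount(2);
            var sw = Stopwatch.StartNew();

            Task copy = reads.CopyToAsync(Stream.Null);
            for (int i = 0; i < 1_000_000; i++)
            {
                await writes.WriteAsync(buffer, 0, buffer.Length);
            }
            clientSocket.Shutdown(SocketShutdown.Send); // lets CopyToAsync observe end-of-stream
            await copy;

            Console.WriteLine($"Elapsed={sw.Elapsed} Gen0={GC.CollectionCount(0) - g0} Gen1={GC.CollectionCount(1) - g1} Gen2={GC.CollectionCount(2) - g2}");
        }
    }
}
```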
As with the previous Sockets one, we’re creating two connected sockets. We’re then wrapping those in NetworkStreams. On one of the streams we write 1K of data a million times, and on the other stream we read out all of its data via a CopyToAsync operation. On .NET 4.7, I get output like the following:
Elapsed=00:00:24.7827947 Gen0=220 Gen1=3 Gen2=0
whereas on .NET Core 2.0, the time is cut by 5x, and garbage collections are reduced effectively to zero:
Elapsed=00:00:04.9452962 Gen0=0 Gen1=0 Gen2=0
Further optimizations have gone into other networking-related components. For example, SslStream is often wrapped around a NetworkStream in order to add SSL to a connection. We can see the impact of these changes as well as others in an example like the following, which just adds usage of SslStream on top of the previous NetworkStream example:
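The example isn't shown in this excerpt; a simplified sketch layering SslStream onto the socket-pair pattern. The "testcert.pfx" path and password are placeholders (the server side of the handshake needs a certificate you supply), the client callback accepts any certificate, and ending the copy via socket shutdown is a simplification:

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Net.Security;
using System.Net.Sockets;
using System.Security.Cryptography.X509Certificates;
using System.Threading.Tasks;

class Program
{
    // Assumes clientSocket/serverSocket are an already-connected pair,
    // created as in the previous NetworkStream example.
    static async Task RunAsync(Socket clientSocket, Socket serverSocket)
    {
        var cert = new X509Certificate2("testcert.pfx", "password"); // placeholder cert
        using (var clientStream = new SslStream(new NetworkStream(clientSocket), false, delegate { return true; }))
        using (var serverStream = new SslStream(new NetworkStream(serverSocket)))
        {
            await Task.WhenAll(
                clientStream.AuthenticateAsClientAsync("localhost"),
                serverStream.AuthenticateAsServerAsync(cert));

            byte[] buffer = new byte[1024];
            int g0 = GC.CollectionCount(0), g1 = GC.CollectionCount(1), g2 = GC.CollectionCount(2);
            var sw = Stopwatch.StartNew();

            Task copy = serverStream.CopyToAsync(Stream.Null);
            for (int i = 0; i < 1_000_000; i++)
            {
                await clientStream.WriteAsync(buffer, 0, buffer.Length);
            }
            clientSocket.Shutdown(SocketShutdown.Send);
            await copy;

            Console.WriteLine($"Elapsed={sw.Elapsed} Gen0={GC.CollectionCount(0) - g0} Gen1={GC.CollectionCount(1) - g1} Gen2={GC.CollectionCount(2) - g2}");
        }
    }
}
```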
On .NET 4.7, I get results like the following:
Elapsed=00:00:21.1171962 Gen0=470 Gen1=3 Gen2=1
.NET Core 2.0 includes changes from PRs like dotnet/corefx #12935 and dotnet/corefx #13274, both of which together significantly reduce the allocations involved in using SslStream. When running the same code on .NET Core 2.0, I get results like the following:
Elapsed=00:00:05.6456073 Gen0=74 Gen1=0 Gen2=0
That’s 85% of the garbage collections removed!
Concurrency
Not to be left out, lots of improvements have gone into infrastructure and primitives related to concurrency and parallelism.
One of the key focuses here has been the ThreadPool, which is at the heart of the execution of many .NET apps. For example, PR dotnet/coreclr #3157 reduced the sizes of some of the objects involved in QueueUserWorkItem, and PR dotnet/coreclr #9234 used the previously mentioned rewrite of ConcurrentQueue<T> to replace the global queue of the ThreadPool with one that involves less synchronization and less allocation. The net result is visible in an example like the following:
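The example isn't reproduced here; a sketch that queues a large batch of trivial work items and waits for them all to complete (the item count is illustrative):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class Program
{
    static void Main()
    {
        const int Items = 10_000_000;
        for (int trial = 0; trial < 5; trial++)
        {
            int remaining = Items;
            using (var done = new ManualResetEventSlim())
            {
                int g0 = GC.CollectionCount(0), g1 = GC.CollectionCount(1), g2 = GC.CollectionCount(2);
                var sw = Stopwatch.StartNew();
                for (int i = 0; i < Items; i++)
                {
                    ThreadPool.QueueUserWorkItem(_ =>
                    {
                        // Signal once the last work item has run.
                        if (Interlocked.Decrement(ref remaining) == 0) done.Set();
                    });
                }
                done.Wait();
                Console.WriteLine($"Elapsed={sw.Elapsed} Gen0={GC.CollectionCount(0) - g0} Gen1={GC.CollectionCount(1) - g1} Gen2={GC.CollectionCount(2) - g2}");
            }
        }
    }
}
```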
On .NET 4.7, I see results like the following:
Elapsed=00:00:03.6263995 Gen0=225 Gen1=51 Gen2=16
Elapsed=00:00:03.6304345 Gen0=231 Gen1=62 Gen2=17
Elapsed=00:00:03.6142323 Gen0=225 Gen1=53 Gen2=16
Elapsed=00:00:03.6565384 Gen0=232 Gen1=62 Gen2=16
Elapsed=00:00:03.5999892 Gen0=228 Gen1=62 Gen2=17
whereas on .NET Core 2.0, I see results like the following:
Elapsed=00:00:02.1797508 Gen0=153 Gen1=0 Gen2=0
Elapsed=00:00:02.1188833 Gen0=154 Gen1=0 Gen2=0
Elapsed=00:00:02.1000003 Gen0=153 Gen1=0 Gen2=0
Elapsed=00:00:02.1024852 Gen0=153 Gen1=0 Gen2=0
Elapsed=00:00:02.1044461 Gen0=154 Gen1=1 Gen2=0
That’s both a huge improvement in throughput and a huge reduction in garbage collections for such a core component.
Synchronization primitives have also gotten a boost in .NET Core. For example, SpinLock is often used by low-level concurrent code trying either to avoid allocating lock objects or minimize the time it takes to acquire a rarely contended lock, and its TryEnter method is often called with a value of 0 in order to only take the lock if it can be taken immediately, or else fail immediately if it can’t, without any spinning. PR dotnet/coreclr #6952 improved that fail fast path, as is evident in the following test:
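The test isn't included in this excerpt; a sketch in which the lock is held so that every TryEnter(0) takes the fail-fast path described (iteration count is illustrative):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class Program
{
    static void Main()
    {
        // Owner tracking disabled so a re-entrant TryEnter simply fails
        // rather than throwing.
        var spinLock = new SpinLock(enableThreadOwnerTracking: false);
        bool held = false;
        spinLock.Enter(ref held); // hold the lock so every TryEnter(0) fails fast

        for (int trial = 0; trial < 5; trial++)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 100_000_000; i++)
            {
                bool taken = false;
                spinLock.TryEnter(0, ref taken); // taken stays false
            }
            Console.WriteLine(sw.Elapsed);
        }
    }
}
```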
On .NET 4.7, I get results like:
00:00:02.3276463
00:00:02.3174042
00:00:02.3022212
00:00:02.3015542
00:00:02.2974777
whereas on .NET Core 2.0, I get results like:
00:00:00.3915327
00:00:00.3953084
00:00:00.3875121
00:00:00.3980009
00:00:00.3886977
Such an ~6x difference in throughput can make a significant impact on hot paths that exercise such locks.
That’s just one example of many. Another is around Lazy<T>, which was rewritten in PR dotnet/coreclr #8963 by manofstick to improve the efficiency of accessing an already initialized Lazy<T> (while the performance of accessing a Lazy<T> for the first time matters, the expectation is that it’s accessed many times after that, and thus we want to minimize the cost of those subsequent accesses). The effect is visible in a small example like the following:
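The example isn't shown above; a sketch that initializes a Lazy<int> once and then hammers the already-initialized Value path (the iteration count is illustrative):

```csharp
using System;
using System.Diagnostics;

class Program
{
    static void Main()
    {
        var lazy = new Lazy<int>(() => 42);
        int ignored = lazy.Value; // force initialization up front
        long total = 0;

        for (int trial = 0; trial < 5; trial++)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 1_000_000_000; i++)
            {
                total += lazy.Value; // already-initialized fast path
            }
            Console.WriteLine(sw.Elapsed);
        }
        GC.KeepAlive(total);
    }
}
```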
On .NET 4.7, I get results like:
00:00:02.6769712
00:00:02.6789140
00:00:02.6535493
00:00:02.6911146
00:00:02.7253927
whereas on .NET Core 2.0, I get results like:
00:00:00.5278348
00:00:00.5594950
00:00:00.5458245
00:00:00.5381743
00:00:00.5502970
for an ~5x increase in throughput.
What’s Next
As I noted earlier, these are just a few of the many performance-related improvements that have gone into .NET Core. Search for “perf” or “performance” in pull requests in the dotnet/corefx and dotnet/coreclr repos, and you’ll find close to a thousand merged PRs; some of them are big and impactful on their own, while others whittle away at costs across the libraries and runtime, changes that add up to applications running faster on .NET Core. Hopefully subsequent blog posts will highlight additional performance improvements, including those in the runtime, of which there have been many but which I haven’t covered here.
We’re far from done, though. Many of the performance-related changes up until this point have been mostly ad-hoc, opportunistic changes, or those driven by specific needs that resulted from profiling specific higher-level applications and scenarios. Many have also come from the community, with developers everywhere finding and fixing issues important to them. Moving forward, performance will be a bigger focus, both in terms of adding additional performance-focused APIs (you can see experimentation with such APIs in the dotnet/corefxlab repo) and in terms of improving the performance of the existing libraries.
To me, though, the most exciting part is this: you can help make all of this even better. Throughout this post I highlighted some of the many great contributions from the community, and I highly encourage everyone reading this to dig in to the .NET Core codebase, find bottlenecks impacting your own apps and libraries, and submit PRs to fix them. Rather than stumbling upon a performance issue and working around it in your app, fix it for you and everyone else to consume. We are all very excited to work with you on bringing such improvements into the code base, and we hope to see all of you involved in the various .NET Core repos.

Great write up Stephen. The examples go on and on!
Thanks, Peter.
Will all the improvements be merged into .NET framework?
See second paragraph …
“We expect that many of these improvements will be brought to the .NET Framework over the next few releases, too”
We need to validate that these changes don’t break .NET Framework compatibility, but other than that, yes.
That’s very good to hear, but if these changes are included in the .NET Standard surface area, shouldn’t they have already been determined not to break compatibility?
😉
I’m guessing the mundane answer to that is just that porting the changes over requires putting them through some sort of separate QA and change approval organisation. Only to be expected, I suppose.
Oh, and thanks for yet another great, detailed article from Stephen. Looking forward to getting these improvements on the desktop!
.NET Standard doesn’t make a statement on performance. You are right that it would be odd if performance changes in .NET Core were breaking w/rt .NET Framework. That said, breaks in .NET Framework that only affect 0.001% of customers are still a major problem due to the .NET Framework deployment model (in-place updates and Windows Update delivery). So, there is a class of changes that we’ll take in .NET Core and not .NET Framework. The .NET Core side-by-side deployment model gives us a greater degree of freedom.
A great and very detailed collection of performance improvements that made it into .NET Core over the years. Any chance these improvements will be ported to the .NET Framework in the future?
Yes. See answer to the same question above. The timing of your question and my answer hit a race condition.
I see what you did there. Nice work and thanks for sharing the results! Exciting stuff.
Does .NET Core 2.0 shine equally well vs. Mono?
Very likely. We didn’t test that. Anyone is welcome to try this experiment and report back findings.
Folks are also working to merge more and more of corefx into Mono (and vice versa where applicable), such that wherever possible we use the same codebase across them. Then improvements to one accrue to the other.
Just as a side remark, there’s a nice tool Benchmark.NET which not only displays the benchmarks neatly but also summarizes the average/min/max and shows graphs so readers don’t have to skim 5 lines of 00:00….
Good point. This is called out at the start.
It would be great for someone to take these samples and make them work with Benchmark.NET. If someone did that, I’d add the link to this post, displayed prominently.
First, this is a really great article and another gem by Mr. Toub. Also always glad to see Mr. Lander really owning the comments, too. I read the disclaimer against Benchmark.NET and was really surprised, as it’s pretty common these days in .NET articles that include benchmarks. It doesn’t seem right without it. BDN is also part of the .NET Foundation, so that adds to the surprise.
I feel bad for Stephen saying these things but, the reward for hard work is more hard work. 🙂 If my own personal development environment weren’t such a disaster (virtualizing everything in Hyper-V and standing up VS2017, FINALLY!) I would be all over getting you some BDN magic, for sure. Maybe if no one else has done it in the next few days or so when I am back on my feet.
No need to feel bad. Thanks for the feedback. I actually started with Benchmark.NET, since I often use it locally, and then made the explicit choice to switch to console apps for all of the examples, as I felt it made for a better reading/trying-it-out experience. That’s why I added the note at the beginning, highlighting that a tool like Benchmark.NET (or other benchmarking tools) is the way to go for actually doing performance work. For the purposes of this post, I just wanted simple snippets, where you didn’t need to acquire anything beyond the runtime/compiler, didn’t need to understand what the boilerplate meant, etc. I’ll keep the feedback in mind for the future, though; maybe I put too much value in those criteria.
I’m doing it right now, will publish a post in a few days.
Thanks! Will happily link / tweet to your work.
This first part is ready: http://aakinshin.net/blog/post/stephen-toub-benchmarks-part1/
Hooray!
It is a shame that performance tests are always left as an afterthought. Like test-driven development, performance tests should be written at the start. More importantly there should be a *standard* set of performance tests defined against interfaces, which would make it trivial for implementations to be run against each other to see which one is faster. In other words, performance tests should be written once, in such a way that implementers can immediately test their implementations and easily compare their results against others.
Great article, and thank you for sharing these improvements. I was looking for similar information for a while, but could not find anything tangible. I was hoping to see some architectural (performance) changes that could have helped reduce memory pressure and/or improve CPU utilization, etc.; unfortunately it seems like there are none. It is sad to see huge gaps in performance today after 10-30 years of existence (depending on the DLL/method), and that these gaps are being addressed by the community and not Microsoft. IMO it just proves that performance was never important and always an afterthought.
I’m excited about and looking forward to new features (like Span, ref fields, etc.), but these are just bolted-on features. It would be nice to see something done under the hood. I’m glad to hear that performance may now become as important as language changes and new libs.
I was hoping .NET Core would become a smaller and significantly faster successor to the full .NET Framework; instead it seems like the focus has shifted to just making it cross-platform and highly compatible with the full .NET Framework, with all the legacy code retained.
> these gaps are being addressed by the community and not Microsoft
What led you to this conclusion? It is false. There were many performance improvements to choose from. Stephen did a great job of picking a selection of both Microsoft and community contributions. He could have picked all his (of which there are MANY) and that would have been fine but he didn’t do that.
> I was hoping to see some architectural (performance) changes that could have helped reduce memory pressure and/or improve cpu utilization etc, unfortunately it seems like there are none
Again, this is a selection of improvements. It’s possible that the title of the post was too broad (in hindsight). We have other performance investments that are more architectural in nature but they are not in this post.
> I’m glad to hear that now performance may become as important as language changes and new libs.
Performance has always been at least as important as language changes and libs, particularly for the runtime. For example, our GC investments over the last 10+ years have all been performance-oriented. Little secret … our GC changes have resulted in very non-trivial reductions in the cost of running Microsoft services. We have similar feedback from customers, too. This post demonstrates that we haven’t done enough performance work in the BCL. We intend to bring much of that value to the .NET Framework.
> I was hoping .net core will become smaller and significantly faster successor (to full .Net), instead it seems like focus has shifted to just make it cross platform highly compatible with full .Net and all the legacy code retained.
Looks like you read a different blog post. The point of this post was to demonstrate the opposite.
TLDR; version:
“Faster and Efficient Collections, LINQ, Compression, Cryptography, Math, Serialization, Text Processing, File IO, Networking, Concurrency and MORE”.
Kudos to the Community and .NET engineering team!
Thanks, Carlos.
Love it. I developed a hobby of watching perf PRs being merged into Kestrel. Not sure why, but it’s very pleasant to see 30%-3000% perf increases, a sort of meditation. It also makes obvious how much work is required to write optimized code.
Thanks to MS and the community. Cheers!
Thanks, Dmitry.
Wow. Quite the tome! This must’ve taken you forever to write, Stephen!
It’s amazing all these perf improvements were able to be achieved, especially in such core parts of the BCL! (Who would’ve thought IndexOf had room for further optimization?) Great job to the team and all the contributors!
Thanks, MgSm88. Yeah, it took a while. The hardest part was forcing myself not to spend a lot more time covering the myriad of other improvements I didn’t mention. 🙂
While I don’t expect to ever see WinForms and WPF become a part of .NET Core, I really would like to see them extracted from .NET Framework so they’ll work WITH Core when the target is known to be Windows. Ideally, the .NET Framework should simply be the Windows build of .NET Core along with whatever additional modules are needed to fill out the non-portable API surface area of the Framework.
Great blog, Mr. Toub! This work by the MS teams and the community is creating a bright story for .NET Core. I appreciate the work you have also done as I try to keep up with your GitHub progress… along with asking myself when you and others (@davidfowl, etc.) sleep?!
If I could have one request… when the new performance APIs like Span (and others from CoreFxLab) move into a released version of .NET Core/Standard, would it be possible to release a development series (e.g. Channel9, tutorials, etc.) explaining how/when to use the APIs effectively? I have current projects (and will have more when better tooling/platform support comes for microservices and IoT) that could benefit from using the performance APIs… just need a good overview of the scenarios for when to use them (or not).
As always, thanks again for the work!
Sleep? We love .NET and like to spend most of our time working on it!
I’ll ask Immo (terrajobst) to do a blog/video series on CoreFxLab tech as it moves to CoreFX (or other repos). Good suggestion!
Thanks, shaggygi.
Awesome work and a great set of improvements. It’s nice to see everything is getting attention. These improvements will really add up over a whole project. Nice to see Enumerable.Concat in there too.
Thanks, Daniel. And yeah, a lot of work has gone into LINQ; it’s also one of the areas where most of the improvements were driven and implemented by community members. Awesome.
I’m really impressed by the improvements in SocketAsyncEventArgs.
But what is interesting is that NetworkStream is now almost a zero-cost abstraction!
Thanks, OmariO. There’s still more opportunity for improvement in System.Net.Sockets, but we’ve collectively made great progress, and I think the results in .NET Core 2.0 will be very welcome to a large number of apps.
One big heap oversight: Stopwatch can be made into a value type, which would also let you clone it cheaply. Diagnostic code can have a little or a lot of these.
Asynchronicity, thanks for the feedback. Changing a reference type to a value type is a breaking change, so that’s not something I’d expect to see happen. There is also value (no pun intended) in having it be a class, in that it allows you to pass the instance around without concern for whether you’re accidentally copying and then mutating a copy rather than the original. But I agree it would be nice if there were a way to have similar functionality but without the allocation. If you have suggestions for how to best design that, please open an issue in the corefx repo. For example, Stopwatch currently provides a static GetTimestamp method, but interpreting the result of that isn’t super easy. We could add a method like `public static TimeSpan Stopwatch.GetElapsed(long timestamp)`, which would let you effectively start a stopwatch by calling GetTimestamp and stop it by passing the result to GetElapsed in order to get the elapsed time. That’s not necessarily the right answer, just an example of the kind of thing that could be done. We’ll look forward to hearing your suggestions in the corefx repo.
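For readers who want to see the shape of the idea, here is a minimal sketch of the pattern Stephen describes. Note that `GetElapsed` is a hypothetical helper that did not exist at the time of this exchange; only `Stopwatch.GetTimestamp` and `Stopwatch.Frequency` are existing APIs:

```csharp
using System;
using System.Diagnostics;

static class StopwatchSketch
{
    // Hypothetical helper: converts a starting timestamp captured via
    // Stopwatch.GetTimestamp() into the elapsed time since then,
    // without ever allocating a Stopwatch instance.
    public static TimeSpan GetElapsed(long startTimestamp)
    {
        long ticks = Stopwatch.GetTimestamp() - startTimestamp;
        // Stopwatch.Frequency is timestamp ticks per second.
        return TimeSpan.FromSeconds(ticks / (double)Stopwatch.Frequency);
    }
}

class Program
{
    static void Main()
    {
        long start = Stopwatch.GetTimestamp(); // "start" the stopwatch
        // ... code being measured ...
        TimeSpan elapsed = StopwatchSketch.GetElapsed(start); // "stop" it
        Console.WriteLine(elapsed);
    }
}
```

As Stephen notes, this is just one possible design; the trade-off versus a `Stopwatch` class is that a raw `long` timestamp is allocation-free but carries no protection against misuse.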
Because the ticking of the stopwatch can potentially slow things down (fractionally), it’s generally considered better to use DateTime.Now before and after the benchmark and get the difference between the two as a TimeSpan 🙂
Amazing stuff! Hopefully this will be added to the .NET Framework aswell soon 🙂
You should also add a benchmark comparing Single and Last. I think that Single and Last have very bad implementations in the standard .NET Framework.
Last has had a lot of work done on it for corefx as well; just one of the many sets of improvements I didn’t mention. That said, if you see ways these can be improved further, please open issues / submit PRs / etc. in the corefx repo. We look forward to your contributions!
You forgot about HashSet copy constructor 🙂
Didn’t forget; just one of the many improvements that’s been made and I didn’t have room to cover. 🙂
This is an excellent article and worthy of praise for the work with the community.
Any chance of a similar piece about new/upcoming changes to IL generation and execution?
Thanks, Alexandre. I believe several folks are planning (or at least intent on writing) follow-up articles about code generation, runtime improvements, etc., as I mainly focused on library improvements.
Wow, impressive improvements. Shame work’s still on the .NET Framework, and most of my hobbyist time is taken up by Unity.
Great news, I’m a big fan of C#! But how does the performance compare to C/C++ now?
DigitalNutcase, it really depends on the workload. For raw compute / number crunching, it’s difficult to do better than C/C++. For I/O heavy applications, well-written C# code can compete very well.
great job!
Question: I’ve read that 64-bit vs. 32-bit performance is very different, and that people are encouraged to stick to 32-bit… which is, IMO, alarming considering 64-bit computing is the future, and clearly few platforms ask developers for such compromises.
Thanks.
Thanks, Lsobrado. With regards to 32-bit vs 64-bit, the big difference is that pointers/references in 64-bit are twice as big, so you end up using more memory.
Good stuff! As many of these are algorithm / code changes, is there any chance on seeing them being ported back to the main .NET Framework, too?
Wow! Great improvements! One question… In the past it was known that exception throwing and “try/catch/finally” constructs were to be avoided. How is it at this moment? Has it improved in .NET Core, or is it still as bad in some scenarios as was said a while ago?
Thanks!
AlexDRL, thanks. Exception throwing is still relatively expensive.
Does anyone else suspect the only things that have actually been changed are Stopwatch and GC.CollectionCount? 😉
This is seriously great stuff, and a testament to the power of proper open source. Big thanks to Microsoft and especially to all the contributors.
Ooh, you caught me 😛 Thanks, Mark. 🙂
Possible correction:
“That means that concatenating multiple enumerables grows exponentially rather than linearly with the number of enumerables involved”
I think it should become:
“That means that concatenating multiple enumerables grows quadratically rather than linearly with the number of enumerables involved”
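For context, the quadratic behavior under discussion comes from naively chaining Concat calls: each element of the innermost source gets pulled through every enumerator wrapper in the chain. A rough illustration of the shape of the problem (the element counts and loop bound here are arbitrary):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        // Build a left-leaning chain of 1000 Concat wrappers around a
        // single-element source. With a naive Concat implementation,
        // enumerating the result pushes each element through every
        // wrapper, for O(N^2) total MoveNext calls across N concats;
        // .NET Core's optimized Concat flattens the chain instead.
        IEnumerable<int> source = Enumerable.Range(0, 1);
        for (int i = 0; i < 1000; i++)
        {
            source = source.Concat(new[] { i });
        }
        Console.WriteLine(source.Count()); // 1001
    }
}
```

With N concatenations the total work grows with N², i.e. quadratically, not exponentially, which is what the proposed correction is getting at.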
This is fantastic. Is there a step-by-step guide on how one would go about obtaining, changing, compiling, testing, and finally submitting a code change? I’m an experienced developer, but I wouldn’t really know where to begin.
thanks a lot
Martin, great that you want to get involved. “Getting Started” docs are available at https://github.com/dotnet/corefx/wiki/New-contributor-Docs#contributing-guide
So great. Thanks for these tests.
Thanks to Microsoft for open-sourcing .NET Core, and to this whole community for getting involved so that users like us can enjoy these improvements!!
Fantastic write-up. I hope to see something similar posted on the compiler side at some point.
Our base libraries include highly-parallelized computational code, encryption code, compression code, and so on that would benefit heavily from these improvements. But our UI built on top of that is based on WinForms. So, is it possible to compile those base libraries on .NET Core 2.0 (to get those benefits) but still use them in a .NET Framework app?
> is it possible to compile those base libraries on .NET Core 2.0 (to get those benefits) but still use them in a .NET Framework app?
Not to my knowledge.
Now if they could only get it to work with VB and the other .NET languages. Otherwise, ho hum. Many native compilers outperform this anyway.
Great writeup, and kudos to OSS for helping make .NET Core better.
Question: Do we have any timeline for when these .NET Core perf improvements will trickle back to the full .NET Framework? I suppose someone is helping merge these back already. Desperate to see these improvements come back to the enterprise (especially those who use .NET for GUI dev).
Thanks for the good information on .NET Core.