Performance Quiz #3 — Solution

And now the exciting solution to Performance Quiz #3!

There’s actually an enourmous amount of philosophy that can be learned from this little example in terms of good habits but I’d like to anchor this in the numbers so let’s begin with those.

Time Analysis

I did 4 runs with everything cached, and jitting everything, this actually gives the rad slower parser an advantage in terms of reported time because the input file isn’t so terribly large that load time can be ignored.  These results were on my own desktop machine, I’m sure you’ll get different numbers if you try the test but the ratio should be similar.

Rad Parser:     1.092s
Classic Parser: 0.149s
Ratio:          7.366

So the RAD parser is over 7 times slower and if I was to subtract out the cost of starting an empty program from both (about 50ms) then you’d get more like a factor of 10 for just the code we wanted to test.

Space Analysis

These statistics were gathered by running the programs under CLRProfiler which tracks every managed allocation.

Rad Parser

Total Allocations 48307 Objects 2357857 Bytes

Top Ten Allocated Types

   Count    Bytes Type
   42813  1408708 System.String
    2514   713712 System.Int32 []
    2515   200608 System.String []
      14    10564 System.Object []
       8     6183 System.Byte []
      25     5690 System.Char []
     171     3420 System.Text.StringBuilder
      15     2160 System.Collections.Hashtable.bucket []
      15      780 System.Collections.Hashtable
      34      680 System.RuntimeType

Classic Parser

Total Allocations 539 Objects 46454 Bytes

Top Ten Allocated Types

   Count    Bytes Type
     253    15262 System.String
      14    10564 System.Object []
       5     4864 System.Byte []
      14     3712 System.Int32 []
      21     3456 System.Char []
      14     2016 System.Collections.Hashtable.bucket []
      14      728 System.Collections.Hashtable
      14      564 System.String []
       6      384 System.Globalization.CultureInfo
      16      320 System.RuntimeType

That result is simply stunning, it’s 50 times the number of allocated bytes, most of which obviously were just thrown away.  Lucky for us the GC was able to reguarly reclaim that memory so the impact on total workingset was considerably less (it should end up being near the size of your L2 cache because of GC hueristics).  Nonetheless allocating all that memory was not a good thing for performance.

Let’s look at the allocations in a bit more detail.

I extracted the top 10 allocating stacks from both runs out of the CLRProfiler log and I’ve got them here in text format. 

Let’s do the classic parser first this time because we can dispense with it fairly quickly.  I’m only going to show the first two of the top ten stacks.

Stack 1 allocates 7590 bytes
       6     6240 System.Object []
       1       72 System.OutOfMemoryException
       1       72 System.StackOverflowException
       1       72 System.ExecutionEngineException
       1       72 System.Threading.ThreadAbortException
       1       20 System.SharedStatics
       1      108 System.AppDomain
      11      782 System.String
       1       20 System.String []
       4       80 System.Security.Permissions.HostProtectionAttribute
       1       24 System.Reflection.Assembly
       1       28 Microsoft.Win32.Win32Native.InputRecord

Stack 2 allocates 4702 bytes
ClassicParser::Main static void (String[])
       1       76 System.IO.FileStream
       1     4108 System.Byte []
       6      506 System.String
       1       12 System.Int32

Now what’s interesting here is the main function is only stack #2, you can see the byte array there and a few string literals, on boxed integer, and of course the file stream itself.  Some of those must be coming from inlining because they don’t appear directly in the source code for main (might make another interesting blog entry to run all these down actually).  The biggest source of allocations had no managed stack associated with them at all — these are objects created by the runtime at startup time to set up the enviroment, open standard file streams, and get that first appdomain running.  There are actually more of those for even small numbers of bytes further down and all of those actually appear in the RAD parser as well.

So what’s interesting here is that of the 46454 bytes allocated in total, our parser is directly responsible for a mere 4702 of them.  Wow.

Let’s look now at the RAD Parser stacks, there’s a lot more meat on those ones.

Stack 1 allocates 985000 bytes
RadParser::Main static void (String[])
System.String::Split String[] (wchar[] int32 System.StringSplitOptions)
System.String::InternalSplitKeepEmptyEntries String[] (int32[] int32[] int32 int32)
System.String::Substring String (int32 int32)
   40000   985000 System.String

Stack 2 allocates 710034 bytes
RadParser::Main static void (String[])
System.String::Split String[] (wchar[] int32 System.StringSplitOptions)
    2500   710000 System.Int32 []
       1       34 System.String

Stack 3 allocates 359436 bytes
RadParser::Main static void (String[])
System.IO.StreamReader::ReadLine String ()
System.String::CtorCharArrayStartLength String (wchar[] int32 int32)
    2334   359436 System.String

Stack 4 allocates 200000 bytes
RadParser::Main static void (String[])
System.String::Split String[] (wchar[] int32 System.StringSplitOptions)
System.String::InternalSplitKeepEmptyEntries String[] (int32[] int32[] int32 int32)
    2500   200000 System.String []

Stack 5 allocates 41208 bytes
RadParser::Main static void (String[])
System.IO.StreamReader::ReadLine String ()
System.Text.StringBuilder::.ctor void (String int32)
System.Text.StringBuilder::.ctor void (String int32 int32 int32)
System.String::GetStringForStringBuilder static String (String int32 int32 int32)
     166    41208 System.String

Stack 6 allocates 7590 bytes
       6     6240 System.Object []
       1       72 System.OutOfMemoryException
       1       72 System.StackOverflowException
       1       72 System.ExecutionEngineException
       1       72 System.Threading.ThreadAbortException
       1       20 System.SharedStatics
       1      108 System.AppDomain
      11      782 System.String
       1       20 System.String []
       4       80 System.Security.Permissions.HostProtectionAttribute
       1       24 System.Reflection.Assembly
       1       28 Microsoft.Win32.Win32Native.InputRecord

All I can say is wow.  We have to get down to stack 6 before we see the 7590 bytes of root allocations.  You can see there’s significant cost associated with splitting the strings — making the arrays and the string pieces.  Another hidden cost leaps right out here as well — since we don’t know in advance how many string parts there are there’s a pass over the string to find the delimeter offsets, those are stored in an temporary integer array.

Cost Per Byte

Another way of thinking about these parsers is to consider how many times any byte of the input file has to be touched, you can see in the RAD parser we have to do newline parsing, and then scanning for delimeters and splitting, and then numeric conversions.  So it’s readily apparent that there will be many memory touches for each byte of the file.  Consider just this counting the CRLF there are 70 bytes per line of input and 2500 such lines, that’s 175,000 bytes to read.  But we allocated over 13 times that in bytes to process the file the rad way.

The classic parser is very economical by comparison.  Even if the data had not been in the disk cache we would only add a few milliseconds of overhead to both algorithms for reading such a small file.  The memory management costs of the rad way are easily greater than any reasonable I/O cost.

Warnings and Good Practice

So should you all abandon your RAD practices and go study Knuth’s parsing techniques?  Well… I dunno about that.  But there are definitely some lessons here.  I think my message is this:  If you choose the RAD parser you have made your coding that much easier but you still have a responsibility to measure the quality of your solution against some test data and see what your performance is going to be like.  Every professional piece of software should have performance goals, even if they are very simple ones that are very easy to meet. Sometimes goals like “Yeah, 10 seconds is fine” is all it takes, other times we have to have strict time and space budgets.  If you were in “Yeah 10 seconds is fine” mode then I think the RAD parser is a fine choice.

And here’s the 2nd half of the message.  If you decide to do something fancier than the RAD way, you can easily find yourself in a world of hurt on correctness issues.  Even this fairly simple parsing problem has much potential for mistakes – I put in error checks for three cases but did you notice that there’s no error check for an input token that is just a “-” – some people noticed…  It’s easy to make mistakes when you are reaching for sharper tools, which is all the more reason to understand your performance goals so that you can choose a plan that is going to economically meet those goals.  Is the “classic” parser approach doomed to be full of errors?  No I think not. We just have to be more careful, there are well understood techniques for creating provably correct parsers but we’ll obviously be investing more in our algorithm to get there.

So there’s your balance – don’t blindly choose the easiest approach, it may be much too awful.  Don’t naively choose a complex approach, the effort required to achieve correctness may be daunting.  

Set quantitative goals so that you can objectively make the necessary plans to meet them.

Comments (10)

  1. Anthony Moore says:

    I am a colleague of Rico, and I feel I should comment on these results. My team owns the RAD IO, String and number parsing routines compared here.

    This significant performance difference might make you wonder if you should be using these APIs and whether there will always be such differences between roll-your-own solutions and general-purpose ones. Rico’s conclusion points at some of the reasons to be careful here. I should also comment on some future plans and to what extent it will be possible to have the best of both worlds.

    Some considerations in interpreting these numbers:

    – A notable result is that a 50X increase in allocations results in only a 7X increase in execution team. The CLR’s GC is incredibly efficient at allocating and cleaning up small, short-lived objects.

    – It is kind of apples and oranges to compare RAD Object Oriented API performance with C/C++ style coding using buffers. Environments like .NET, Java, Jscript, VB6 and Smalltalk all allocate a lot more objects than buffer-based C and C++. Most developers find that the productivity benefits outweigh the specific performance and memory costs most of the time.

    – As the example shows, you sacrifice productivity, readability, correctness, features and flexibility by going down this path. As an example, none of these routines are globalization-aware. As another, it would be expensive to switch to parsing types other than Integer if you stick with this approach.

    – Why would we ship a product with such apparent performance problems? This example is contrived in that the parse is doing no actual work other than extracting values. In most real scenarios the “actual work” tends to outweigh overhead like this, and these APIs do not show up high in performance profiles. Another factor is that String and IO operations tend to be significantly faster than operations like database access and network access, and so they don’t often create performance bottlenecks.

    That being said, there may be cases were you do need to re-implement .NET Framework APIs to tailor them to a specific purpose that is a bottleneck for you. A key point is that we really want YOU to let us know when you need to do this, so we can allocate time to making the APIs faster for everyone in future.

    Future Plans

    We would like to get the point where we can provide RAD alternatives such that you could not easily re-implement a better one for performance reasons. Roughly speaking, we would like to provide APIs that perform within 2X of these hand-rolled, special cased versions. Plans we have for future versions:

    1) Improving the raw performance of integer parsing but special casing the code, which is currently using a code path shared with Double and Decimal parsing, which is a lot more complicated.

    2) Providing a way to split a String without needing to create any temporary objects. This is possible if a new splitting API is introduced and some new abstractions.

    3) Providing a way to parse the base data types from a sub-string, or from a part of a Char Array.

    However, we really need as much input from YOU as possible. Feedback in blog comments is fine, but the recommended way is to go through the official feedback mechanism:

    General Guidance

    – Use most RAD APIs as a starting point as the situation requires.

    – Use some RAD APIs with care. It is easy to abuse an API like String.Split where you don’t have a need for all the intermediate strings, for example.

    – Just because you can make an API faster does not necessarily mean you should. It’s a cost/benefit situation.

    – If you need to rewrite a standard API because it is a real performance bottleneck for you, please tell us so we can make the APIs better.

    Thank You.

  2. Anthony Moore says:

    Furthermore, I should probably go into more detail about how we approach API performance.

    During initial development, we try to make a reasonable trade-off between correctness, performance, security and maintainability. Security is particularly relevant here, as there is zero tolerance for errors that could expose security holes in the platform. As an illustration of the trade-offs, we almost never use unsafe code during initial development, because the increased risk of introducing a security error does is not usually justified by the small performance benefits you can get.

    There can be interesting correctness, maintainability and code size trade-offs also. As an example, most number parsing and formatting code is shared for all numeric data types. We could make optimizations for simple cases of integer parsing or where invariant culture is used, but this would likely require a significant amount of additional code.

    Thus, it is not the case that we make every API absolutely as fast as we can make it. We still try hard to make them very fast. However, we focus our time and resources to make the most important APIs as a fast as possible.

    For APIs at the base of the platform, we rely on partner scenarios and customer scenarios to drive which APIs to optimize further. Thurs, there are some APIs such as String.CompareOrdinal, the Hashtable indexer and StringBuilder.Append that were very highly optimized in V1.0 because of how frequently they appeared in profiling logs for key scenarios. Unlike some of the APIs mentioned in Rico’s example, it would be extremely difficult to create a roll-your-own version of these that would be much faster than the ones in the BCL. For some primitive operations like for-each on Array and String.Length, it would be nearly impossible because they are special-cased by the JIT.

    In some cases the performance is a consequence of the API design. In this example, String.Split and Int32.Parse force you into creating sub-strings. That does not mean that there is anything wrong with these, as it is the simplest abstraction to deal with. However, if there is significant customer or partner demand, we may need to create optional versions of these APIs that are a little more complicated to use or may have fewer features, but can go faster or eliminate the need for intermediate string objects. Having additional ways of doing the same thing makes the object model more complex and makes the libraries bigger, so you need to really sure such new APIs are justified.

    This is why your feedback on which APIs to duplicate or tune is very important. Please let us know:

  3. Terry Russell says:

    I consider myself fortunate, in that my first job was with Intel writing a profiler, so I’ve always thought in terms of performance. The truth is most developers don’t, and even if they wanted to, few companies encourage it. The pressure is always to write "code faster", not "write faster code". I worked for a "Web Analytics company" that tracked over 100 million hits a day by processing log files. Each hit was processed over 8 times, not to mention how often each process "touched" a particular byte. Everyone knew better, but the pressure was always on to create the next feature. The sad truth is most companies rely on the system, and "Moores law" to be performante.

  4. Terry Russell says:

    Part of the current problem is that everything is too much of a black box, and it takes too much effort/time to dig deeper and profile. Tools like Visual Studio need to provided stronger profiling techniques that allow for RAD of peformante code.

  5. Brad Abrams says:
  6. Terry: Why is that sad? If the system meets the needs of the company, then isn’t it doing its job? Hasn’t the company stated by doing this that these features are more beneficial to it them just speeding up the performance of the current system?

  7. Rico Mariani says:

    I like to think of the process as using the same discipline even if sometimes just a very small amount of perf work is all that’s called for.

    Looking at Cyrus comment above I wonder, how do we know that it’s meeting the needs of the company if we hadn’t got an idea what those needs were?

    And the answer: In the easy case the needs are so modest that any remotely sensible solution will meet them.

    So all the performance work you need do is check that your runtime (or whatever metric) is in say the "5 minute zone" or the "30 minute zone" or something like that. Very simple goal, little or no planning required to meet the goal, and ultra simple validation.

    These are the same principals you would use if it had been a harsh goal. But having checked them off you can be sure you’ve met the (modest) needs of your company.

  8. weipingh says:

    DSP compiler can do inline/loop fusion and get rid of the arrays, and use a scalar.

    I guess if BCL provide partial entry for split, it is possible for compiler do the convertion.

    However not sure how general it is.

  9. Aasma says:

    yaa i want to be aware of arrays n strings