Performance Quiz #3 -- Solution

And now the exciting solution to Performance Quiz #3!

There's actually an enormous amount of philosophy that can be learned from this little example in terms of good habits, but I'd like to anchor this in the numbers, so let's begin with those.

Time Analysis

I did 4 runs with everything cached and everything jitted; this actually gives the slower RAD parser an advantage in terms of reported time because the input file isn't so terribly large that load time can be ignored.  These results were from my own desktop machine; I'm sure you'll get different numbers if you try the test, but the ratio should be similar.

RAD Parser: 1.092s
Classic Parser: 0.149s
Ratio: 7.33

So the RAD parser is over 7 times slower, and if I were to subtract out the cost of starting an empty program from both (about 50ms), then you'd get more like a factor of 10 for just the code we wanted to test.
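
If you want to reproduce the measurement, a harness along these lines will do.  This is just a sketch -- the executable name is a placeholder and DateTime's resolution (roughly 10-15ms) is plenty for these durations -- and note that it deliberately measures whole-process time, which is why the empty-program startup cost shows up in the numbers above.

using System;
using System.Diagnostics;

class Harness
{
    static void Main()
    {
        DateTime start = DateTime.Now;

        // launch the parser as a separate process so startup cost is included
        Process p = Process.Start("RadParser.exe");   // placeholder exe name
        p.WaitForExit();

        TimeSpan elapsed = DateTime.Now - start;
        Console.WriteLine("Elapsed: {0:F3}s", elapsed.TotalSeconds);
    }
}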

Space Analysis

These statistics were gathered by running the programs under CLRProfiler which tracks every managed allocation.

RAD Parser

Total Allocations: 48,307 objects, 2,357,857 bytes

Top Ten Allocated Types

   Count Bytes Type
42813 1408708 System.String
2514 713712 System.Int32 []
2515 200608 System.String []
14 10564 System.Object []
8 6183 System.Byte []
25 5690 System.Char []
171 3420 System.Text.StringBuilder
15 2160 System.Collections.Hashtable.bucket []
15 780 System.Collections.Hashtable
34 680 System.RuntimeType

Classic Parser

Total Allocations: 539 objects, 46,454 bytes

Top Ten Allocated Types

   Count Bytes Type
253 15262 System.String
14 10564 System.Object []
5 4864 System.Byte []
14 3712 System.Int32 []
21 3456 System.Char []
14 2016 System.Collections.Hashtable.bucket []
14 728 System.Collections.Hashtable
14 564 System.String []
6 384 System.Globalization.CultureInfo
16 320 System.RuntimeType

That result is simply stunning: it's over 50 times the number of allocated bytes, most of which obviously were just thrown away.  Lucky for us the GC was able to regularly reclaim that memory, so the impact on total working set was considerably less (it should end up being near the size of your L2 cache because of GC heuristics).  Nonetheless, allocating all that memory was not a good thing for performance.

Let's look at the allocations in a bit more detail.

I extracted the top 10 allocating stacks from both runs out of the CLRProfiler log and I've got them here in text format. 

Let's do the classic parser first this time because we can dispense with it fairly quickly.  I'm only going to show the first two of the top ten stacks.

Stack 1 allocates 7590 bytes
6 6240 System.Object []
1 72 System.OutOfMemoryException
1 72 System.StackOverflowException
1 72 System.ExecutionEngineException
1 72 System.Threading.ThreadAbortException
1 20 System.SharedStatics
1 108 System.AppDomain
11 782 System.String
1 20 System.String []
4 80 System.Security.Permissions.HostProtectionAttribute
1 24 System.Reflection.Assembly
1 28 Microsoft.Win32.Win32Native.InputRecord

Stack 2 allocates 4702 bytes
ClassicParser::Main static void (String[])
1 76 System.IO.FileStream
1 4108 System.Byte []
6 506 System.String
1 12 System.Int32

Now what's interesting here is that the main function only appears in stack #2: you can see the byte array there, a few string literals, one boxed integer, and of course the file stream itself.  Some of those must be coming from inlining because they don't appear directly in the source code for Main (it might make another interesting blog entry to run all these down actually).  The biggest source of allocations had no managed stack associated with it at all -- these are objects created by the runtime at startup to set up the environment, open the standard file streams, and get that first appdomain running.  There are actually more of those further down, each for even smaller numbers of bytes, and all of them appear in the RAD parser as well.

So the bottom line is that of the 46,454 bytes allocated in total, our parser is directly responsible for a mere 4,702 of them.  Wow.
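
To make the contrast concrete, here's roughly the shape a classic-style parser takes.  This is a sketch of my own, not necessarily the quiz code verbatim, but it shows why so little gets allocated: one FileStream (the 4108-byte Byte [] in stack 2 looks like its default 4K internal buffer) and then the digits are accumulated in place with no intermediate strings.

using System;
using System.IO;

class ClassicSketch
{
    static void Main(string[] args)
    {
        FileStream f = new FileStream(args[0], FileMode.Open, FileAccess.Read);
        long total = 0;                   // some cheap use for each parsed number
        int c = f.ReadByte();             // reads go through the one internal 4K buffer

        while (c != -1)
        {
            // skip anything that can't start a number
            while (c != -1 && c != '-' && (c < '0' || c > '9'))
                c = f.ReadByte();
            if (c == -1) break;

            bool neg = false;
            if (c == '-') { neg = true; c = f.ReadByte(); }

            int val = 0;
            while (c >= '0' && c <= '9')  // accumulate digits in place -- no Strings
            {
                val = val * 10 + (c - '0');
                c = f.ReadByte();
            }
            // note: a bare '-' token slips through here as 0 -- see the discussion below
            total += neg ? -val : val;
        }

        f.Close();
        Console.WriteLine(total);
    }
}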

Let's look now at the RAD parser stacks; there's a lot more meat on these.

Stack 1 allocates 985000 bytes
RadParser::Main static void (String[])
System.String::Split String[] (wchar[] int32 System.StringSplitOptions)
System.String::InternalSplitKeepEmptyEntries String[] (int32[] int32[] int32 int32)
System.String::Substring String (int32 int32)
40000 985000 System.String

Stack 2 allocates 710034 bytes
RadParser::Main static void (String[])
System.String::Split String[] (wchar[] int32 System.StringSplitOptions)
2500 710000 System.Int32 []
1 34 System.String

Stack 3 allocates 359436 bytes
RadParser::Main static void (String[])
System.IO.StreamReader::ReadLine String ()
System.String::CtorCharArrayStartLength String (wchar[] int32 int32)
2334 359436 System.String

Stack 4 allocates 200000 bytes
RadParser::Main static void (String[])
System.String::Split String[] (wchar[] int32 System.StringSplitOptions)
System.String::InternalSplitKeepEmptyEntries String[] (int32[] int32[] int32 int32)
2500 200000 System.String []

Stack 5 allocates 41208 bytes
RadParser::Main static void (String[])
System.IO.StreamReader::ReadLine String ()
System.Text.StringBuilder::.ctor void (String int32)
System.Text.StringBuilder::.ctor void (String int32 int32 int32)
System.String::GetStringForStringBuilder static String (String int32 int32 int32)
166 41208 System.String

Stack 6 allocates 7590 bytes
6 6240 System.Object []
1 72 System.OutOfMemoryException
1 72 System.StackOverflowException
1 72 System.ExecutionEngineException
1 72 System.Threading.ThreadAbortException
1 20 System.SharedStatics
1 108 System.AppDomain
11 782 System.String
1 20 System.String []
4 80 System.Security.Permissions.HostProtectionAttribute
1 24 System.Reflection.Assembly
1 28 Microsoft.Win32.Win32Native.InputRecord

All I can say is wow.  We have to get down to stack 6 before we see the 7590 bytes of root allocations.  You can see there's significant cost associated with splitting the strings -- making the arrays and the string pieces.  Another hidden cost leaps right out here as well -- since we don't know in advance how many string parts there are, Split makes a pass over the string to find the delimiter offsets, and those are stored in a temporary integer array (that's the Int32 [] in stack 2).
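
For reference, the loop that produces those stacks looks more or less like this.  Again a sketch of my own -- the comma delimiter and the names are illustrative assumptions -- but the mapping from each line of code to the stacks above is the point:

using System;
using System.IO;

class RadSketch
{
    static void Main(string[] args)
    {
        StreamReader r = new StreamReader(args[0]);
        long total = 0;
        string line;

        while ((line = r.ReadLine()) != null)    // stacks 3 and 5: a String per line
        {
            string[] fields = line.Split(',');   // stacks 1, 2, and 4: the string
                                                 // pieces, the temporary Int32 [] of
                                                 // offsets, and the String [] result
            foreach (string field in fields)
                total += int.Parse(field);       // one more pass over the characters
        }

        r.Close();
        Console.WriteLine(total);
    }
}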

Cost Per Byte

Another way of thinking about these parsers is to consider how many times each byte of the input file has to be touched.  In the RAD parser we have to do newline parsing, then scanning for delimiters and splitting, and then numeric conversions, so it's readily apparent that there will be many memory touches for each byte of the file.  Consider: counting the CRLF, there are 70 bytes per line of input and 2500 such lines, so that's 175,000 bytes to read.  But we allocated over 13 times that many bytes to process the file the RAD way.
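
Spelling that arithmetic out:

2500 lines x 70 bytes/line = 175,000 bytes of input
2,357,857 bytes allocated / 175,000 bytes of input = about 13.5 allocated bytes per input byte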

The classic parser is very economical by comparison.  Even if the data had not been in the disk cache, we would only add a few milliseconds of overhead to both algorithms for reading such a small file.  The memory management costs of the RAD way are easily greater than any reasonable I/O cost.

Warnings and Good Practice

So should you all abandon your RAD practices and go study Knuth's parsing techniques?  Well... I dunno about that.  But there are definitely some lessons here.  I think my message is this: if you choose the RAD parser you have made your coding that much easier, but you still have a responsibility to measure the quality of your solution against some test data and see what your performance is going to be like.  Every professional piece of software should have performance goals, even if they are very simple ones that are very easy to meet.  Sometimes a goal like "Yeah, 10 seconds is fine" is all it takes; other times we have to have strict time and space budgets.  If you were in "Yeah, 10 seconds is fine" mode then I think the RAD parser is a fine choice.

And here's the 2nd half of the message.  If you decide to do something fancier than the RAD way, you can easily find yourself in a world of hurt on correctness issues.  Even this fairly simple parsing problem has plenty of potential for mistakes -- I put in error checks for three cases, but did you notice that there's no error check for an input token that is just a "-"?  Some people noticed...  It's easy to make mistakes when you are reaching for sharper tools, which is all the more reason to understand your performance goals so that you can choose a plan that will economically meet them.  Is the "classic" parser approach doomed to be full of errors?  No, I think not.  We just have to be more careful; there are well-understood techniques for creating provably correct parsers, but we'll obviously be investing more in our algorithm to get there.
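
For what it's worth, the fix for that particular hole is tiny.  In the style of the classic sketch above (again, a sketch of my own -- this helper is hypothetical, not the quiz code):

using System;
using System.IO;

class SafeNumberReader
{
    // Reads one integer token, rejecting a bare "-" (the hole noted above).
    static int ReadNumber(FileStream f, ref int c)
    {
        bool neg = false;
        if (c == '-') { neg = true; c = f.ReadByte(); }

        if (c < '0' || c > '9')               // the missing error check
            throw new FormatException("expected a digit after '-'");

        int val = 0;
        while (c >= '0' && c <= '9')
        {
            val = val * 10 + (c - '0');
            c = f.ReadByte();
        }
        return neg ? -val : val;
    }
}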

So there's your balance -- don't blindly choose the easiest approach; it may be much too awful.  Don't naively choose a complex approach; the effort required to achieve correctness may be daunting.

Set quantitative goals so that you can objectively make the necessary plans to meet them.