And today we conclude our little saga with one final experiment, a little analysis and some comments. I hope you’ve enjoyed reading this series as much as I’ve enjoyed doing the experiments.
First, as promised I created yet another version of this little parser. This version is very similar to the last except now it’s using our new lightweight code generation to actually make native code for each predicate. You can find parser_jitting.cs in my articles area like the others.
And how does this new parser combination fare? Surely generating native code is going to win over my lame little stack machine yes? And I’ve even stacked the deck by doing an admittedly unrealistic 1,000,000 iterations over the predicates for just one compilation.
The results are as follows:
Wait a second, it’s slower! What happened?
OK I admit it; once again I threw you a curve ball. What’s going on here is that the predicates in the inital test case are fairly simple. There just isn’t very much code. On the other hand the preamble for a call through a delegate is not totally trivial. Normally it doesn’t matter much but here those functions are so small (maybe a half dozen instructions or so) that the function overhead is just killing us. So the delegates actually lose! Now for kicks I made a version that has the predicates hard coded, that one clocked in at a brisk 70ms (same test case), so that’s a good baseline for the fastest this could possibly be. I used that version as a sanity check and as a template for how to create the IL in my jitting version.
But all is not lost. As has been pointed out to me these predicates are not especially complicated, maybe unrealistically so. So while strictly speaking the analysis for the benchmark as given is done at this point let’s do a little more bonus work and see what we get.
Now we can see that if we can pay for the overhead of invoking a delegate then things go a lot better for the jitted version. But what about those compilation costs? Surely we shouldn’t just disregard the cost of invoking the JIT. So again while it wasn’t in my original formulation lets go on for even more bonus points and see what the cost of compliation is in a more realistic mix of compiling and evaluating. Here I added a LOT more predicates, duplicating the 9 above to get us above 2000 predicates. Then I turned the iteration count way down to 20 which I think is more like what I’ll see in the application that motivated this whole problem. Finally I eliminated the test output and moved the timer start to the top of Main. And here are the results for the “more realistic mix.”
I think it’s very clear which one I should pick for my actual project at this point. There just aren’t enough evaluations to pay for the cost of jitting. Even the original parser is holding up all right; with no compilation costs and fewer evaluations its problems are not being multiplied.
Now let me go back and answer my original questions directly.
Q1: What’s wrong with this parser from a performance perspective?
Fundamentally the problem is that evaluation has not been seperated from parsing and that as a consequence each piece of input is being examined much more often than it needs to be.
Q2: What should we be doing to improve it?
Seperate the costs of parsing from the costs of evaluation. Reduce the cost of evaluation by using a better truth mechanism.
Q3: How big of a difference is it likely to make?
We can guess at this by thinking about how often we look at any given character of input. We scan it in getc, then we switch on it (hash and equality), then we do a lookup (hash+equality), then we copy it onto the heap, there is no per-iteration cost of compacting it off the heap as dead strings on the heap do not have to be moved (only live strings are moved). So I’m seeing memory traffic of at least 8 times the sum of all the parse strings (per iteration). In contrast if the strings were pre-tokenized then the memory traffic is more like 1/2 the sum of all the strings (per iteration). So I’d ballpark the gains available to be around a factor of 16 just on that basis. We ended up getting somewhat more as we did more than just eliminate the pressure on the memory system.
Q4: If you hadn’t written all the code yet, what would you do to get a better idea what was going to matter and what wasn’t?
Well I think I gave away the answer in Q3. I’d approach this my looking at how much work I do per character of input on average and go from there. That gives us good ballpark numbers without too much work.
Thank you very much for all the emails and comments!