Performance Quiz #8 — The problems with parsing — Part 1

I think I like parsers too much, I end up writing about them a lot.  Maybe that’s because about one program in three is, loosely speaking, a parser of some kind.  All the more reason why we should be very good about writing them.

Unlike some of my previous quizes I’m not exactly sure where this one is going to go.  I have a few ideas but I’m going to go on the voyage of discovery with you this time and see where it leads us; and maybe we’ll take some twists and turns that I didn’t expect based on what you all suggest.  That should make it all the more fun.

To start us out, I wrote a quick and dirty little parser.  It’s motivated by a parser that’s in a piece of code I’m working on right now. This parser basically takes boolean expressions that can look something like this: “(fact1 & fact2) | !fact3” and evaluates them to true or false based on whether or not fact1, fact2… etc. are true.  The design is to assume that comparatively few facts will be true in any given run (maybe a few dozen) but there could be thousands of predicates (expressions) to evaluate.    I’ve made a little harness that has the parser and some test code wrapped around it and it’s posted here:

This time I’m going to ask some open ended questions:

Q1: What’s wrong with this parser from a performance perspective?  

Q2: What should we be doing to improve it? 

Q3: How big of a difference is it likely to make?

These questions touch on the cornerstone of good engineering practices — understanding what matters and what does not and prioritizing the work so that you don’t spend a lot of time working on things that ultimately will not matter.  But to really cement this think about what I consider to be the hardest the most important question of all:

Q4: If you hadn’t written all the code yet, what would you do to get a better idea what was going to matter and what wasn’t?

See the continuation in Performance Quiz #8 — The problems with parsing — Part 2

Comments (19)

  1. ricom says:

    I wrote the parser as a throwaway so there’s several corners cut to keep the code simple while retaining its usefulness as a benchmark. Feel free to comment on non-performance aspects of the code if you like — it might be educational to many of the readers — but I probably won’t say much more on those topics other than "of course you’re right" 🙂

  2. we could start by using a ref parameter in the EvaluatePredicate function

  3. Evin says:

    You say there may be thousands of predicates. Will each predicate be evaluated only once, or will each predicate have a chance of running thousands of times (as the benchmark does)?

    If each predicate will be run many times, it would definitely be worthwhile to build a data-structure or generate IL for each predicate. Perhaps a 5x speed increase from doing this.

    If each predicate will be run only a few times, it’s much harder. I’d suggest stopping the parse as soon as you definitely knew the answer. For example, with "false & …. | something", when you reach the "&", you could skip over the "…." part (paying attention to parentheses!) until you reached an operator of lower precedence like the "|". I’m not sure how much faster this will be than the original parser, and it will be significantly more complicated.

  4. ricom says:

    Let me clarify that: the predicates tend to be evaluated in a batch. So if there were 1000 predicates we might run all 1000 of them and then a little later run all 1000 of them again and so forth. The benchmark magnifies the effect but I didn’t want to burden the test case with 1000s of predicates so I substituted extra repetitions. I think that *will* matter somewhat in the results so we’ll have to consider that when choosing where to go from here.

    Compiling each predicate into IL certainly could be done.

    Reader exercise: How could we get an idea what benefits that might bring?

  5. SeanH says:

    Without actually running the code my guesses are below,


    (1) String allocations in GetToken() might be an issue

    (2) Memory locality might be an issue

    (3) Recursion might be an issue because of the time to create and destroy stacks


    (1) a. convert strInput to a char[]

    b. move strInput to EvaluatePredicate() and use stackalloc

    (2) a. move ichInput to EvaluatePredicate()

    b. pass strInput and ichInput by ref to methods that need them

    (3) Consider refactoring to the parser to have a dispatcher

  6. SeanH says:

    Whoops, forgot 1 more guess …


    (1) c. return char[] index range from GetToken()

  7. Vadmyst says:

    q2. if the expression will be long enough, recursion may be the issue.

    q4. I think that extensibility and parser efficiency would matter.

  8. Dave Boyle says:

    Is this a trick question, Rico? 🙂

    If it is, then my answer is that the problem is NP-Complete as it is equivalent to SAT3.


    Dave Boyle

  9. Jon Young says:

    You can get a factor of a 1000 if you change string GetToken to char GetToken.

    char GetToken returns:

    ‘X’ for null

    ‘T’ for a true fact

    ‘F’ for a false fact

    ‘!’,’&’,’|’,'(‘,’)’ for operators

  10. Jon Young says:

    Oops copied the wrong number. It should read "a factor of 1.2".

  11. Jon Young says:

    I would also add

    Thread.Sleep( 1000 );



    Otherwise the updating the icon in task bar is included in the timing.

  12. Dave R. says:

    Recursion. I’d like to see whether a stack-based implementation made a difference. However, I’ve not written a parser so apologies if I’m misinformed.

  13. Quickie optimizations:

    – You create a substring object per token. You could use an array-of-characters for the expression and pass around structs with the start and end indexes for a token. Win: average_num_tokens_per_expression * (overhead to create a string + overhead to copy an average token) – overhead_to_pass_around_those_structs

    – Short-circuiting for binary operations, though you would still have to parse the expression.

    More involved optimizations:

    – You said, "there could be thousands of predicates (expressions) to evaluate". Is each of those predicates evaluated only once? Or could it be evaluated multiple times?

    – Build a parse tree for each predicate. Find common subexpressions among the parse trees.

    – The value of each "fact" doesn’t ever change, does it? If it doesn’t, you could memoize the results for the common subexpressions. This would be a big benefit if coupled with short-circuiting on the parse tree.

    – Do you know the value of each "fact" in advance, or do you prefer to compute it only when you access it? If it is the latter, of course memoize it, assuming that the value of a "fact" never changes.

    – Use Reflection.Emit to generate code for each predicate, using the computed/memoized values of each "fact", and assume that the JIT will take care of optimizing it.

  14. ricom says:

    The facts must be assumed to change over time otherwise you could just evaluate the predicates once and for all which would make the test case very much less interesting.

    This isn’t a trick question, but of course there is no completely trivial answer or anything. It’s just an interesting problem.

  15. Jeffrey Sax says:

    Q1: The biggest issue is the use of strings. The scanner doesn’t classify the tokens. It merely chops the input up into pieces.

    Q2: The simplest change: Use string constants for the delimiter tokens, and use Object.ReferenceEquals to check for equality. (This works because the string constants are interned. The interned token strings act as token identifiers.)

    A more complex change is to throw out the recursive descent parser + tokenizer and replace them with a hand-coded parser. The language is simple enough to do that. There is some duplication between the scanner and the parser that would be eliminated this way.

    Q3. I’ve always found this hard to quantify. It’s especially hard to predict the impact of the hand-crafted parser. It may not end up being much faster than the optimized version of the current code.

    Q4. We know that doing lots of small allocations can be relatively costly if there is not much other code being executed. With a parser, I would certainly look at ways to avoid unnecessary allocations.

    Two comments on the test data:

    1. The timings are somewhat dependent on the frequency and ordering of the operators, as well as the complexity of the predicates. There is too much variation at the moment to make accurate timings for minor optimizations.

    2. The assumption that "comparatively few facts will be true" is violated: 60% of the used facts are true.

    And a comment on the .NET BCL and parsing: On several occasions I have found it would be useful to have "substring" overloads of methods like TryParse, i.e. something like TryParse(string, startIndex, out result, out endIndex). Dictionary lookups on a substring would be useful in this scenario.

  16. Norman Diamond says:

    When I write parsers by hand I find it easier to always have a lookahead token instead of the repetitive testing needed to decide whether to get one and then consume it. But this is minor from a performance point of view, just a simplification to the development effort.

    As for reading GetToken() (or the rest of the source), my eyes can’t handle it. Did you parse your parser after posting it?

  17. ricom says:

    Jeffrey Writes:

    >>2. The assumption that "comparatively few facts will be true" is violated: 60% of the used facts are true.

    Perhaps I could have been clearer — the number of true facts is comparatively small. I didn’t mean to say anything about the ratio.

  18. ricom says:

    Norman writes:

    >>As for reading GetToken() (or the rest of the source), my eyes can’t handle it. Did you parse your parser after posting it?

    UGH! When I changed the source from <PRE></PRE> to just fixed pitch fonts so that I didn’t get bizarre paragraph breaks I inadvertantly caused all the & < and > to be HTML escaped. Yuck! I fixed it…

  19. There were some great comments from the initial posting which I encourage you to read. Many people latched