Performance Quiz #8 — The problems with parsing — Part 3


The thing about performance work is that it’s very easy to be fooled into looking into the wrong areas.  That’s why you want your changes to be firmly grounded in data whenever they can be.  Or if you’re planning, you want to be thinking about what your customer experience needs to be, what the key metrics will be, and how your overall architecture will support those needs.  Whatever it is that you’re doing should be as grounded as you can make it for the stage you are in.  After that you should be doing whatever you can to verify your assumptions, as often as you can at reasonable cost.  The trick is to not go crazy applying too much effort doing all this, because that’s just as bad.


Take this little parser example.  You could conclude that all that string allocation in GetToken() is the crux of the problem — and you’d be partly correct — but it’s not really the full picture.


To try to see what was going on, one of the first things I did was try to eliminate the GC as a source of possible issues while leaving the rest as similar as possible.  This actually illustrates a cool technique, even though it turns out that it’s not really the right approach.  I wanted to eliminate all of those Substring calls while parsing — since I expect we’ll be parsing over and over again.  The substring itself is just one example of looking at the input more than once — a definite no-no in any high-performance parser.  But let’s see where this leads us.


What I want to do is add a member that will hold the various fact and token strings that occur in the predicates once and for all.  I’d like to have something like this:


private static System.Xml.NameTable ntTokens = new System.Xml.NameTable();


Now the reason I like this class so much is that it’s one of the few that offers you the opportunity to create strings without having to first allocate a string.  You can call it starting with a character buffer and get a string back.  That’s just what we need… only the rest of the plumbing isn’t quite right.
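
As a quick illustration, here’s roughly how that works (this snippet is just a sketch with a made-up token buffer, not code from the parser):

// Adding the same run of characters twice hands back the same string
// instance, so a repeated token costs no new string allocation.
System.Xml.NameTable nt = new System.Xml.NameTable();

char[] buffer = "fact1&fact2".ToCharArray();   // example input only

string first  = nt.Add(buffer, 0, 5);          // "fact1" -- allocated once
string second = nt.Add(buffer, 0, 5);          // same characters again

Console.WriteLine(object.ReferenceEquals(first, second));   // True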


I want to make these changes:


<                return strInput.Substring(idx, ichInput - idx);
becomes
>                // returns an existing string if there is one
>                return ntTokens.Add(arrayInput, idx, ichInput - idx);


and


<                    return strInput.Substring(idx, 1);
becomes
>                    // returns an existing string if there is one
>                    return ntTokens.Add(arrayInput, idx, 1);


To do that I also have to change the input from a string to a char array:


<        private static string strInput = null;
becomes
>        private static char[] arrayInput = null;



<        static bool EvaluatePredicate(string predicate)
<        {
<            strInput = predicate;
becomes
>        static bool EvaluatePredicate(char[] predicate)
>        {
>            arrayInput = predicate;

There are a couple more places where strInput has to be changed to arrayInput, like the getc() function, but these are very straightforward.
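
Just to illustrate the shape of that change, here’s a hypothetical sketch of an array-based getc(); the real function isn’t reproduced in this post, so the return type and end-of-input handling here are assumptions:

// Hypothetical sketch only: the real getc() isn't shown in this post,
// so the end-of-input sentinel and return type are assumptions.
static char getc()
{
    if (ichInput >= arrayInput.Length)    // was strInput.Length
        return '\0';                      // assumed end-of-input sentinel

    return arrayInput[ichInput++];        // was strInput[ichInput++]
}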


For my test case I wanted the predicates in character arrays now, so I did this quick and dirty thing (in a real program you’d likely read the predicates from a file, so you could just arrange for them to be in char arrays in the first place):


        struct CharArray
        {
            public char[] ptr;
        }


        static CharArray[] predicateChars = new CharArray[predicates.Length];


and then later:

            for (i = 0; i < len; i++)
                predicateChars[i].ptr = predicates[i].ToCharArray();


            for (i = 0; i < len; i++)
                Console.WriteLine("{0} => {1}", predicates[i],
                      EvaluatePredicate(predicateChars[i].ptr));

And for the test loop itself simply call EvaluatePredicate using predicateChars.
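
In other words, the timed loop ends up with something like the shape below (the iteration count is a placeholder; only the argument to EvaluatePredicate actually changed):

// Placeholder shape of the timed loop -- only the argument changed,
// from predicates[i] to predicateChars[i].ptr.
for (int pass = 0; pass < iterations; pass++)
    for (i = 0; i < len; i++)
        EvaluatePredicate(predicateChars[i].ptr);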


OK, that little experiment is about the smallest change that removes any need to create substrings while changing nothing else.


On my test system the original code was taking 4.839s to run; the new version with these improvements takes 4.680s.  That’s about a 3.3% savings.


Now I have to tell you in all honesty that if you had asked me to comment on that change offhand, I think I would have told you to expect a much greater impact than just 3.3%.  Not that 3.3% is anything to sneeze at (see The Performance War — Win it 5% at a time), but allocation isn’t really the killer issue.


Let’s go back and look at what the performance report shows and see if any of this makes sense.


Well the telling line is this one right here:


Exclusive%  Inclusive%  Function Name
0.01        4.62        WKS::GCHeap::Alloc(unsigned int,unsigned long)

The sum total of the allocations on that substring path, including all the garbage collections, was only 4.62%.  It’s no surprise then that we got less than a 4% improvement by only eliminating the garbage and doing nothing else.  The collector was actually doing a pretty darn good job of cleaning up that trash on the cheap!


Even looking further at the string creation costs, you’ll see that InternalSubStringWithChecks has an inclusive time of about 22%.  That means the bulk of the time spent creating these substrings went into setting up for the array copy, moving the memory, bounds-checking the indexes, and so forth, rather than into the allocation itself.  We can only attribute 4.62% of the time to actually allocating the memory.  All of that checking and rechecking is costing us.


Let’s look elsewhere: the getc() function is taking a total of 10.49% of the time — that’s more than the collector!  It doesn’t do anything but a couple of checks into an in-memory array, but it’s called an awful lot.


Looking at some other interesting operations, you can see that just hashing strings in this run, so that we can do fact lookups, is costing us 2.9%.  That’s an astounding figure because, you’ll recall, ALL of our allocation activity was only 4.62%.  Who would have expected the allocation figure to be so low that just the string hashes could compete with it?  And that’s not all: if you look at the full comparison costs you’ll see that checking for existing entries in the hashtable accounts for 22% of the run time.  Again, that whole operation is something we do merely to find out whether any given fact is true or false.


Well, all of this really changed my mind about how to attack this problem.  My plan looks like this (a rough sketch follows the list):



  • preparse the predicates so that no string processing is required for evaluation
  • assign facts a number in sequence as they are encountered in the predicates
  • the preparsed predicate is a mix of fact numbers and operators probably in postfix order
  • when looking up facts simply access an array of truth bits indexed by fact number
  • as facts change simply change the array contents and then reevaluate predicates as desired
  • do the evaluation itself with a simple stack machine; no recursion in the evaluation
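
To make that concrete, here is a rough sketch of the kind of evaluator I have in mind; the opcode values, names, and encoding details below are placeholders rather than the code I’ll actually post:

// Rough sketch only: opcode values and names here are placeholders.
// A compiled predicate is an int[] in postfix order where values >= 0
// are fact numbers and negative values are operators.
const int OpAnd = -1, OpOr = -2, OpNot = -3;

static bool[] factValues;                  // truth bits indexed by fact number
static bool[] evalStack = new bool[64];    // assumed maximum expression depth

static bool EvaluateCompiled(int[] ops)
{
    int sp = 0;                            // stack pointer

    for (int i = 0; i < ops.Length; i++)
    {
        int op = ops[i];

        if (op >= 0)
        {
            evalStack[sp++] = factValues[op];          // push the fact's truth bit
        }
        else if (op == OpNot)
        {
            evalStack[sp - 1] = !evalStack[sp - 1];    // negate the top of stack
        }
        else
        {
            bool right = evalStack[--sp];              // pop the right operand
            evalStack[sp - 1] = (op == OpAnd)
                ? (evalStack[sp - 1] && right)
                : (evalStack[sp - 1] || right);
        }
    }

    return evalStack[0];                   // a well-formed predicate leaves one value
}

No strings, no hashing, and no recursion: evaluation touches nothing but a couple of small arrays.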

I’ll blog some code that does this in the next part and we’ll see how it does.


Any predictions based on what we’ve seen so far?


See the continuation in Performance Quiz #8 — The problems with parsing — Part 4

Comments (8)

  1. ricom says:

    It’s worth noting that many of the commenters on the original postings struck much closer to gold than the simple idea above to remove allocations. The best improvement suggestions have already incorporated other good ideas beyond allocation thinking. Well done!

  2. Have you considered replacing the framework’s hash map with something friendlier to caching? Take a look at Judy:

    http://judy.sourceforge.net/downloads/10minutes.htm

  3. Jon Young says:

    When you use a NameTable for matching, don’t you need to insert both tokens into the name table and then compare the references? Since NameTable must use some sort of hashing internally, doesn’t this approach just move the hash call to a different line in the program? Plus this would hurt garbage collection by locking every string parsed.

  4. Jon Young says:

    A good performance gain would be to let GetToken() calculate the hash code. This way the string is only read once. I tried this in my SubString method and got good results.

  5. ricom says:

    >>When you use a NameTable for matching, don’t you need to insert both tokens into the name table and then compare the references? Since NameTable must use some sort of hashing internally, doesn’t this approach just move the hash call to a different line in the program? Plus this would hurt garbage collection by locking every string parsed.

    The name table choice basically eliminates the allocations at the cost of doing one extra hash/compare on each token. Generally that’s a good trade-off because it helps locality and of course avoids extra allocations. Of course in this case the allocations were not costing us very much at all.

    There’s more that could be done, however. Once we’re using the nametable there is one canonical pointer for each token, so we could be using reference equality for the comparisons instead of re-hashing later on.

    As you say, you could also compute a hash on the fly while fetching the token.

    I used bigger ammunition in the proposed solution.

  6. Jeffrey Sax says:

    – preparse the predicates: since 10% of the work is in getc, this should have a sizeable effect on total execution time.

    – assign facts a number in sequence: very significant. Once again, we’re taking a costly operation out of an evaluation loop.

    – the preparsed predicate: how about an int array? Operators have values less than zero. Zero and up are indices to facts.

    – looking up facts: what is faster? bool array or bit array? My sense is bool array.

    – do the evaluation itself with a simple stack machine. This works especially well if you work with a virtual top-of-stack element. Every operation will involve this element, so the overhead of pushing on/popping off the stack can be eliminated.

    A major performance factor is whether we can use a static stack for evaluation. With several predicates being extremely short, the actual evaluation takes very little time, and *any* overhead will be highly significant. Locality is an issue here, too.

    Overall: we’ve taken all the significant costs out of the evaluation loop. The result should be an order of magnitude faster.

    Still, I question whether the current benchmark accurately reflects the performance properties of the real-life scenario. This is especially true now that we have separated the preprocessing step. 8 simple predicates is not a good sample of thousands. And what is the preprocessing/evaluation ratio? 1:10? 1:100? 1:1000?

  7. ricom says:

    Jeffrey writes:

    >>Still, I question whether the current benchmark accurately reflects the performance properties of the real-life scenario. This is especially true now that we have separated the preprocessing step. 8 simple predicates is not a good sample of thousands. And what is the preprocessing/evaluation ratio? 1:10? 1:100? 1:1000?

    You’re absolutely right; I think a more realistic preprocessing/evaluation ratio for my application is about 1:20 and I think I’d want about 2000 different predicates to simulate the true load. What I need is to write a predicate auto-generator or something like that.

    I could make it better by compiling the same predicate to opcodes more than once — as you can see the solution is not especially sensitive to the number of facts as they can be packed quite tightly.

    Accurately simulating the ratio is going to be much more important when we do an experiment like the one I’m thinking of for Part 5.

    Anyone got a predicate generator handy? 🙂

  8. In my last posting I made some recommendations about how to drastically improve the evaluation process