# A systematic approach to finding performance regressions using overweight analysis

I have been using this approach to do systematic analysis of performance regressions for several years now. I came up with it while looking at some tricky problems in Internet Explorer about three years ago and it’s served me well since then. The idea is a pretty simple one, but it gives surprisingly good results in many cases.

I’ll be giving examples that talk about CPU as the metric but of course the same procedure works for any metric for which you can compute inclusive costs.

Nomenclature: Inclusive cost (e.g. time) is the cost in a method and everything it calls. Exclusive cost (e.g. time) is the cost from only the method itself, not counting anything it calls. Both are interesting but this technique really relies on inclusive cost.

Now the usual situation: you have some test that used to take, say, 50ms and now takes 55ms. That’s 10% growth. You want to know where to start looking, and you’re fortunate enough to have a summary of costs from before and after. But there could be thousands of symbols and the costs could be spread all over the place. Also, some symbols might have been renamed, or other such inconvenient things. You could try staring at the traces in a call-tree view, but that gets very tedious, especially if the call stacks are 50 levels deep or so. It’s when things get big and messy that having an analysis you can automate is helpful. So here’s how I do it.

First I consider only symbols that appear in both traces. That’s not everything, but it’s a lot, and it’s typically enough to give you a solid hint. For each symbol I know the inclusive cost in the base case and the test case, and from those I can easily compute the delta telling me how much it grew. Now the magic. Since I know how much the overall scenario regressed (10% in this example), I can easily compute how much any particular symbol should have gotten slower if we take as our null hypothesis that “bah, it’s all just evenly slower because it sucks to be me.” So a symbol that had a previous cost of 10 in my example should have a growth of 10%, or a delta of 1. We compute the ratio of the actual delta to that expected delta, call it the “overweight percentage,” and then we sort on it. And then stuff starts popping out like magic.
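The formula is simple enough to fit in a few lines. Here is a minimal sketch (Python; the function name is mine, not from the original code):

```python
def overweight_percent(base_cost, test_cost, base_total, test_total):
    """Ratio of a symbol's actual growth to the growth it 'should' have had
    if the whole scenario were uniformly slower, expressed as a percentage."""
    actual_delta = test_cost - base_cost
    # Null hypothesis: the symbol grows at the same rate as the overall run.
    expected_delta = base_cost * (test_total - base_total) / base_total
    return 100.0 * actual_delta / expected_delta
```

For the example above, a symbol that goes from 10 to 11 while the run goes from 50 to 55 has an actual delta of 1 and an expected delta of 1, so its overweight is exactly 100%.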

I’ll have more examples shortly but let’s do a very easy one so you can see what’s going on. Suppose main calls f and g and does nothing else. Each takes 50ms for a total of 100ms. Now suppose f gets slower, to 60ms. The total is now 110, or 10% worse. How is this algorithm going to help? Well, let’s look at the overweights. Of course main went from 100 to 110, or 10%; it’s all of the regression, so the expected growth is 10 and the actual is 10. Overweight 100%. Nothing to see there. Now let’s look at g: it was 50, stayed at 50. But it was “supposed” to go to 55. Overweight 0/5 or 0%. And finally, our big winner, f: it went from 50 to 60, a gain of 10. At 10% growth it should have gained 5. Overweight 10/5 or 200%. It’s very clear where the problem is! But actually it gets even better. Suppose that f actually had two children x and y. Each used to take 25ms but now x slowed down to 35ms. With no gain attributable to y, the overweight for y will be 0%, just like g was. But if we look at x we will find that it went from 25 to 35, a gain of 10, and it was supposed to grow by merely 2.5, so its overweight is 10/2.5 or 400%. At this point the pattern should be clear:

The overweight number keeps going up as you get closer to the root of the subtree which is the source of the problem. Everything below that will tend to have the same overweight. For instance if the problem is that x is being called one more time by f you’d find that x and all its children have the same overweight number.

This brings us to the second part of the technique. You want to pick a symbol that has a big overweight but is also responsible for a largeish fraction of the regression. So we compute its growth and divide by the total regression cost to get the responsibility percentage. This is important because sometimes you get leaf functions that had 2 samples and grew to 3 just because of sampling error. Those could look like enormous overweights, so you have to concentrate on methods that have a reasonable responsibility percentage and also a big overweight.
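The responsibility computation is equally small. A sketch (again, my own function name):

```python
def responsibility_percent(base_cost, test_cost, base_total, test_total):
    """Fraction of the overall regression attributable to this symbol's
    own growth, expressed as a percentage. Can exceed 100% if the symbol
    regressed more than the total (other code got faster), or go negative
    if the symbol actually improved."""
    return 100.0 * (test_cost - base_cost) / (test_total - base_total)
```

A leaf that went from 2 samples to 3 in a 90-to-95 run has a responsibility of only 20%, so a huge overweight there can safely be deprioritized.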

Below are some examples as well as the sample program I used to create them and some inline analysis.

Example 1, baseline

The sample program uses a simulated set of call-stacks and costs for its input. Each line represents a call chain and the time spent in that call chain. So, for instance, the first line means 5 units in main. The second line means 5 units in f when called by main. Together those would make 10 units of inclusive cost for main and 5 for f. The next line is 5 units in j when called by f when called by main: main's total goes up to 15 inclusive, f goes to 10, and j begins at 5. This particular example is designed to spread some load all over the tree so that I can illustrate variations from it.

main, 5
main/f, 5
main/f/j, 5
main/f/j/x, 5
main/f/j/y, 5
main/f/j/z, 5
main/f/k, 5
main/f/k/x, 5
main/f/l, 5
main/f/l/x, 5
main/g/j, 5
main/g/j/x, 5
main/g/j/y, 5
main/g/j/z, 5
main/g/k, 5
main/g/k/x, 5
main/g/k/y, 5
main/g/k/z, 5
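As a sketch of how such input turns into inclusive costs (Python; my own illustration, not the original sample program), each chain's cost is credited to every distinct symbol on the chain:

```python
from collections import defaultdict

def inclusive_costs(lines):
    """Accumulate per-symbol inclusive cost from 'path, cost' lines.
    Each distinct symbol on a chain gets the chain's full cost once."""
    totals = defaultdict(float)
    for line in lines:
        path, _sep, cost = line.rpartition(",")
        for symbol in set(path.strip().split("/")):  # once per chain, even if repeated
            totals[symbol] += float(cost)
    return dict(totals)

costs = inclusive_costs(["main, 5", "main/f, 5", "main/f/j, 5"])
# main: 15, f: 10, j: 5 -- exactly the running totals described above
```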

Example 2, in which k costs more when called by f

This one line is changed. Other appearances of k are not affected, just the one place.

main/f/k, 10

Example 3, in which x always costs a little more

All the lines that end in x became 6 instead of 5. Like this:

main/f/j/x, 6

Example 4, in which f calls j more so that subtree gains cost

All the lines under f/j got one bigger like so:

main/f/j, 6
main/f/j/x, 6
main/f/j/y, 6
main/f/j/z, 6

And finally example 5, in which x gets faster but k gets a lot slower

All the x lines get a little better:

main/f/j/x, 4

But the k line got worse in two places

main/f/k, 15
main/g/k, 15

Let's see how we do with automated analysis of those things:
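The whole report computation can be sketched in a few lines (Python; my own illustration rather than the actual report code, and it assumes the root symbol is named main and that the run actually regressed):

```python
def overweight_report(base, test):
    """Given dicts of inclusive costs per symbol for the base and test runs,
    return rows of (name, base, test, delta, responsibility %, overweight %)
    for symbols present in both runs, sorted by overweight descending."""
    total_delta = test["main"] - base["main"]  # assumes 'main' is the root
    growth = total_delta / base["main"]
    rows = []
    for name in base.keys() & test.keys():  # only symbols in both traces
        delta = test[name] - base[name]
        responsibility = 100.0 * delta / total_delta
        expected = base[name] * growth
        overweight = 100.0 * delta / expected if expected else 0.0
        rows.append((name, base[name], test[name], delta, responsibility, overweight))
    rows.sort(key=lambda row: row[-1], reverse=True)
    return rows
```

Running it on the easy main/f/g example from earlier puts f at the top with 200% overweight, main at 100%, and g at 0%.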

### Summary of Inclusive times for example 1, baseline

| Symbol | Inclusive Cost | Exclusive Cost |
| --- | --- | --- |
| main | 90 | 5 |
| f | 45 | 5 |
| g | 40 | 0 |
| j | 40 | 10 |
| k | 30 | 10 |
| x | 25 | 25 |
| y | 15 | 15 |
| z | 15 | 15 |
| l | 10 | 5 |

This gives us the baseline of 90 units for main and you can see how all the "5" costs spread throughout the tree.

### Summary of Inclusive times for example 2, in which k costs more when called by f

| Symbol | Inclusive Cost | Exclusive Cost |
| --- | --- | --- |
| main | 95 | 5 |
| f | 50 | 5 |
| g | 40 | 0 |
| j | 40 | 10 |
| k | 35 | 15 |
| x | 25 | 25 |
| y | 15 | 15 |
| z | 15 | 15 |
| l | 10 | 5 |

You can see that k has gone up a bit here but not much.  A straight diff would show you that.  However there's more to see.  Let's look at the first overweight report.

### Overweight Report

Before: example 1, baseline
After: example 2, in which k costs more when called by f

Before Time: 90
After Time: 95
Overall Delta: 5

Analysis:

| Name | Base Cost | Test Cost | Delta | Responsibility % | Overweight % |
| --- | --- | --- | --- | --- | --- |
| k | 30.0 | 35.0 | 5.0 | 100.00 | 300.00 |
| f | 45.0 | 50.0 | 5.0 | 100.00 | 200.00 |
| main | 90.0 | 95.0 | 5.0 | 100.00 | 100.00 |
| j | 40.0 | 40.0 | 0.0 | 0.00 | 0.00 |
| x | 25.0 | 25.0 | 0.0 | 0.00 | 0.00 |
| y | 15.0 | 15.0 | 0.0 | 0.00 | 0.00 |
| z | 15.0 | 15.0 | 0.0 | 0.00 | 0.00 |
| l | 10.0 | 10.0 | 0.0 | 0.00 | 0.00 |
| g | 40.0 | 40.0 | 0.0 | 0.00 | 0.00 |

OK, the report clearly shows that k is overweight and so is f. So that gives us a real clue that it's k when called by f that is the problem. And also it's k's exclusive cost that is the problem, because all its normal children have 0% overweight. Note that there is a clear difference between methods with otherwise equal deltas.

### Summary of Inclusive times for example 3, in which x always costs a little more

| Symbol | Inclusive Cost | Exclusive Cost |
| --- | --- | --- |
| main | 95 | 5 |
| f | 48 | 5 |
| g | 42 | 0 |
| j | 42 | 10 |
| k | 32 | 10 |
| x | 30 | 30 |
| y | 15 | 15 |
| z | 15 | 15 |
| l | 11 | 5 |

In this second comparison you could again see something in the raw numbers because x is bigger, but it doesn't really pop here, and many methods seem to have been affected. A straight diff wouldn't tell you nearly as much.

### Overweight Report

Before: example 1, baseline
After: example 3, in which x always costs a little more

Before Time: 90
After Time: 95
Overall Delta: 5

Analysis:

| Name | Base Cost | Test Cost | Delta | Responsibility % | Overweight % |
| --- | --- | --- | --- | --- | --- |
| x | 25.0 | 30.0 | 5.0 | 100.00 | 360.00 |
| l | 10.0 | 11.0 | 1.0 | 20.00 | 180.00 |
| f | 45.0 | 48.0 | 3.0 | 60.00 | 120.00 |
| k | 30.0 | 32.0 | 2.0 | 40.00 | 120.00 |
| main | 90.0 | 95.0 | 5.0 | 100.00 | 100.00 |
| j | 40.0 | 42.0 | 2.0 | 40.00 | 90.00 |
| g | 40.0 | 42.0 | 2.0 | 40.00 | 90.00 |
| y | 15.0 | 15.0 | 0.0 | 0.00 | 0.00 |
| z | 15.0 | 15.0 | 0.0 | 0.00 | 0.00 |

Well now things are leaping right off the page. We can see that x is the biggest source of the regression, and that l and k are also implicated, with f and k bearing equal overweight. We can also see that some branches are underweight: the j and g paths grew less than the overall average because of the distribution of calls.

### Summary of Inclusive times for example 4, in which f calls j more so that subtree gains cost

| Symbol | Inclusive Cost | Exclusive Cost |
| --- | --- | --- |
| main | 94 | 5 |
| f | 49 | 5 |
| j | 44 | 11 |
| g | 40 | 0 |
| k | 30 | 10 |
| x | 26 | 26 |
| y | 16 | 16 |
| z | 16 | 16 |
| l | 10 | 5 |

Again, with so few symbols a straight analysis does show the problem; however, it's much clearer below...

### Overweight Report

Before: example 1, baseline
After: example 4, in which f calls j more so that subtree gains cost

Before Time: 90
After Time: 94
Overall Delta: 4

Analysis:

| Name | Base Cost | Test Cost | Delta | Responsibility % | Overweight % |
| --- | --- | --- | --- | --- | --- |
| j | 40.0 | 44.0 | 4.0 | 100.00 | 225.00 |
| f | 45.0 | 49.0 | 4.0 | 100.00 | 200.00 |
| y | 15.0 | 16.0 | 1.0 | 25.00 | 150.00 |
| z | 15.0 | 16.0 | 1.0 | 25.00 | 150.00 |
| main | 90.0 | 94.0 | 4.0 | 100.00 | 100.00 |
| x | 25.0 | 26.0 | 1.0 | 25.00 | 90.00 |
| k | 30.0 | 30.0 | 0.0 | 0.00 | 0.00 |
| l | 10.0 | 10.0 | 0.0 | 0.00 | 0.00 |
| g | 40.0 | 40.0 | 0.0 | 0.00 | 0.00 |

The j method is the worst offender; y and z are getting the same impact due to the extra calls from j, and the extra j calls apparently come from f.

### Summary of Inclusive times for example 5, in which x gets faster but k gets a lot slower

| Symbol | Inclusive Cost | Exclusive Cost |
| --- | --- | --- |
| main | 105 | 5 |
| f | 52 | 5 |
| g | 48 | 0 |
| k | 48 | 30 |
| j | 38 | 10 |
| x | 20 | 20 |
| y | 15 | 15 |
| z | 15 | 15 |
| l | 9 | 5 |

Now we have some soup.  It is worse but things are a bit confused.  What's going on?

### Overweight Report

Before: example 1, baseline
After: example 5, in which x gets faster but k gets a lot slower

Before Time: 90
After Time: 105
Overall Delta: 15

Analysis:

| Name | Base Cost | Test Cost | Delta | Responsibility % | Overweight % |
| --- | --- | --- | --- | --- | --- |
| k | 30.0 | 48.0 | 18.0 | 120.00 | 360.00 |
| g | 40.0 | 48.0 | 8.0 | 53.33 | 120.00 |
| main | 90.0 | 105.0 | 15.0 | 100.00 | 100.00 |
| f | 45.0 | 52.0 | 7.0 | 46.67 | 93.33 |
| y | 15.0 | 15.0 | 0.0 | 0.00 | 0.00 |
| z | 15.0 | 15.0 | 0.0 | 0.00 | 0.00 |
| j | 40.0 | 38.0 | -2.0 | -13.33 | -30.00 |
| l | 10.0 | 9.0 | -1.0 | -6.67 | -60.00 |
| x | 25.0 | 20.0 | -5.0 | -33.33 | -120.00 |

Now again things are a lot clearer.  Those negative overweights are showing gains where there should be losses.  x is helping.  And k jumps to the top with a big 360%.  And it's 120% responsible for this mess, meaning not only did it cause the regression it also wiped out gains elsewhere.

In practice negatives are fairly common because sometimes costs move from one place to another. Sometimes this happens for perfectly normal reasons: in IE, for instance, a layout could be caused by a timer for a paint rather than by an explicit request from script, but we still get one layout, so the cost just moved a bit. The overweights would show nothing new in the layout space but a big shift between timer events and script cost.

In practice this approach has been very good at finding problems in deep call stacks. It even works pretty well if some of the symbols have been renamed, because usually you'll find some symbol that was just above or below the renamed symbol to serve as your starting point for investigation.

Finally you can actually use this technique recursively.  Once you find an interesting symbol ("the pivot") that has a big overweight, you regenerate the inclusive costs but ignore any stacks in which the pivot appears.  Search for new interesting symbols in what's left the same way and repeat.
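The pivot-filtering step can be sketched like this, assuming stacks in the same "path, cost" text form used in the examples above (the function name is mine):

```python
def exclude_pivot(lines, pivot):
    """Keep only the call chains that never pass through the pivot symbol,
    so the remaining stacks can be re-summarized and re-analyzed as before."""
    kept = []
    for line in lines:
        path, _sep, _cost = line.rpartition(",")
        if pivot not in path.strip().split("/"):
            kept.append(line)
    return kept
```

Feed the surviving lines back through the same inclusive-cost summary and overweight report, and the next-biggest offender surfaces.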

The code that generated these reports is here.

Appendix

As an afterthought I ran an experiment where I did the "recursion" on the last test case.  Here are the results:

### Summary of Inclusive times for example 6, baseline with k removed

| Symbol | Inclusive Cost | Exclusive Cost |
| --- | --- | --- |
| main | 60 | 5 |
| j | 40 | 10 |
| f | 35 | 5 |
| g | 20 | 0 |
| x | 15 | 15 |
| l | 10 | 5 |
| y | 10 | 10 |
| z | 10 | 10 |

Note k is gone.

### Summary of Inclusive times for example 6, in which x gets faster and k is removed

| Symbol | Inclusive Cost | Exclusive Cost |
| --- | --- | --- |
| main | 57 | 5 |
| j | 38 | 10 |
| f | 33 | 5 |
| g | 19 | 0 |
| x | 12 | 12 |
| y | 10 | 10 |
| z | 10 | 10 |
| l | 9 | 5 |

Note k is gone here too.

### Overweight Report

Before: example 6, baseline with k removed
After: example 6, in which x gets faster and k is removed

Before Time: 60
After Time: 57
Overall Delta: -3

Analysis:

| Name | Base Cost | Test Cost | Delta | Responsibility % | Overweight % |
| --- | --- | --- | --- | --- | --- |
| x | 15.0 | 12.0 | -3.0 | 100.00 | 400.00 |
| l | 10.0 | 9.0 | -1.0 | 33.33 | 200.00 |
| f | 35.0 | 33.0 | -2.0 | 66.67 | 114.29 |
| g | 20.0 | 19.0 | -1.0 | 33.33 | 100.00 |
| j | 40.0 | 38.0 | -2.0 | 66.67 | 100.00 |
| main | 60.0 | 57.0 | -3.0 | 100.00 | 100.00 |
| y | 10.0 | 10.0 | 0.0 | 0.00 | 0.00 |
| z | 10.0 | 10.0 | 0.0 | 0.00 | 0.00 |

Overweight analysis leaves no doubt that x is responsible for the gains.


1. OmariO says:

Do you automate this analysis somehow?

2. ricom says:

There isn't much to do other than post-process the output of a typical sample profiler.  The formula is so simple you can just paste your results into Excel and then sort.

I do have something that does the recursion but for the most part I don't even use it.  Usually one iteration tells me all I need to know.  You can see the relevant formulas in the code sample I included.  But they're so simple.  The idea is the interesting bit.

3. Kumar says:

The ScalaMeter tool for Scala can automate performance regression tests. I wonder if with a plugin it could do this analysis automatically.

4. ricom says:

I don't see why not.  It's a really simple technique.

5. Jaans says:

This should be included in Visual Studio's Performance Analysis tool. Created UserVoice post here: