Tracking changes over time

I got some interesting questions from a colleague over the long weekend which I thought were of general interest:

We don’t have industry-standard benchmarks for a lot of scenarios, so here is how I want to sign off on the performance of any build:

If (we are measuring for the first time)
  Set the Benchmark to the achieved number
Else if (the current number is better than the benchmark)
  Set the Benchmark to the new number
Else if (the current number has degraded from the benchmark)
  Build didn’t meet the expected Performance.

My questions are:

  1. Is the above logic reasonable to make sure the product performance doesn’t degrade?
  2. Our product is 90% managed code and 10% unmanaged. I want to collect some CLR counters (memory-related and interop-related, for example) to find any anomalous situations. Can I also compare such numbers across builds with the above logic? For example, if I collect a counter such as “Number of objects promoted from Gen0 to Gen1”, this number shouldn’t deviate hugely across builds.
  3. Are there any general guidelines on the CPU and memory footprint of a Windows Service?

Here’s what I said:

Getting guidelines for your service is really about understanding your customers.

Ask yourself these questions:

  • What will your customer be doing with your service?
  • How will she deploy it?
  • What class of machine will it be running on?
  • What other applications or services need to be running side by side?

Then, based on some typical customer deployments and machine types, budget how much space you can use under those circumstances, and measure in simulated scenarios like those.
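To make the budgeting idea concrete, here is a minimal sketch of checking one measurement against a per-scenario budget. Everything in it is an assumption for illustration: the `Budget` type, the `over_budget` helper, the scenario name, and the numbers are hypothetical placeholders, not recommendations. Your real budgets have to come from your own customers' deployments.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    """Resource budget for one simulated customer scenario (hypothetical)."""
    scenario: str
    max_working_set_mb: float
    max_cpu_percent: float


def over_budget(budget, working_set_mb, cpu_percent):
    """Return a list of budget violations; an empty list means within budget."""
    problems = []
    if working_set_mb > budget.max_working_set_mb:
        problems.append(
            f"{budget.scenario}: working set {working_set_mb} MB "
            f"exceeds budget of {budget.max_working_set_mb} MB"
        )
    if cpu_percent > budget.max_cpu_percent:
        problems.append(
            f"{budget.scenario}: CPU {cpu_percent}% "
            f"exceeds budget of {budget.max_cpu_percent}%"
        )
    return problems


# Placeholder numbers only; derive real ones from typical deployments.
shared_server = Budget("shared small-office server", 150.0, 5.0)
for line in over_budget(shared_server, working_set_mb=200.0, cpu_percent=3.0):
    print(line)
```

The point of keeping one budget per scenario is that "how much is too much" depends on the machine class and on what else has to run side by side, so a single global limit tells you very little.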

As for what you propose for regression tracking, it’s still a bit too simple to work. Measurements vary from build to build, so it’s impossible to get consistent, exact results. You must consider the trailing average and the normal variance in the measurement of each benchmark. Any change outside that normal range tends to indicate a problem, improvements included: when the benchmark improves, as often as not it’s because there’s now a bug that’s causing it to not do all the work. Only when things are verified as working properly can you change the acceptable range.
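Here is a rough sketch of what a check based on the trailing average and normal variance might look like. It is only one way to do it, under stated assumptions: lower numbers are better (say, elapsed milliseconds), the history contains only results already verified as correct, and the window size, the multiplier `k`, and the relative floor are arbitrary starting points that you would tune per benchmark. The function name and return labels are made up for this example.

```python
from statistics import mean, stdev


def check_benchmark(history, current, window=10, min_samples=5,
                    k=3.0, rel_floor=0.01):
    """Classify `current` against the trailing window of verified results.

    Assumes lower is better. Returns one of:
    "insufficient-data", "ok", "regressed", "suspicious-improvement".
    """
    recent = history[-window:]
    if len(recent) < min_samples:
        # Not enough experience with this benchmark to judge yet.
        return "insufficient-data"

    center = mean(recent)   # trailing average
    spread = stdev(recent)  # normal variation (sample standard deviation)

    # Allow k standard deviations, but never less than a small fraction
    # of the average, so a very quiet history does not flag every wiggle.
    tolerance = max(k * spread, rel_floor * abs(center))

    if current > center + tolerance:
        return "regressed"
    if current < center - tolerance:
        # Improvements are flagged too: the build may be skipping work.
        return "suspicious-improvement"
    return "ok"


history_ms = [100, 102, 98, 101, 99, 100, 103, 97]
for sample in (101, 110, 85):
    print(sample, check_benchmark(history_ms, sample))
```

Note that the function never updates the history itself. A new number should be added to the history only after someone has confirmed the build is actually doing all the work, which is how the acceptable range moves.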

Get some experience with your benchmarks: track them and watch them over time. When you have a suitable model for each benchmark, you can put statistical controls in place.

Sadly this is not easy stuff.

Any time I see groups working on tracking their changes carefully it makes me very happy 🙂

Comments (1)

  1. One thing missing is a gauge of the original benchmark quality. If you compare against similar products/software, you can see if you are in the right benchmark for memory and CPU usage. If you are way off, the design is probably poor and needs to be rethought.