Make sure your test data is right for the job

I helped out one of my colleagues recently whose customer's application hit high CPU in one test environment but not another. The reason was that, contrary to what they thought, the test data used in one environment was not the same as in the other. Simple problem, easy mistake, easy solution. But that was not the interesting aspect. The interesting bit was the data that caused the high CPU.

To troubleshoot the high CPU we got some memory dumps, which are always a good, quick-and-dirty way to get insight into what is going on inside any process. Sure enough, there were a number of threads whose user-mode time was increasing significantly between successive dumps, and they all seemed to be doing similar things. The function names gave away that they were doing some kind of spell checking of email. A bit of poking around revealed the text that was being spell checked at the time:
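(As an aside: if you can watch the process live rather than post-mortem, the same "diff the user-mode time per thread" trick is easy to script. Here is a minimal Python sketch using psutil; the PID and sampling interval are placeholders, not anything from the actual case.)

    import time
    import psutil

    def busiest_threads(pid, interval=10.0, top=5):
        # Sample per-thread user-mode CPU time twice, "interval" seconds
        # apart, and return the threads whose user time grew the most.
        proc = psutil.Process(pid)
        before = {t.id: t.user_time for t in proc.threads()}
        time.sleep(interval)
        after = {t.id: t.user_time for t in proc.threads()}
        growth = {tid: after[tid] - before.get(tid, 0.0) for tid in after}
        return sorted(growth.items(), key=lambda kv: kv[1], reverse=True)[:top]

    if __name__ == "__main__":
        for tid, secs in busiest_threads(1234):   # 1234 is a placeholder PID
            print("thread %d gained %.1fs of user time" % (tid, secs))

The threads that keep floating to the top of that list are the ones worth a closer look in the debugger.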

"Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor..."

Yes, it was the classic Lorem Ipsum text so fondly used by software testers and typographers everywhere.

So was there a lot of data? No. About 2,500 words, and it seemed to have taken about 45 seconds by the time we took the second dump, which works out to roughly 18 ms per word. Is this unreasonable? Probably not. Think about it. You are running test data through an English-language spell checker and not a single word of it is English. So every single word forces a scan through the entire dictionary before we move on to the next word. It's not surprising the CPU was getting thrashed.
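To make that concrete, here is a deliberately naive sketch (not the customer's actual code) of a spell checker backed by a flat word list. An English word usually exits the loop early; a word that matches nothing always pays for the full scan:

    # Imagine ~100,000 entries rather than three.
    dictionary = ["aardvark", "apple", "zebra"]

    def is_spelled_correctly(word, words=dictionary):
        for entry in words:        # linear scan: cost grows with dictionary size
            if entry == word:
                return True        # a real English word tends to bail out early
        return False               # "lorem", "ipsum", ... always scan everything

    text = "lorem ipsum dolor sit amet".split()
    misses = [w for w in text if not is_spelled_correctly(w)]

The real dictionary lookup was presumably cleverer than a flat list, but the principle is the same: input where nothing ever matches pushes whatever data structure is in use towards its worst case on every single word.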

This data might be appropriate for testing something like the graphical elements of a web page design, or the transport elements of a communications system, but it's definitely the wrong data for testing a spell checker (unless of course you are specifically trying to measure what happens when no words match!).

Bye for now

Doug