Back in 1996, I was a QA lead for the C++ compiler, and our group wanted to incent people to fix and close bugs.
One of the other QA leads had the brilliant insight that lego blocks make excellent currency amongst development teams, and I – because of my demonstrated aptitude in generating reports from our bug-tracking system – became the “Lego Sheriff” for the group, handing out blocks. I believe the going rate was three blocks per bug.
Not surprisingly, some people started to game the system, to increase the number of blocks. Those of you who are surprised that somebody would go to extra effort to get blocks that retail at about a penny per block have never seen a millionaire fight for to get a free $10 T Shirt.
But I digress.
That there was a system to game was due to a very simple fact. Our goal wasn’t really to get people to fix and close bugs, our goal was to get the product closer to shipping. But we didn’t have a good way to measure the individual contribution to that, so we choose active and resolved bug counts as a surrogate measure – a measure that (we hoped) was well correlated with the actual measure.
This was a pretty harmless example, but I’ve seen lots of them in my time at Microsoft.
The first one I encountered was “bugs per tester per week”. A lead in charge of testing part of the UI of visual studio ranked his reports on the number of bugs they entered per week, and if you didn’t have at least <n> (where <n> was something like 3 or 5), you were told that you had to do better.
You’ve probably figured out what happened. Nobody ever dropped below the level of <x> bugs per week, and the lead was happy that his team was working well.
The reality of the situation was that the testers were spending time looking for trivial bugs to keep their counts high, rather than digging for the harder-to-find but more important bugs that were in there. They were also keeping a few bugs “in the queue” by writing them down but not entering them, so they could make sure they hit their limit.
Both of those behaviors had a negative impact, but the lead liked the system, so it stayed.
Another time I hit this was when we were starting the community effort in devdiv. We were tracked for a couple of months for things like “newsgroup post age”, “number of unanswered posts”, or “number of posts replied to by person <x>”.
Those are horrible measures. Some newsgroups have tons of off-topic messages that you wouldn’t want to answer. Some have great MVPs working them that answer so fast you can’t a lot to say. Some have low traffic so there really aren’t that many issues to address.
Luckily, sharper heads prevailed, and we stopped collecting that data. The sad part is that this is one situation where you *can* measure the real measure directly – if you have a customer interaction, you can *ask* the customer at the end of the interaction how it went. You don’t *need* a surrogate.
I’ve also seen this applied to blogging. Things like number of hits, number of comments, things like that. Just today somebody on our internal bloggers alias was asking for ways to measure “the goodness” of blogs.
But there aren’t any. Good blogs are good blogs because people like to read them – they find utility in them.
After this most recent incident of this phenomena presented itself, I was musing over why this is such a common problem at Microsoft. And I rememberd SMART.
SMART is the acronym that you use to remember the measures that tell you that you’ve come up with a good measure. M means measurable (at least for the purposes of this post. I might be wrong, and in fact I’ve forgotten what all the other letters mean, though I think T might mean Timely. Or perhaps Terrible…).
So, if you’re going to have a “SMART goal”, it needs to be *measurable*, regardless of whether what you’re trying to do is measurable.
So, what happens is you pick a surrogate, and that’s what you measure. And, in a lot of cases, you forget that it’s a surrogate and people start managing to the surrogate, and you get the result that you deserve rather than the one you want.
If you can measure something for real, that’s great. If you have to use a surrogate, try to be very up-front about it, track how well it’s working, don’t compare people with it, and please, please, please, don’t base their review on it.