Some years ago I was forcefully introduced to the concept of statistical quality control, where the overall quality of a batch of items could be determined from an examination of a small sample. This came to mind as I've been immersed in watching demos of the new "Big Data" techniques for analyzing data.
My rude introduction to the topic came about many years ago in a different industry from IT. I was summoned to appear at a large manufacturer in York, England, to look at a delivery my company had made of glass divider panels for railway carriages. The goods inward store manager bluntly informed me that they were rejecting a delivery of 500 panels because they did not meet quality control standards.
"OK", I said, "show me some of the faults". However, it turned out that I was taking a simplistic view of the quality control process. I assumed that they unpacked them all, examined each one as they were about to fit it, and rejected any that were damaged or out of specification. What they actually did is look up the total quantity on a chart, which tells them how many to test. In my case it was 32. So they choose 32 at random and examined these. If more than one fails the test, they reject the whole batch because, statistically, there will be more than the acceptable number of defect items in the batch.
This struck me as odd because I knew that most of the batch would be perfect, some would be perhaps a little less than perfect (glass does, of course, get scratched easily), and only a few would be too bad to use. Our usual approach would be to simply replace any they found to be faulty as and when they came to fit them.
However, as the quality control manager patiently explained, this approach might work when you are installing windows in a house but isn't practical in most cases in manufacturing. If you get a delivery of 100,000 nuts and bolts, you can't examine them all - you just need to know that the number of faulty ones is below your preset acceptance level (perhaps 1%), and you simply throw away any faulty ones because it's not worth the hassle of getting replacements.
Of course, you won't find that exactly one in every 100 is faulty and the other 99 are perfect. You might find a whole box of faulty ones in the batch, or that half of the batch are faulty and by chance you just happened to have tested the good ones. It's all down to averages and random selection of the samples. What worried me as I watched the demos of data analysis with Hadoop-based methods was how easy it would be to mistakenly rely on numbers that are, statistically, really only averages or trends.
For example, one demo used the AdventureWorks sample data to calculate the number of bicycles sold in each zip code area and then mapped this to a dataset obtained from Windows DataMarket containing the average ages of people in each zip code area. The conclusion drawn was that in one specific area people aged 50 to 60 were most likely to buy a bicycle, so the next advertising campaign for AdventureWorks in that area should be aimed at the older generation of consumers.
I did some back-of-an-envelope calculations for our street and I reckon that the average age is somewhere around the 45 to 55 mark. Yet the only people I see riding a bicycle are the couple across the road who are in their 30s, a lady probably in her late 20s who lives at the other end of the street, and lots of young children. I rather doubt that an advert showing two gray-haired pensioners enjoying the freedom of the outdoors by cycling through beautiful countryside on their new pannier-equipped sit-up-and-beg bicycles would actually increase sales around here. Though perhaps one showing grandparents giving their grandkids flashy new racing bikes for Christmas would work?
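The gap between an area's average age and the age of the people who actually ride can be shown with a few lines of Python. The residents listed here are entirely invented to resemble the street described above. They are not real data and have nothing to do with the AdventureWorks or DataMarket datasets.

```python
from statistics import mean

# Hypothetical street, invented for illustration: (age, rides_a_bicycle).
residents = [
    (8, True), (10, True), (28, True), (34, True), (36, True),
    (52, False), (55, False), (58, False), (61, False), (63, False),
    (66, False), (70, False), (72, False),
]

# The figure an area-level dataset would report for this street.
area_average_age = mean(age for age, _ in residents)

# The figure that matters for selling bicycles: the age of those who ride.
cyclist_average_age = mean(age for age, rides in residents if rides)

print(f"Average age of everyone on the street: {area_average_age:.1f}")
print(f"Average age of the cyclists:           {cyclist_average_age:.1f}")
```

Joining sales counts to an area's average age only tells you about the area as a whole. It says nothing about which residents did the buying, and answering that would need data on the individual customers.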
Maybe "Big Data", Hadoop, and HDInsight do give us new ways to analyze the vast tracts of data that we're all collecting these days. But the worrying question is whether, without applying some deep knowledge of statistical analysis techniques, we will actually get valid answers.