Highly Distributed Insights

Why would you want to use a "Big Data" solution? It's a question that we've been trying to answer in the first chapter of our forthcoming p&p guide to Windows Azure HDInsight. For a long while, everything we found on the web and in the original HDInsight docs gave only the volume and unstructured nature of the data as the justification.

Meanwhile the docs and presentations from Hortonworks (who created the Hadoop implementation behind HDInsight), and existing books about Big Data, all have comparison tables for a relational database and Hadoop/HDFS. And all of these concentrate on the differences in data volume, disk write speed, handling unstructured data formats, the point at which a schema is applied to the data, and how query processing is distributed.

OK, so this is valid and useful information, but it all comes back to the suggestion that you should choose a Big Data solution such as HDInsight only when a relational database just can't cut it. You know how it is - your boss just sent you 20 Petabytes of data as text strings and he's happy to wait till next Tuesday for an answer to his question. This probably happens to lots of people every week.

But when you really start to think about it, and talk to people who actually know about enterprise databases, data warehouse systems, and BI (Hi, Graeme), you discover that almost anything real people want to do is most likely perfectly possible using the kit that's already running in their datacenter. SQL Server Parallel Data Warehouse (PDW) can cope with Terabytes of data, which realistically is all that most people will have. Let's face it, the total size of the census data for the whole of the U.S. is only a few hundred GB.

And the fact that Hadoop "moves the processing to the data instead of moving the data to the database engine" doesn't mean that much if the database has fibre connections to the data store. Yes, the fact that Hadoop does it as distributed parallel tasks might speed up queries, but PDW is built to do that, and there's typically no shortage of cores in modern processors. I'll accept that you need some serious hardware for SQL Server and PDW if you are doing anything more than pretending to be a DBA, but most organizations that already do proper BI will pretty much have that hardware in place.
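
If the "move the processing to the data" idea sounds a bit abstract, here's a toy sketch in Python. To be clear, this is not Hadoop or HDInsight code: the log lines are made up, the function names (map_partition, merge_counts) are my own, and the "nodes" are just lists sitting in one process. The only point is the shape of it: each chunk of data gets counted where it lives, and only the small partial results are brought together.

```python
from collections import Counter
from functools import reduce

# Hypothetical data: each inner list stands in for a block of a file
# stored on a different node. In a real cluster these would not be in
# one process, or even on one machine.
partitions = [
    ["error disk full", "info job started"],
    ["error network down", "info job finished"],
    ["warn slow query", "error disk full"],
]

def map_partition(lines):
    """Count the first word (the log level) of each line in one partition."""
    return Counter(line.split()[0] for line in lines)

def merge_counts(left, right):
    """Combine two partial results; only these small summaries need to travel."""
    return left + right

# The "map" step runs once per partition, next to the data it reads.
partial_counts = [map_partition(p) for p in partitions]

# The "reduce" step merges the partial results into one answer.
totals = reduce(merge_counts, partial_counts, Counter())
print(dict(totals))
```

In this sketch the raw lines never leave their partition; what gets combined is three tiny Counter objects. That's the bit the comparison tables are getting at, and also why it matters less when the database already has a very fast path to its storage.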

So, like many people, I was starting to wonder if this Big Data thing was just another fad that would be gone in a year's time. Is it actually "Big Hype", or even an updated version of the famous old IBM misquote, "I think there is a world market for maybe five Big Data systems"?

That is, until I watched a presentation by Microsoft Technical Fellow Dave Campbell. The point he made is that it's not really about any of the comparisons related to volume, or structure, or parallel processing, or distributed storage. These are just technical details. What it's really about is getting insights from data. Which is probably why they called it HDInsight.

When your mind is wandering while you're in the shower or eating your cornflakes, and you suddenly get hit by inspiration about some question you might be able to answer by querying all that data you keep collecting, you've discovered what Big Data is all about. If you need to go and see your data architect, DBA, or data steward to implement your inspiration, they'll tell you it will take two weeks to design the query, a week to update the data models, three days to cleanse and validate the source data, and a day to set up the report. And it's perfectly possible that, after all this, the report won't show anything useful. Or it will show that you should have asked a different question.

If you are a Douglas Adams fan, no doubt you are already mumbling "Deep Thought" to yourself.

In the presentation Dave talked about how you simply fire up a new cluster in HDInsight, load the data, and ask the question. If the answer is useful, then you know what query you need to get your DBA to create in your data warehouse. If the answer isn't useful, but suggests that a different question might be interesting, then you go ahead and ask that question. And if there are no questions that provide useful answers, then the only thing you've lost is a few hours of your time and the hourly cost of the "pay only for what you use" Windows Azure HDInsight cluster.

And it's fairly safe to assume there are more than five organizations in the world that would find this useful...