General Cluster's Probably-Not-Last Stand

I don't know if General Custer ever made a last stand against the Apache, but I feel like I have. My Apache is, of course, the Hadoop one. Or, to be technically accurate, Microsoft Azure HDInsight. And, going on experience so far, this is unlikely to actually be the last time I do it.

After spending six months of last year on it, and about the same this year, it seems like I've got stuck in some Big Data-related cluster of my own. We produced a guide for planning and implementing HDInsight solutions last year, but it's so far out of date now that we might as well have been writing about custard rather than clusters. However, we have finally managed to hit the streets with the updated version of the guide before HDInsight changes too much more (yes, I do suffer odd bouts of optimism).

What's become clear, however, is how different HDInsight is from a typical Hadoop deployment. Yes, it's Hadoop inside (the Hortonworks version), but that's like saying battleships and HTML are the same because they both have anchors. Or that cats and dogs are the same because they both have noses (you can probably see that I'm struggling for a metaphor here).

HDInsight stores all its data in Azure blob storage, which seems odd at first because the whole philosophy of Hadoop is distributed and replicated data storage. But when you come to examine the use cases and possibilities, all kinds of interesting opportunities appear. For example, you can kill off a cluster and leave the data in blob storage, then create a new cluster over the same data. If you specify a SQL Database instance to hold the metadata (the Hive and HCatalog definitions and other stuff) when you create the cluster, that metadata also survives after the cluster is deleted, and you can create a new cluster that uses the same metadata. Perhaps they should have called it Phoenix instead.
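To make that a little more concrete, here's a rough sketch (just a Python data structure, not a working deployment script) of the kind of settings a cluster-creation request ends up carrying: which blob storage account and container hold the data, and where the external Hive metastore lives. The property names follow the HDInsight template conventions as I understand them, and every account, server, and key value here is a placeholder.

```python
# Sketch only: the shape of the settings, not the exact API call.
# All account, server, database, and key values are placeholders.

cluster_properties = {
    "clusterDefinition": {
        "kind": "hadoop",
        "configurations": {
            # Point Hive at a SQL Database metastore so the table
            # definitions survive when the cluster is deleted.
            "hive-site": {
                "javax.jdo.option.ConnectionURL":
                    "jdbc:sqlserver://<metastore-server>.database.windows.net;"
                    "database=<metastore-db>;encrypt=true",
                "javax.jdo.option.ConnectionUserName": "<metastore-user>",
                "javax.jdo.option.ConnectionPassword": "<metastore-password>",
                "javax.jdo.option.ConnectionDriverName":
                    "com.microsoft.sqlserver.jdbc.SQLServerDriver",
            },
        },
    },
    "storageProfile": {
        "storageaccounts": [
            {
                # The default container holds the cluster's working files;
                # the data you want to keep lives alongside it in blob
                # storage and outlives the cluster.
                "name": "<account>.blob.core.windows.net",
                "isDefault": True,
                "container": "<existing-container>",
                "key": "<storage-key>",
            },
        ],
    },
}
```

Delete the cluster, keep the storage account and the metastore database, and a later cluster created with the same settings picks up exactly where the old one left off.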

We demonstrate just this scenario in our guide as a way to create an on-demand data warehouse that you can fire up when you need it, and shut down when you don't, to save running costs. And the nice thing is that you can still upload new data, or download the existing data, by accessing the Azure blob store directly. Of course, if you want to get the data out as Hive tables using ODBC you'll need to have the cluster running, but if you only need it once a month to run reports you can kill off the cluster in between.
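Here's a minimal sketch of that upload/download step using the azure-storage-blob Python package; the account name, key, container, and file names are all placeholders, so treat it as illustrative rather than copy-and-paste.

```python
# Push new data into (and pull results out of) the blob container
# while no cluster exists. All names and keys are placeholders.

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<account>.blob.core.windows.net",
    credential="<storage-key>",
)
container = service.get_container_client("<existing-container>")

# Upload this month's source data; the next cluster you create over
# this container will see it immediately.
with open("sales-2014-07.csv", "rb") as data:
    container.upload_blob(name="incoming/sales-2014-07.csv", data=data)

# Download previously generated results without spinning up a cluster.
with open("report.csv", "wb") as out:
    out.write(container.download_blob("results/report.csv").readall())
```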

But, more than that, you can use multiple storage accounts and containers to hold the data, and create a cluster over any combination of these. So you can have multiple versions of your data, and just fire up a cluster over the bits you want to process. Or have separate staging and production accounts for the data. Or create multiple containers and drip-feed data arriving as a stream into them, then create a cluster over some or all of them only when you need to process the data. Maybe use this technique to isolate different parts of the data from each other, or to separate the data into categories so that different users can access and query only the appropriate parts.
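As a sketch of what that looks like, the storage part of a cluster definition is just a list of accounts (one marked as the default), and inside the cluster each piece of data is addressed with a wasb URI that names both the container and the account. The account names, containers, and keys below are all made up.

```python
# Sketch of a storage profile spanning several accounts, so a cluster
# can be created over just the slices of data you want to process.
# All account names, containers, and keys are placeholders.

storage_profile = {
    "storageaccounts": [
        {   # Default account: cluster system files plus current data.
            "name": "mainstore.blob.core.windows.net",
            "isDefault": True,
            "container": "current",
            "key": "<main-key>",
        },
        {   # Additional account: for example a staging area that is
            # drip-fed incoming data, attached only when you want to
            # process it.
            "name": "stagingstore.blob.core.windows.net",
            "isDefault": False,
            "key": "<staging-key>",
        },
    ],
}

# Inside the cluster, data in any linked account is addressed with a
# wasb URI that names the container and the account, for example:
#   wasb://current@mainstore.blob.core.windows.net/sales/2014/07/
#   wasb://feed03@stagingstore.blob.core.windows.net/clicks/
```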

You can even fire up a cluster over somebody else's storage account as long as you have the storage account name and key, so you could offer a Big Data analysis service to your customers. They create a storage account, prepare and upload their source data, and - when they are ready - you process it and put the results back in their storage account. Maybe I just invented a new market sector! If you exploit it and make a fortune, feel free to send me a few million dollars...
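In storage terms the customer's account is nothing special: it's just another non-default entry in the list, using the key they hand you, and your jobs read from and write back to their containers through wasb paths. Again, every name here is a placeholder, not a recipe.

```python
# Sketch of linking a customer's storage account to your own cluster.
# The account name, key, and container names are placeholders.

customer_account = {
    "name": "customerstore.blob.core.windows.net",
    "isDefault": False,
    "key": "<key-supplied-by-customer>",
}

# Paths the processing jobs would use, addressed with wasb URIs:
SOURCE = "wasb://uploads@customerstore.blob.core.windows.net/raw/"
RESULTS = "wasb://results@customerstore.blob.core.windows.net/2014-07/"
```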

Read the guide at https://msdn.microsoft.com/en-us/library/dn749874.aspx