Hadoop for .NET Developers: Understanding Azure Vault Storage

NOTE: This post is one in a series on Hadoop for .NET Developers.

So far, my explanation of Hadoop storage in this blog series has focused on HDFS. However, Hadoop abstracts its file system layer so that alternative storage options can be employed. With HDInsight in Azure, Azure Blob Storage is used as the underlying storage layer. The abstraction that allows Azure Blob Storage to serve as Hadoop’s storage layer is referred to as Azure Vault Storage (AVS).
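To make the abstraction a little more concrete, here is a minimal sketch of how a blob might be addressed from Hadoop. It assumes the `wasb://<container>@<account>.blob.core.windows.net/<path>` URI form described in the HDInsight documentation; that scheme is not covered in this post, and your cluster version may use a different one. The helper `blob_uri` and the account, container, and path names are hypothetical and exist only for illustration.

```python
from urllib.parse import quote


def blob_uri(account, container, path, scheme="wasb"):
    """Build a Hadoop-style URI for a blob (hypothetical helper).

    Assumption: the cluster accepts URIs of the form
    <scheme>://<container>@<account>.blob.core.windows.net/<path>.
    """
    if not account or not container:
        raise ValueError("account and container are required")
    # Drop any leading slash and percent-encode unsafe characters,
    # leaving the "/" separators between path segments intact.
    clean_path = quote(path.lstrip("/"))
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{clean_path}"


# Illustrative names only; substitute your own storage account and container.
example = blob_uri("mystorage", "data", "/logs/2013/part-00000")
print(example)
```

Because the storage location is just a URI handed to the file system layer, the same job can in principle point at HDFS or at Azure Blob Storage without changes to its logic.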

Azure Blob Storage is a general-purpose storage platform for cloud-based applications. Having HDInsight use this storage lets Hadoop tap into an existing, robust storage platform (with its own replication mechanisms) and gives customers alternative ways to work with their data. For example, you might load data into Azure Blob Storage, spin up an HDInsight cluster on it to process the data, spin down the cluster while leaving the storage in place, and then access the files in Azure Blob Storage from a custom application.

The core concept behind Hadoop, however, is to bring the workload to the data, and with AVS the data no longer resides on the data nodes. While this adds some networking overhead to portions of the data processing pipeline, it actually lowers overhead in other portions. To further accelerate interaction with AVS, a high-speed network is implemented between the Azure Blob Storage infrastructure and that of the HDInsight service. The net effect is that AVS performs as well as HDFS in the vast majority of cases and even better in some others. The trade-off is that the HDInsight cluster must be provisioned in the same Azure data center as the Azure Blob Storage account so that the high-speed network can be employed.