Azure Storage Vault in HDInsight: A Robust and Low Cost Storage Solution

UPDATE: Windows Azure HDInsight has shipped. Azure Storage Vault (ASV) is now just blob storage. Note the following changes in the syntax used to access data in blob storage:

  • The asv:// syntax is being deprecated in favor of wasb:// (WASB = Windows Azure Storage – BLOB service).
  • Both the asv:// and wasb:// syntaxes work for now, but support for the asv:// syntax will be removed at some point in the future.

Also, the old console shown in the original post has been replaced. For the latest documentation on HDInsight, see https://www.windowsazure.com/en-us/documentation/services/hdinsight/

HDInsight aims to provide the best of two worlds in how it manages its data.

Azure Storage Vault (ASV) and the Hadoop Distributed File System (HDFS)
implemented by HDInsight on Azure are distinct file systems that are optimized,
respectively, for the storage of data and for computations on that data.

  • ASV is Windows Azure Blob Storage exposed as a file system that HDInsight can read from and write to. This provides a highly scalable, highly available, low-cost, long-term, and shareable storage option for data that is to be processed using HDInsight.
  • The Hadoop clusters deployed by HDInsight on HDFS are optimized for running Map/Reduce (M/R) computational tasks on the data.

NOTE: The ASV term currently used by HDInsight is being deprecated in favor of WASB (Windows Azure Storage - BLOB service). Both are backed by blob storage, so this change will not affect existing data stored in ASV; only the access syntax changes. Where you currently use asv:// to access data, going forward you will use wasb://. Both currently work, but at some point in the future the asv:// syntax will be removed.
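
For scripts that still hold asv:// paths, the scheme change described in the note can be sketched in Python as follows. This is a minimal illustration, not part of HDInsight: the helper name asv_to_wasb is hypothetical, and it assumes that only the scheme prefix needs to change, as the note states. Any further qualification of the URI that a given release may expect is out of scope here.

```python
def asv_to_wasb(uri: str) -> str:
    """Rewrite a leading asv:// scheme to wasb://.

    Hypothetical helper for illustration only. URIs that do not start
    with asv:// (compared case-insensitively) are returned unchanged.
    """
    old_prefix = "asv://"
    new_prefix = "wasb://"
    if uri[:len(old_prefix)].lower() == old_prefix:
        # Keep everything after the scheme (container and path) as-is.
        return new_prefix + uri[len(old_prefix):]
    return uri


if __name__ == "__main__":
    # The paths below are the ones used as examples in this post.
    for path in ("asv://hadoop/", "asv://hadoop2/data", "outputfile"):
        print(path, "->", asv_to_wasb(path))
```

Relative paths such as outputfile pass through untouched, so the helper can be applied to every argument of a job without special-casing.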

HDInsight clusters are deployed in Azure on compute nodes to execute M/R
tasks and are dropped once these tasks have been completed. Keeping the data in
the HDFS clusters after computations have been completed would be an expensive
way to store this data. ASV provides a full-featured HDFS file system over
Azure Blob storage (ABS). ABS is a robust, general purpose Azure storage
solution, so storing data in ABS enables the clusters used for computation to
be safely deleted without losing user data. ASV is not only low cost; it has
also been designed as an HDFS extension that provides a seamless experience to
customers by enabling the full set of components in the Hadoop ecosystem to
operate directly on the data it manages.

In the upcoming release of HDInsight on Azure, ASV will be
the default file system. In the current developer preview on www.hadooponazure.com, data stored in
ASV can be accessed directly from the Interactive JavaScript Console by
prefixing the URI of the assets you are accessing with the asv:// protocol
scheme.

NOTE: The ASV:// syntax is being deprecated for wasb://
(WASB = Windows Azure Storage – BLOB service)

To use this feature in the current release, you will need
HDInsight and Windows Azure Blob Storage accounts. To access your storage
account from HDInsight, go to the Cluster and click on the Manage Cluster tile.

Click on the Set up ASV button.

Enter the credentials (Name and Passkey) for your Windows Azure Blob Storage account.

Then return to the Cluster and click on the Interactive Console tile to access the JavaScript console.

Now, to run the Hadoop wordcount job on data in an ASV container named hadoop, use:
hadoop jar hadoop-examples-1.1.0-SNAPSHOT.jar wordcount asv://hadoop/ outputfile

The scheme for accessing data in ASV is asv://container/path
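
To make the asv://container/path layout concrete, here is a small Python sketch that splits such a URI into its container and path parts. It relies only on the standard library's urllib.parse.urlsplit; the function name split_asv_uri is hypothetical, and the sketch assumes the plain container name used in this post's examples.

```python
from urllib.parse import urlsplit


def split_asv_uri(uri):
    """Split asv://container/path (or wasb://...) into (container, path).

    Hypothetical helper for illustration only. It assumes the part right
    after the scheme is a plain container name, as in this post.
    """
    parts = urlsplit(uri)
    # urlsplit lowercases the scheme, so ASV:// and asv:// both match.
    if parts.scheme not in ("asv", "wasb"):
        raise ValueError("not an asv:// or wasb:// URI: %r" % uri)
    if not parts.netloc:
        raise ValueError("missing container name in %r" % uri)
    # Drop the leading slash so the path is relative to the container.
    return parts.netloc, parts.path.lstrip("/")


if __name__ == "__main__":
    print(split_asv_uri("asv://hadoop2/data"))
    print(split_asv_uri("asv://hadoop/"))
```

A URI that names only the container, such as asv://hadoop/, yields an empty path, which corresponds to the root of that container.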

To see the data in ASV, use:
#cat asv://hadoop2/data