Access Azure Blob Stores from HDInsight

Small Bites of Big Data

Edit Mar 6, 2014: This is no longer necessary for HDInsight - you specify the storage accounts when you create the cluster and the rest happens auto-magically. See https://blogs.msdn.com/b/cindygross/archive/2013/11/25/your-first-hdinsight-cluster-step-by-step.aspx or https://blogs.msdn.com/b/cindygross/archive/2013/12/06/sample-powershell-script-hdinsight-custom-create.aspx.

One of the great enhancements in Microsoft's HDInsight distribution of Hadoop is the ability to store and access Hadoop data on an Azure Blob Store. We do this via the HDFS API extension called Azure Storage Vault (ASV). This allows you to persist data even after you spin down an HDInsight cluster and to make that data available across multiple programs or clusters from persistent storage. Blob stores can be replicated for redundancy and are highly available. When you need to access the data from Hadoop you point your cluster at the existing data and the data persists even after the cluster is spun down.

Azure Blob Storage

Let's start with how your data is stored. A storage account is created in the Azure portal and has access keys associated with it. All access to your Azure blob data is done via storage accounts. Within a storage account you need to create at least one container, though you can have many. Files (blobs) are put in the container(s). For more information on how to create and use storage accounts and containers see: https://www.windowsazure.com/en-us/develop/net/how-to-guides/blob-storage/. Any storage accounts associated with HDInsight should be in the same data center as the cluster and must not be in an affinity group.

You can create a container from the Azure portal or from any of the many Azure storage utilities available such as CloudXplorer. In the Azure portal you click on the Storage Account then go to the CONTAINERS tab. Next click on ADD CONTAINER at the very bottom of the screen. Enter a name for your container, choose the ACCESS property, and click on the checkmark.

AzureStorageContainers_thumb AzureStorageManageContainers_thumb AzureStorageNewContainer_thumb

HDInsight Service Preview

When you create your HDInsight Service cluster on Azure you associate your cluster with an existing Azure storage account in the same data center. In the current interface the QUICK CREATE doesn't allow you to choose a default container on that storage account so it creates a container with the same name as the cluster. If you choose CUSTOM CREATE you have the option to choose the default container from existing containers associated with the storage account you choose. This is all done in the Azure management portal: https://manage.windowsazure.com/

Quick Create: image

Custom Create: image

You can then add additional storage accounts to the cluster by updating C:\apps\dist\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml on the head node. This is only necessary if those additional accounts have private containers (this is a property set in the Azure portal for each container within a storage account). Public containers and public blobs can be accessed without the id/key being stored in the configuration file. You choose the public/private setting when you create the container and can later edit it in the "Edit container metadata" dialog on the Azure portal.

image StorageContainerEdit

The key storage properties in the default core-site.xml on HDInsight Service Preview are:

<property>
  <name>fs.default.name</name>
  <!-- cluster variant -->
  <value>asv://YOURDefaultContainer@YOURStorageAccount.blob.core.windows.net</value>
  <description>The name of the default file system.  Either the
literal string "local" or a host:port for NDFS.</description>
  <final>true</final>
</property>

<property>
  <name>dfs.namenode.rpc-address</name>
  <value>hdfs://namenodehost:9000</value>
</property>

<property>
  <name>fs.azure.account.key.YOURStorageAccount.blob.core.windows.net</name>
  <value>YOURActualStorageKeyValue</value>
</property>

To add another storage account you will need the Windows Azure storage account information from https://manage.windowsazure.com. Log in to your Azure subscription and pick storage from the left menu. Click on the account you want to use then at the very bottom click on the "MANAGE KEYS" button. Cut and paste the PRIMARY ACCESS KEY (you can use the secondary if you prefer) into the new property values we'll discuss below.

AzureStorageAccounts_thumb AzureStorageContainerKeys_thumb1 AzureStorageMgKeys_thumb1

Create a Remote Desktop (RDP) connection to the head node of your HDInsight Service cluster. You can do this by clicking on the CONNECT button at the bottom of the screen when your HDInsight Preview cluster is highlighted. You can choose to save the .RDP file and edit it before you connect (right click on the .RDP file in Explorer and choose Edit). You may want to enable access to your local drives from the head node via the "Local Resources" tab under the "More" button in the "Local devices and resources" section. Then go back to the General tab and save the settings. Connect to the head node (either choose Open after you click CONNECT or use the saved RDP).

image image

On the head node make a copy of C:\apps\dist\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml in case you have to revert back to the original. Next open core-site.xml in Notepad or your favorite editor.

Add your 2nd Azure storage account by adding another property.

<property>
  <name>fs.azure.account.key.YOUR_SECOND_StorageAccount.blob.core.windows.net</name>
  <value>YOUR_SECOND_ActualStorageKeyValue</value>
</property>

Save core-site.xml.

Repeat for each storage account you need to access from this cluster.

HDInsight Server Preview

If you have downloaded the on-premises HDInsight Server preview from https://microsoft.com/bigdata that gives you a single node "OneBox" install to test basic functionality. You can put it on your local machine, on a Hyper-V virtual machine, or in a Windows Azure IaaS virtual machine. You can also point this OneBox install to ASV. Using an IaaS VM in the same data center as your storage account will give you better performance, though the OneBox preview is meant purely for basic functional testing and not for high performance as it is limited to a single node. The steps are slightly different for on-premises as the installation directory and default properties in core-site.xml are different.

Make a backup copy of C:\Hadoop\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml from your local installation (local could be on a VM).

Edit core-site.xml:

1) Change the default file system from local HDFS to remote ASV

<property>
  <name>fs.default.name</name>
  <!-- cluster variant -->
  <value>hdfs://localhost:8020</value>
  <description>The name of the default file system.  Either the
literal string "local" or a host:port for NDFS.</description>
  <final>true</final>
</property>

to:

<property>
  <name>fs.default.name</name>
  <!-- cluster variant -->
  <value>asv://YOURDefaultContainer@YOURStorageAccount.blob.core.windows.net</value>
  <description>The name of the default file system.  Either the
literal string "local" or a host:port for NDFS.</description>
  <final>true</final>
</property>

2) Add the namenode property (do not change any values)

<property>
  <name>dfs.namenode.rpc-address</name>
  <value>hdfs://namenodehost:9000</value>
</property>

3) Add the information that associates the key value with your default storage account

<property>
  <name>fs.azure.account.key.YOURStorageAccount.blob.core.windows.net</name>
  <value>YOURActualStorageKeyValue</value>
</property>

4) Add any additional storage accounts you plan to access

<property>
  <name>fs.azure.account.key.YOUR_SECOND_StorageAccount.blob.core.windows.net</name>
  <value>YOUR_SECOND_ActualStorageKeyValue</value>
</property>

Save core-site.xml.

Files

Upload one or more files to your container(s). You can use many methods for loading the data including Hadoop file system commands such as copyFromLocal or put, 3rd party tools like CloudXPlorer, JavaScript, or whatever method you find fits your needs. For example, I can upload all files in a data directory (for simplicity this sample refers to C: which is local to the head node) using the Hadoop put command:

hadoop fs -put c:\data\ asv://data@sqlcatwomanblog.blob.core.windows.net/

Or upload a single file:

hadoop fs -put c:\data\bacon.txt asv://data@sqlcatwomanblog.blob.core.windows.net/bacon.txt

To view the files in a linked non-default container or a public container use this syntax from a Hadoop Command Line prompt (fs=file system, ls=list):

hadoop fs -ls asv://data@sqlcatwomanblog.blob.core.windows.net/

Found 1 items
-rwxrwxrwx   1        124 2013-04-24 20:12 /bacon.txt

In this case the container data on the private storage account sqlcatwomanblog has one file called bacon.txt.

For the default container the syntax does not require the container and account information. Since the default storage is ASV rather than HDFS (even for HDInsight Server in this case because we changed it in core-site.xml) you can even leave out the ASV reference.

hadoop fs -ls asv:///bacon.txt
hadoop fs -ls /bacon.txt

More Information

I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon.

Note: the Preview, CTP, and TAP programs are available for a limited time. Details of the usage and the availability of the pre-release versions may change rapidly.

Digg This