Pushing Data from a Hortonworks Cluster to an Azure HDInsight Cluster

I have a scenario where a customer wishes to explore a move from an existing Hortonworks (HDP) cluster to an Azure HDInsight (HDI) cluster. The customer is interested in the lower administrative overhead of HDInsight’s Platform-as-a-Service offering as well as the ability to scale cluster resources out and back to match demand, something that’s challenging to do with the current architecture.

The first step is to move a subset of the HDP cluster’s data to the HDI environment to enable testing. Assuming no individual file I wish to transfer exceeds 195 GB (the size limit for a single block blob at the time of writing), I can move this data into an Azure Storage account ahead of any HDI cluster provisioning. Once I’m ready for the HDI cluster, it’s a simple matter to provision the cluster against the Azure Storage account (which HDI will then interpret as part of its file system).

To get started, I provision an Azure (Resource Manager) Storage account in the Azure region (location) within which I intend to deploy the eventual HDI cluster. Once provisioned, I create a container within the Storage account to serve as the destination for the files from the HDP cluster. I record the Storage account and container names as well as one of the Account keys.

NOTE If you use the Azure Portal to create your Storage account, you can create a container by clicking on the Blobs item on the default Storage Account tile. In the resulting page, click the +Add item and give the container a name. Leave the container configured as Private unless you have a reason to do otherwise.
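
For anyone who prefers scripting to the Portal, the same provisioning can be sketched with the Azure CLI. This is a rough outline rather than the method used above: the account, resource group, region and container names are hypothetical placeholders, the resource group is assumed to exist already, and the small run helper (my own addition) only prints each command unless DRY_RUN is set to 0.

```shell
# Hypothetical names (assumptions); replace with your own values.
ACCOUNT="mystorageaccount"
RESOURCE_GROUP="myresourcegroup"   # assumed to exist already
LOCATION="eastus"                  # region planned for the eventual HDI cluster
CONTAINER="mycontainer"

# With DRY_RUN=1 (the default here) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Create the Storage account, then list its keys so one can be recorded.
provision_storage() {
  run az storage account create --name "$ACCOUNT" \
      --resource-group "$RESOURCE_GROUP" --location "$LOCATION" --sku Standard_LRS
  run az storage account keys list --account-name "$ACCOUNT" \
      --resource-group "$RESOURCE_GROUP" --output table
}

# Create the destination container; $1 is one of the Account keys listed above.
create_container() {
  run az storage container create --name "$CONTAINER" \
      --account-name "$ACCOUNT" --account-key "$1"
}

# Usage:
#   provision_storage
#   create_container "<account key>"
```

Running the functions as-is only echoes the commands for review; setting DRY_RUN=0 executes them against the subscription the CLI is logged in to.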

With the target Storage account in place, I can now configure the HDP cluster to connect to it. The easiest way I have found to do this is using Ambari:

  1. Log in to the Ambari portal and navigate to the (default) dashboard
  2. Select HDFS from the left-hand navigation
  3. On the resulting HDFS page, click on the Configs tab
  4. On the resulting page, select the Advanced option
  5. Scroll down and expand the Custom core-site node
  6. Select Add Property… from the bottom of the expanded node
  7. Enter fs.azure.account.key.<account name>.blob.core.windows.net as the Name of the property, substituting the name of the Storage account for <account name>
  8. Enter the Account key as the Value of the property
  9. Click the Add button and verify the new property appears under the Custom core-site node
  10. Locate the notification bar at the top of the current page
  11. Click the Save button on the notification bar to push changes to the cluster
  12. Follow any remaining prompts to complete the Save process

Once the Save process is completed, Ambari will indicate that a restart of some services is required. Click the Restart button and select Restart All Affected from the resulting drop-down. Follow any remaining prompts and monitor the process until it completes successfully.
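
After the restart, a quick check from a cluster node can confirm the property was pushed and the Storage account is reachable. The following is a minimal sketch, assuming the hypothetical account and container names shown and that the hdfs and hadoop clients on the node read the configuration Ambari just saved; the function names are my own.

```shell
# Hypothetical names (assumptions); replace with your own values.
ACCOUNT="mystorageaccount"
CONTAINER="mycontainer"

# Build the core-site property name that holds the Account key.
azure_key_property() {
  printf 'fs.azure.account.key.%s.blob.core.windows.net\n' "$1"
}

# Look up the property in the client configuration (output discarded so the
# key is not echoed to the terminal), then list the root of the container.
check_azure_access() {
  hdfs getconf -confKey "$(azure_key_property "$ACCOUNT")" >/dev/null || return 1
  hadoop fs -ls "wasbs://${CONTAINER}@${ACCOUNT}.blob.core.windows.net/"
}

# Usage (on a cluster node):
#   check_azure_access
```

An empty listing is expected for a freshly created container; an authentication or configuration error at this point usually means the property name or key value needs another look.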

Now that the HDP cluster is configured to talk to Azure Storage, the Storage account can be referenced using the WASB (unencrypted) or WASBS (encrypted) protocol. The syntax for referencing Azure Storage under these protocols is:

  • wasb://<container name>@<account name>.blob.core.windows.net/
  • wasbs://<container name>@<account name>.blob.core.windows.net/

where <container name> is the name of the container you previously created and <account name> is the name of the Storage account. More information on these protocols is found here.
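
To avoid typos when assembling these references by hand, the pattern can be wrapped in a small shell function. This is only an illustrative helper (wasb_uri is a name I made up, not part of Hadoop); it defaults to the encrypted wasbs scheme and accepts an optional path within the container.

```shell
# Build a WASB/WASBS URI.
#   $1 = container name
#   $2 = Storage account name
#   $3 = optional path inside the container
#   $4 = optional scheme, wasb or wasbs (defaults to wasbs)
wasb_uri() {
  container="$1"
  account="$2"
  path="${3:-}"
  scheme="${4:-wasbs}"
  path="${path#/}"   # drop one leading slash so the URI does not contain "//"
  printf '%s://%s@%s.blob.core.windows.net/%s\n' "$scheme" "$container" "$account" "$path"
}

# Usage:
#   wasb_uri mycontainer mystorageaccount /mydir/mysubdir/
#   prints wasbs://mycontainer@mystorageaccount.blob.core.windows.net/mydir/mysubdir/
```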

By opening an SSH connection to the name node of the HDP cluster, I can now use DistCp to copy files from HDFS to the Azure Storage account:

     hadoop distcp /mydir/mysubdir/ wasbs://mycontainer@mystorageaccount.blob.core.windows.net/mydir/mysubdir/
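
For larger or repeated transfers, a couple of standard DistCp options may be worth considering. The sketch below is one possible variation, not the command I ran above: -update copies only files that are missing or different at the destination (and syncs the contents of the source directory into the target directory), and -m caps the number of simultaneous map tasks. The default of 20 maps, the function name and the dry-run helper are assumptions of mine.

```shell
# With DRY_RUN=1 (the default here) the command is printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Copy an HDFS directory to Azure Storage.
#   $1 = source path in HDFS
#   $2 = destination wasb:// or wasbs:// URI
#   $3 = optional maximum number of map tasks (assumed default: 20)
copy_to_azure() {
  src="$1"
  dest="$2"
  maps="${3:-20}"
  run hadoop distcp -update -m "$maps" "$src" "$dest"
}

# Usage:
#   copy_to_azure /mydir/mysubdir/ \
#     wasbs://mycontainer@mystorageaccount.blob.core.windows.net/mydir/mysubdir/
```

Because -update skips files already present and unchanged at the destination, the same call can be re-run after an interrupted job without starting over.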

NOTE Should I need to transfer Hive or HBase tables, I can export these using their associated EXPORT commands – Hive EXPORT and HBase EXPORT. (For the HBase Export, scroll to item 131.9 in the referenced document.) HBase provides still more options, e.g. ExportSnapshot and CopyTable.
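
As a rough illustration of the export route, the helpers below build a Hive EXPORT statement and an HBase Export command line that write to a staging directory in HDFS, which could then be copied with DistCp like any other directory. The table names and staging paths are hypothetical, the helper names are mine, and I have not run these against the customer's cluster.

```shell
# Build a Hive EXPORT statement.
#   $1 = table name, $2 = HDFS directory to export into
hive_export_sql() {
  printf "EXPORT TABLE %s TO '%s';\n" "$1" "$2"
}

# Build the command line for the HBase MapReduce Export tool.
#   $1 = table name, $2 = HDFS output directory
hbase_export_cmd() {
  printf 'hbase org.apache.hadoop.hbase.mapreduce.Export %s %s\n' "$1" "$2"
}

# Usage (hypothetical table names and staging paths):
#   hive -e "$(hive_export_sql mytable /exports/hive/mytable)"
#   $(hbase_export_cmd myhbasetable /exports/hbase/myhbasetable)
```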