Hadoop for .NET Developers: Setting Up an Azure Cluster

NOTE This post is one in a series on Hadoop for .NET Developers.

With rapid provisioning and no long-term commitment, the cloud is an excellent place to try your hand with a multi-node Hadoop cluster. If you are an MSDN subscriber, your benefits include access to Microsoft's cloud services (as described here), which you can use to put a cloud-based Hadoop cluster in place. In Azure, Microsoft's cloud, Hadoop is delivered as a service named HDInsight.

Please note that consumption of cloud services incurs fees, and while an MSDN subscription covers these up to a pre-defined amount, you can exceed your allocation and possibly incur expense. For this series of posts, I will assume you are an MSDN subscriber and understand your benefits and liabilities as these relate to the consumption of Azure services. 

NOTE If you are not an MSDN subscriber, you can still gain access to Azure through a trial subscription. Details on this are here.

To get started, you first need to obtain access to the HDInsight service. As of the time of this writing, HDInsight is in preview and available upon request. To check whether or not you currently have access to HDInsight in Azure, do the following: 

  1. Navigate to the Azure homepage.
  2. Click PORTAL at the top of the page and log in (if necessary) using your MSDN subscriber account.
  3. Once the portal page renders, review the items on the left-hand side of the page for HDInsight.

If the HDInsight item is present, you have access. If it is not, you need to request access. To request access:

  1. Navigate to this page.
  2. Scroll to the HDInsight Service item on that page.
  3. Click the associated TRY IT link and follow the steps on that page to request access.

Please note that obtaining access can take a while, but once you have it, you can set up an HDInsight cluster. Steps for quickly creating a cluster are found here. I personally prefer to create a custom cluster, which requires me to:

  1. Click on the HDInsight icon on the left-hand side of the Azure portal page.
  2. Click the CREATE AN HDINSIGHT CLUSTER link to bring up the New Services dialog.
  3. Click on the CUSTOM CREATE icon to bring up the NEW HDINSIGHT CLUSTER dialog. This dialog allows you to define your cluster in more detail.
  4. In the CLUSTER NAME textbox, enter a unique name for your cluster. An icon on the right of the textbox will indicate whether the name you provided is available.
  5. In the DATA NODES textbox, enter the number of data nodes required for your cluster. I tend to use two data nodes for development purposes, but you can use as many or as few as you require. Please note that the number of nodes in the cluster has pricing implications.
  6. For our purposes, please leave the HDINSIGHT VERSION and REGION options at their defaults.
  7. Click the next icon to proceed.
  8. In the USER NAME textbox, enter a user name for the cluster.
  9. In the PASSWORD and CONFIRM PASSWORD textboxes, enter a very strong password for the user.
  10. At the time of writing, the ENTER HIVE/OOZIE METASTORE option is visible but not selectable, so just click the next icon to proceed.
  11. Under STORAGE ACCOUNT, select CREATE NEW ACCOUNT. Other options are available, but these require a bit more understanding of Azure Storage Vault (ASV).
  12. Under ACCOUNT NAME, enter a name for the storage account.
  13. Under DEFAULT CONTAINER, enter a name for the default “folder” in the storage account under which the Hadoop components and data will be stored. Any valid name is allowed.
  14. Leave ADDITIONAL STORAGE ACCOUNTS set to 0 and click the next icon to proceed.
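
The portal flags invalid names as you type, but it can be handy to check the storage account and default container names before you start. The following is a minimal Python sketch, not part of any Azure SDK. The function names are hypothetical, and the rules encoded are the Azure Storage naming rules as I understand them (account names: 3 to 24 lowercase letters and digits; container names: 3 to 63 lowercase letters, digits, and single hyphens, starting and ending with a letter or digit). Verify them against the current Azure documentation before relying on them.

```python
import re

# Assumed Azure Storage naming rules (verify against the current docs):
# - storage account: 3-24 characters, lowercase letters and digits only
# - container: 3-63 characters, lowercase letters, digits, and hyphens;
#   must start and end with a letter or digit; no consecutive hyphens
_ACCOUNT_RE = re.compile(r"[a-z0-9]{3,24}")
_CONTAINER_RE = re.compile(r"[a-z0-9](?:-?[a-z0-9])*")


def is_valid_storage_account_name(name: str) -> bool:
    """Return True if name satisfies the assumed storage account rules."""
    return _ACCOUNT_RE.fullmatch(name) is not None


def is_valid_container_name(name: str) -> bool:
    """Return True if name satisfies the assumed container rules."""
    return 3 <= len(name) <= 63 and _CONTAINER_RE.fullmatch(name) is not None


if __name__ == "__main__":
    # Hypothetical names, used only for illustration.
    for candidate in ("hadoopdev01", "HadoopDev", "hadoop-data", "hadoop--data"):
        print(candidate,
              is_valid_storage_account_name(candidate),
              is_valid_container_name(candidate))
```

A check like this only tells you whether a name is well formed. Whether the name is still available is something only the portal can confirm.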

At this point, the provisioning of the services begins. This process can take several minutes to complete. You can view the progress of this step by clicking the DETAILS icon on the Creating Cluster bar at the bottom of the portal page. 

Once the process is complete, you have a working HDInsight cluster. The cluster consists of data nodes, a name node, and an associated storage account delivered through the Azure Storage service. The portal will show you the HDInsight cluster as soon as provisioning is completed, but to see the storage account, you may need to click the HOME link at the top of the portal and then click PORTAL from the Azure homepage to return to the portal page. The storage account should now be visible.

Once you are done with your HDInsight Azure cluster, you can delete it by returning to the Azure portal, locating the cluster, and clicking the associated DELETE icon. If the storage account associated with the cluster is not used for other purposes and you do not wish to access it further, you can delete it as well.
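
Because a cluster incurs charges for as long as it exists, it can help to estimate how long a given cluster size can run within your subscription credit before you decide how long to leave it up. The sketch below is plain arithmetic in Python. The function names, the per-node hourly rate, the credit amount, and the assumption that the name node is billed at the same rate as a data node are all hypothetical placeholders, so substitute the figures from the current Azure pricing page and your own subscription.

```python
def hourly_cluster_cost(data_nodes: int,
                        rate_per_node_hour: float,
                        head_nodes: int = 1) -> float:
    """Estimate the hourly cost of a cluster.

    Assumes (hypothetically) that every node, including the name node,
    is billed at the same hourly rate.
    """
    if data_nodes < 1 or head_nodes < 0:
        raise ValueError("node counts must be positive")
    if rate_per_node_hour <= 0:
        raise ValueError("rate must be greater than zero")
    return (data_nodes + head_nodes) * rate_per_node_hour


def hours_covered_by_credit(credit: float,
                            data_nodes: int,
                            rate_per_node_hour: float,
                            head_nodes: int = 1) -> float:
    """Estimate how many cluster-hours a given credit would cover."""
    return credit / hourly_cluster_cost(data_nodes, rate_per_node_hour, head_nodes)


if __name__ == "__main__":
    # Placeholder figures only: 150.0 in credit, 0.5 per node-hour,
    # two data nodes plus one name node.
    print(hours_covered_by_credit(150.0, data_nodes=2, rate_per_node_hour=0.5))
    # prints 100.0
```

Storage is billed separately from compute, so an estimate like this covers only the nodes themselves.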