Customizing HDInsight Cluster provisioning

In my last blog, I discussed how to specify Hadoop configurations for a job on an HDInsight cluster. At the end of that blog, I also discussed an alternative approach: you may want to change certain Hadoop configurations from their default values and preserve those changes throughout the lifetime of the cluster, perhaps because the configurations worked well for your workload during testing and apply to most of your jobs. You can do this via cluster customization while creating the HDInsight cluster. This approach also fits well with the 'elastic Hadoop in the cloud' scenario, where you create a customized HDInsight cluster with specific configurations, run your workload, and then remove the cluster.

While creating my own customized cluster, I realized that our existing documentation did not make it obvious which customization options are available or how to use them without digging through the reference documentation. In this blog, I want to share a few examples (a PowerShell script and a .NET SDK example) with various customization options that can be used during HDInsight cluster provisioning.

Can we do it using the Azure Portal?

The short answer is yes, but with limitations. As shown in our HDInsight documentation, we can create a customized HDInsight cluster via the Azure Portal, Windows Azure PowerShell, or the HDInsight .NET SDK. While I personally like the Azure Portal most for its simplicity and ease of use, not all of the customization options are available via the portal as of today. For example, you cannot customize Hadoop configuration files or add additional libraries or JARs during cluster provisioning, as shown in this CodePlex example. The portal UI also limits the number of additional storage accounts we can specify. Windows Azure PowerShell and the HDInsight .NET SDK don't have these limitations, so with these tools you can use all of the available customization options. Another benefit is that you can reuse the PowerShell script or .NET SDK code and make it part of your workflow.
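To make "customizing Hadoop configuration files" a bit more concrete: Hadoop keeps its settings in XML files such as core-site.xml or hive-site.xml, where each setting is a name/value property. The Python sketch below only illustrates that file format; it is not part of the portal, PowerShell, or .NET SDK tooling. The helper name `render_site_xml` and the sample values are my own assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def render_site_xml(overrides):
    """Render a dict of Hadoop settings in the *-site.xml property format.

    Hypothetical helper for illustration only; the provisioning tools
    take the name/value pairs themselves rather than hand-written XML.
    """
    root = ET.Element("configuration")
    for name, value in sorted(overrides.items()):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Sample overrides, chosen only as an example of settings you might
# want to keep for the lifetime of a cluster.
hive_overrides = {
    "hive.exec.compress.output": "true",
    "mapred.task.timeout": "1200000",
}
print(render_site_xml(hive_overrides))
```

Whichever tool you use, the customization boils down to name/value pairs like these, applied once at provisioning time instead of on every job.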

The chart below shows a summary of a few important customizations that are available via the portal, PowerShell, and the .NET SDK:

Example using Windows Azure PowerShell:

Here is a sample PowerShell script with examples of almost all the possible customization options during provisioning of a cluster. You can omit the customizations that you don't need.

 

Example using the HDInsight .NET SDK:

Here is an equivalent cluster customization sample using the HDInsight .NET SDK. As before, omit the customizations you don't need.

 

Can we customize a cluster after provisioning?

We can, but as explained in Dan's blog, any manual modification of the Hadoop configuration files or any other file, made outside of cluster customization at install time, won't be preserved when the Azure VM nodes get updated. For that reason it is neither recommended nor supported. The good news is that you can always customize or configure a job. Here are some of the possible options (the list is not exhaustive):

1. You can specify Hadoop configuration values for a job, as shown in this blog.

2. You can use additional Azure Storage accounts (that are not associated with this HDInsight cluster) for a job, as shown in this TechNet article.

3. You can upload a custom JAR to Windows Azure Blob Storage and refer to that JAR from a job via the MapReduce -libjars, Hive ADD JAR, or Pig REGISTER mechanisms.
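As a rough illustration of the three job-level options above, here is a small Python sketch that assembles the relevant command-line arguments and statements. Treat it as a sketch under assumptions rather than the exact approach from the linked posts: the helper names, container, account, and JAR names are all hypothetical, and I am assuming the JAR is addressed with a `wasb://` URI and that the extra storage account is made available to the job through the `fs.azure.account.key.<account>.blob.core.windows.net` property.

```python
def wasb_uri(container, account, path):
    """Build a wasb:// URI for a blob in Azure Blob Storage."""
    return "wasb://{0}@{1}.blob.core.windows.net/{2}".format(
        container, account, path.lstrip("/"))


def storage_key_define(account, key):
    """Job-level property granting access to an additional storage account."""
    name = "fs.azure.account.key.{0}.blob.core.windows.net".format(account)
    return {name: key}


def mapreduce_args(app_jar, main_class, lib_jars=(), defines=None, app_args=()):
    """Assemble a 'hadoop jar' command line as a list of arguments.

    Generic options (-libjars, -D) are placed before the application's
    own arguments. They only take effect if the main class parses them,
    for example through ToolRunner / GenericOptionsParser.
    """
    args = ["hadoop", "jar", app_jar, main_class]
    if lib_jars:
        args += ["-libjars", ",".join(lib_jars)]
    for name, value in sorted((defines or {}).items()):
        args += ["-D", "{0}={1}".format(name, value)]
    args += list(app_args)
    return args


def hive_add_jar(jar_uri):
    """Hive statement that adds a JAR to the session."""
    return "ADD JAR {0};".format(jar_uri)


def pig_register(jar_uri):
    """Pig statement that registers a JAR for the script."""
    return "REGISTER '{0}';".format(jar_uri)


# Hypothetical names, used only to show the shape of the output.
jar = wasb_uri("mycontainer", "mystorage", "/libs/custom-udfs.jar")
defines = {"mapred.task.timeout": "1200000"}
defines.update(storage_key_define("extrastorage", "<storage-account-key>"))

print(" ".join(mapreduce_args("wordcount.jar", "WordCount",
                              [jar], defines, ["/input", "/output"])))
print(hive_add_jar(jar))
print(pig_register(jar))
```

Because these settings travel with the job rather than living on the cluster nodes, they are not affected when the Azure VM nodes get updated.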

That's all for today. I hope you find the blog helpful!

@Azim (MSFT)