How to pass Hadoop configuration values for a job on HDInsight

I have come across this question a few times recently from several customers: "how do we pass Hadoop configurations at runtime for a MapReduce job or Hive query via HDInsight PowerShell or the .Net SDK?" I thought I would share the answer here for others who may run into the same question. It is pretty common in the Hadoop world to customize Hadoop configuration values that live in configuration files like core-site.xml, mapred-site.xml, hive-site.xml etc. for a specific workload or a specific job. Hadoop configuration in general is a broad topic, and there are many different ways (site-level, node-level, application-level etc.) of specifying configuration values; I don't plan to cover each of these. My focus is on run-time configuration for a specific job or application. To specify Hadoop configuration values for a specific job or application, we typically use the 'hadoop -conf' or 'hadoop -D' generic options, as shown in this Apache documentation; for a MapReduce JAR, we use 'hadoop jar -conf' or 'hadoop jar -D'. In this blog, we will keep our focus on the 'hadoop jar -D' option and see how we can achieve the same capability on HDInsight, specifically from HDInsight PowerShell or the .Net SDK.

Let's take a look at a few examples.

'hadoop jar -D' in Apache Hadoop:

With Apache Hadoop, if I wanted to run the wordcount MapReduce example from the Hadoop command line and compress the output of my MapReduce job, I could do something like this -

hadoop jar hadoop-examples.jar wordcount -Dmapred.output.compress=true -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec /input /wordcount/output

'hadoop jar -D' on Windows Azure HDInsight:

For Windows Azure HDInsight (or Hortonworks Data Platform on Windows), the syntax is slightly different: each -D option needs to be wrapped in double quotes. The command would be something like this -

hadoop jar hadoop-examples.jar wordcount "-Dmapred.output.compress=true" "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" /input /wordcount/output

This is explained nicely in this HDInsight forum thread.

So far, so good. But the above syntax works only from the Hadoop command line, which, on an HDInsight cluster, requires the user to RDP into the cluster head node. On HDInsight, we envision that most users will use HDInsight PowerShell or the .Net SDK from a remote client or application server to run Hadoop MapReduce, Hive, Pig or .Net Streaming jobs and make them part of a richer workflow.

Passing Hadoop configuration values for a job via HDInsight PowerShell:

The HDInsight job definition cmdlets New-AzureHDInsightMapReduceJobDefinition, New-AzureHDInsightHiveJobDefinition, Invoke-AzureHDInsightHiveJob and New-AzureHDInsightStreamingMapReduceJobDefinition have a parameter called "-Defines" that we can use to pass Hadoop configuration values for a specific job at run-time.

Here is a PowerShell script with examples for MapReduce and Hive jobs -
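
The snippet below is a minimal sketch of such a script; the cluster name, input/output paths and the specific configuration values are placeholders you would replace with your own, and job submission follows the usual Start-AzureHDInsightJob / Wait-AzureHDInsightJob pattern.

# MapReduce job: compress the job output by passing the settings via -Defines
$clusterName = "your-cluster-name"

$mrJobDef = New-AzureHDInsightMapReduceJobDefinition `
    -JarFile "wasb:///example/jars/hadoop-examples.jar" `
    -ClassName "wordcount" `
    -Arguments "wasb:///example/data/input", "wasb:///example/data/wordcountoutput" `
    -Defines @{ "mapred.output.compress" = "true"; "mapred.output.compression.codec" = "org.apache.hadoop.io.compress.GzipCodec" }

$mrJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $mrJobDef
Wait-AzureHDInsightJob -Job $mrJob -WaitTimeoutInSeconds 3600

# Hive job: turn on intermediate compression for this query only, again via -Defines
$hiveJobDef = New-AzureHDInsightHiveJobDefinition `
    -Query "SELECT COUNT(*) FROM hivesampletable;" `
    -Defines @{ "hive.exec.compress.intermediate" = "true" }

$hiveJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $hiveJobDef
Wait-AzureHDInsightJob -Job $hiveJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $hiveJob.JobId -StandardOutput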

As you may have noted, the "-Defines" parameter takes a hashtable, so you can specify multiple configuration values separated by semicolons. By the way, the HDInsight PowerShell cmdlets are now integrated with Windows Azure PowerShell and can be installed from here.

Passing Hadoop configuration values for a job via HDInsight .Net SDK:

Similarly, the HDInsight .Net SDK classes MapReduceJobCreateParameters, HiveJobCreateParameters and StreamingMapReduceJobCreateParameters have a property called 'Defines' that we can use to pass Hadoop configuration values for a specific job. An example is shown in the code snippet below; I have included just the relevant code. For a full example of using the HDInsight .Net SDK to run Hadoop jobs, please review our HDInsight documentation here.
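
The snippet below is a minimal sketch of the relevant code, assuming the Microsoft .Net SDK for Hadoop client library (Microsoft.Hadoop.Client); the jar path, arguments and the specific configuration values are placeholders, and submitting the job definitions is left to the usual job submission client pattern shown in the documentation.

using Microsoft.Hadoop.Client;

class JobDefinesExample
{
    static void BuildJobDefinitions()
    {
        // MapReduce job definition: pass job-level Hadoop configuration values through Defines
        var mrJobDefinition = new MapReduceJobCreateParameters
        {
            JarFile = "wasb:///example/jars/hadoop-examples.jar",   // placeholder path
            ClassName = "wordcount"
        };
        mrJobDefinition.Arguments.Add("wasb:///example/data/input");
        mrJobDefinition.Arguments.Add("wasb:///example/data/wordcountoutput");
        mrJobDefinition.Defines.Add("mapred.output.compress", "true");
        mrJobDefinition.Defines.Add("mapred.output.compression.codec",
            "org.apache.hadoop.io.compress.GzipCodec");

        // Hive job definition: enable intermediate compression for this query only
        var hiveJobDefinition = new HiveJobCreateParameters
        {
            Query = "SELECT COUNT(*) FROM hivesampletable;"
        };
        hiveJobDefinition.Defines.Add("hive.exec.compress.intermediate", "true");

        // Submit the definitions with the job submission client as usual, for example:
        // var jobClient = JobSubmissionClientFactory.Connect(credentials);
        // jobClient.CreateMapReduceJob(mrJobDefinition);
    }
}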

Passing Hadoop configuration values via the WebHCat REST API:

HDInsight PowerShell and the .Net SDK use the WebHCat (aka Templeton) REST API to submit jobs remotely, and they leverage the Templeton 'define' parameter to pass Hadoop job configuration values; the 'define' parameter is available as part of the REST API for MapReduce, Hive and Streaming jobs. If you were wondering why certain job types, such as Pig, don't have the "-Defines" parameter in the HDInsight PowerShell cmdlets or the .Net SDK, the reason is that the WebHCat/Templeton (v1) REST API does not have a 'define' parameter for those job types.

In general, we recommend that you use HDInsight PowerShell or the .Net SDK to submit remote jobs via WebHCat/Templeton, because the SDK makes things easier for you and handles the underlying REST API details. But if you can't use HDInsight PowerShell or the .Net SDK for some reason and need to call the REST API directly, here is an example of passing Hadoop configuration values via the WebHCat REST API, using Windows PowerShell. You can also use any utility, such as cURL, that can invoke a REST API.
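
The snippet below is a minimal sketch of such a call; the cluster name, user name, paths and the configuration values are placeholders, and each repeated 'define' entry in the request body carries one name=value pair (with the inner '=' URL-encoded as %3D).

$clusterName = "your-cluster-name"
$creds = Get-Credential     # HTTP user credentials for the cluster gateway

# Submit the wordcount MapReduce job through WebHCat, passing two Hadoop
# configuration values for this job via repeated 'define' parameters
$body = "user.name=admin" +
        "&jar=/example/jars/hadoop-examples.jar" +
        "&class=wordcount" +
        "&arg=/example/data/input&arg=/example/data/wordcountoutput" +
        "&define=mapred.output.compress%3Dtrue" +
        "&define=mapred.output.compression.codec%3Dorg.apache.hadoop.io.compress.GzipCodec" +
        "&statusdir=/example/status"

Invoke-RestMethod -Uri "https://$clusterName.azurehdinsight.net/templeton/v1/mapreduce/jar" `
    -Method Post -Credential $creds -Body $body `
    -ContentType "application/x-www-form-urlencoded"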

Persistent Hadoop configurations via HDInsight cluster customization:

I know our focus in this blog has been on run-time Hadoop configurations, but I do want to call out that if there are certain Hadoop configuration values you want to change from the defaults for the HDInsight cluster, and you want to preserve those changes throughout the cluster lifetime, you can do this via cluster customization with HDInsight PowerShell or the .Net SDK, as shown here. This approach works well for a short-lived cluster or elastic services, where you create a customized cluster with specific configurations, run your workload and then remove the cluster. A quick PowerShell illustration follows.
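
The snippet below is a minimal sketch of customizing configuration values at provision time; the storage account, container, cluster name, node count and the core-site value shown are placeholders, and the exact parameters are covered in the cluster customization documentation linked above.

$cred = Get-Credential    # admin credentials for the new cluster

# Provision a cluster whose core-site configuration includes the custom value,
# so the setting persists for the lifetime of the cluster
New-AzureHDInsightClusterConfig -ClusterSizeInNodes 4 |
    Set-AzureHDInsightDefaultStorage -StorageAccountName "mystorageaccount.blob.core.windows.net" `
        -StorageAccountKey "your-storage-key" -StorageContainerName "mycontainer" |
    Add-AzureHDInsightConfigValues -Core @{ "io.compression.codecs" = "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec" } |
    New-AzureHDInsightCluster -Name "my-customized-cluster" -Location "West US" -Credential $cred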

Also, as explained in Dan's blog, outside of cluster customization at install time, any manual modification of the Hadoop configuration files (or of any other file) won't be preserved when the Azure VM nodes get updated.

That's it for today. I hope you find it helpful!

@Azim (MSFT)