Provision an HDInsight Cluster with Tez as Default Hive Execution Engine

There is a lot you can do with Hadoop, but I primarily use it to store data I want to explore loosely.  That means I mostly work with the data through Hive, and the easiest way for me to bring up a cluster for this kind of work is to fire up Azure HDInsight.

By default, Hive translates its HiveQL statements into MapReduce jobs.  HDInsight keeps this default, so if you want your HiveQL statements executed via Tez (which gives a significant performance boost in most circumstances), you have to start your session by submitting this directive:

set hive.execution.engine = tez;

This isn't that big a deal, but if you submit a statement and have forgotten to issue the directive, you either have to kill your job or just wait it out.  It's a minor annoyance, and one you can avoid by configuring the Azure HDInsight cluster to use Tez as its default execution engine for Hive.

The only supported mechanism for this is to create your Azure HDInsight cluster via PowerShell, so here is a complete PowerShell script illustrating the process.  Obviously, I've removed passwords and the names of specific services I've provisioned, but with the right replacements, this should work for you.  The statements focused on setting the default engine to Tez are the two under the "#Set Default Hive Execution Engine to Tez" comment and the Add-AzureHDInsightConfigValues line in the pipeline:

$location = "East US"

$storageAccountName = "mystorageaccount"
$storageContainerName = "forhdinsight"

$dbServer = "myazuredbserver"
$dbName = "mymetastoredb"
$dbUserName = "dbuser"
$dbPassword = "dbuserpwd"

$hdiDataNodes = 16
$hdiName = "myhdicluster"
$hdiVersion = "3.1"

$hdiUserName = "hdiuser"
$hdiPassword = "hdiuserpwd"

#Get Azure Storage Key
$storageAccountKey = Get-AzureStorageKey $storageAccountName | %{ $_.Primary }

#Get Credentials - you could also prompt with Get-Credential
$hdiSecurePassword = ConvertTo-SecureString $hdiPassword -AsPlainText -Force
$hdiCredential = New-Object System.Management.Automation.PSCredential($hdiUserName, $hdiSecurePassword)

$dbSecurePassword = ConvertTo-SecureString $dbPassword -AsPlainText -Force
$dbCredential = New-Object System.Management.Automation.PSCredential($dbUserName, $dbSecurePassword)

#Set Default Hive Execution Engine to Tez
$hiveConfig = New-Object 'Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.DataObjects.AzureHDInsightHiveConfiguration'
$hiveConfig.Configuration = @{ "hive.execution.engine"="tez" }

#New HDI Cluster
New-AzureHDInsightClusterConfig -ClusterSizeInNodes $hdiDataNodes -HeadNodeVMSize Large |
Set-AzureHDInsightDefaultStorage -StorageAccountName "$storageAccountName.blob.core.windows.net" -StorageAccountKey $storageAccountKey -StorageContainerName $storageContainerName |
Add-AzureHDInsightMetastore -SqlAzureServerName "$dbServer.database.windows.net" -DatabaseName $dbName -Credential $dbCredential -MetastoreType HiveMetaStore |
Add-AzureHDInsightMetastore -SqlAzureServerName "$dbServer.database.windows.net" -DatabaseName $dbName -Credential $dbCredential -MetastoreType OozieMetaStore |
Add-AzureHDInsightConfigValues -Hive $hiveConfig |
New-AzureHDInsightCluster -Name $hdiName -Location $location -Credential $hdiCredential -Version $hdiVersion
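
Once the cluster is up, it is worth confirming that the setting took.  Here is a minimal sketch of one way to check, assuming the same Azure PowerShell module as the script above and that $hdiName still holds the cluster name.  It relies on the fact that a Hive "set" statement with a property name and no value prints that property's current setting; the exact shape of the returned output may vary, so treat this as a starting point rather than a definitive check:

```powershell
# Assumption: the provisioning script above has already run in this session,
# so $hdiName still contains the name of the new cluster.

# Point the Hive cmdlets that follow at the new cluster.
Use-AzureHDInsightCluster $hdiName

# "set <property>;" with no value asks Hive to print the current setting.
$response = Invoke-Hive -Query "set hive.execution.engine;"

# Display whatever the job returned; look for the engine name in the output.
Write-Host $response
```

If the output still reports MapReduce, the cluster was most likely created without the Add-AzureHDInsightConfigValues step in the pipeline.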

If you are new to working with PowerShell for Azure, be sure to review this documentation.  Also, this script does not include any details on accessing your Azure subscription via PowerShell.  For that, be sure to review these steps.
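
For orientation, connecting a PowerShell session to a subscription might look roughly like the sketch below.  It assumes the same Azure PowerShell module used by the script above, and $subscriptionName is a hypothetical placeholder, not a value from my environment:

```powershell
# Hypothetical placeholder: replace with the name of your own subscription.
$subscriptionName = "mysubscription"

# Sign in interactively; this makes the account's subscriptions available
# to the Azure cmdlets in this session.
Add-AzureAccount

# List the subscriptions the signed-in account can see.
Get-AzureSubscription | Select-Object SubscriptionName

# Make the chosen subscription current for the cmdlets that follow.
Select-AzureSubscription -SubscriptionName $subscriptionName
```

Run this before the provisioning script, since cmdlets such as Get-AzureStorageKey operate against the current subscription.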