How to use HDInsight from Linux

02/18/2014

HDInsight is very easy to use from PowerShell, but how would you create and delete a cluster from Linux? How would you submit a job and get the result?

Here is a simple sample, with pointers to further documentation.

1. Create a cluster

You can create a cluster with the Windows Azure Command Line Interface (CLI).

To install the CLI, go to the Downloads page at https://windowsazure.com. At the bottom of the page there are two links: one for the CLI itself, the other for its documentation.

Once it is installed, you have an azure command with many options.

The following bash script will create a cluster:

#!/bin/bash
# create an HDInsight cluster

# more information at https://www.windowsazure.com/en-us/documentation/articles/hdinsight-administer-use-command-line/

defaultStorageAccount='monstockageazure'
storageAccount2='wasbshared'
clusterName='monclusterhadoop'
clusterContainerName='monclusterhadoop2'
clusterVersion='2.1'
clusterAdmin='cornac'
clusterConfigFile='./hdinsightCluster.config'

subscription='demos874F33876Y'

clusterPassword='YHqj6sq#ap9'
defaultStorageAccountKey='9O5uEqY1MsT6LIKifmXL0bQgrQElbslvu4N6mX58mSpPa4sPtYPTL5YjvLvcQAItuw87BdLulZWnGJWZ/VCd6Q=='
storageAccount2Key='7on846mc+5u9AItkVIEYz1OXwJZ86gN7o7ExURXO3qWJy+jNO56EtfUmRur+/qKkFGc4drA4GvBmhYGiBMlj3g=='

azure account set $subscription

azure hdinsight cluster config create $clusterConfigFile
azure hdinsight cluster config set $clusterConfigFile --clusterName $clusterName --nodes 3 --location "North Europe" --storageAccountName "$defaultStorageAccount.blob.core.windows.net" --storageAccountKey "$defaultStorageAccountKey" --storageContainer "$clusterName" --username "$clusterAdmin" --clusterPassword "$clusterPassword"
azure hdinsight cluster config storage add $clusterConfigFile --storageAccountName "$storageAccount2.blob.core.windows.net" --storageAccountKey "$storageAccount2Key"

azure hdinsight cluster create --config $clusterConfigFile

2. Submit a job

HDInsight exposes an Apache REST API called WebHCat (formerly named Templeton), which lets you submit jobs. It is documented at https://cwiki.apache.org/confluence/display/Hive/WebHCat.

There are many ways to call a REST API from Linux. The one I chose for this post is Python. For this sample, install the “requests” module:

 pip install requests

Then you can run this script (02_submit_hive_job.py):

import requests  # https://pypi.python.org/pypi/requests

clusterName = 'monclusterhadoop'
clusterAdmin = 'cornac'
clusterPassword = 'YHqj6sq#ap9'

# get WebHCat status
webHCatUrl = 'https://' + clusterName + '.azurehdinsight.net/templeton/v1/status'

r = requests.get(webHCatUrl, auth=(clusterAdmin, clusterPassword))

print(r.status_code)
print(r.json())

# submit a hive job:
# SELECT * FROM hivesampletable limit 10
# https://docs.hortonworks.com/HDPDocuments/HDP1/HDP-Win-1.3.0/ds_HCatalog/hive.html

webHCatUrl = 'https://' + clusterName + '.azurehdinsight.net/templeton/v1/hive'

hive_params = {'user.name': clusterAdmin,
               'execute': 'SELECT * FROM hivesampletable limit 10',
               'statusdir': '/wasbwork/hive_from_python'}

# the JSON response contains the id of the submitted job
r = requests.post(webHCatUrl, auth=(clusterAdmin, clusterPassword), data=hive_params)
print(r.status_code)
print(r.json())

Run it with the following command line:

python 02_submit_hive_job.py

In my case, I got the following result:

 benjguin@benjguinu2:~/dev/hdinsight_from_linux$ python 02_submit_hive_job.py
200
{u'status': u'ok', u'version': u'v1'}
200
{u'id': u'job_201402171346_0002'}

You can also get the status of the job, submit Pig jobs, and submit Hive jobs from scripts you uploaded to Windows Azure Storage Blob. Here is a link to the documentation by Hortonworks:

https://docs.hortonworks.com/HDPDocuments/HDP1/HDP-Win-1.3.0/ds_HCatalog/hive.html

That page has a table of contents on the left.
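Getting the status of a job from a script could look like the following. This is only a minimal sketch under assumptions that are not in the sample above: that this generation of WebHCat exposes the job at templeton/v1/queue/<job id> (newer versions use jobs/<job id> instead), and that the returned JSON has a status object with a jobComplete flag. The helper names (job_status_url, is_job_complete, wait_for_job, fetch_with_requests) are hypothetical. Check the WebHCat documentation for the version your cluster runs before relying on it.

```python
import time


def job_status_url(cluster_name, job_id):
    # ASSUMPTION: the 'queue/<job id>' resource; newer WebHCat uses 'jobs/<job id>'.
    return ('https://' + cluster_name +
            '.azurehdinsight.net/templeton/v1/queue/' + job_id)


def is_job_complete(job_info):
    # ASSUMPTION: the response carries {'status': {'jobComplete': true/false, ...}}.
    status = job_info.get('status') or {}
    return bool(status.get('jobComplete'))


def wait_for_job(fetch_job_info, poll_seconds=10, max_polls=60):
    """Call fetch_job_info() until the job is reported complete, then return its info."""
    for _ in range(max_polls):
        info = fetch_job_info()
        if is_job_complete(info):
            return info
        time.sleep(poll_seconds)
    raise RuntimeError('job still running after %d polls' % max_polls)


def fetch_with_requests(cluster_name, job_id, admin, password):
    """One status request, using the same 'requests' module as the submit script."""
    import requests  # third-party; imported here so the helpers above need no install
    r = requests.get(job_status_url(cluster_name, job_id),
                     auth=(admin, password),
                     params={'user.name': admin})
    r.raise_for_status()
    return r.json()
```

With the values from the submit script, a call would be wait_for_job(lambda: fetch_with_requests(clusterName, 'job_201402171346_0002', clusterAdmin, clusterPassword)). Passing the fetch function in as a parameter keeps the polling loop independent of the HTTP library.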

3. Get the result

In the Python script, we asked for the result to be written to /wasbwork/hive_from_python, so it is stored in Windows Azure Storage Blob, or wasb (in HDInsight, wasb is the default file system over HDFS, which is also available at hdfs://namenodehost:9000/(…)). Once the job is finished, which a script can detect with the same REST API, the output files are in that folder; the query output itself is in a file named stdout.

You can download the result with the azure CLI and display it with this bash script:

#!/bin/bash

defaultStorageAccount='monstockageazure'
clusterName='monclusterhadoop'
defaultStorageAccountKey='9O5uEqY1MsT6LIKifmXL0bQgrQElbslvu4N6mX58mSpPa4sPtYPTL5YjvLvcQAItuw87BdLulZWnGJWZ/VCd6Q=='

export AZURE_STORAGE_ACCOUNT="$defaultStorageAccount"
export AZURE_STORAGE_ACCESS_KEY="$defaultStorageAccountKey"

azure storage blob download $clusterName wasbwork/hive_from_python/stdout
cat wasbwork/hive_from_python/stdout

In my case, this gave the following result:

 benjguin@benjguinu2:~/dev/hdinsight_from_linux$ ./03_get_result.sh
info:    Executing command storage blob download
+ Download blob wasbwork/hive_from_python/stdout in container monclusterhadoop to wasbwork/hive_from_python/stdout
Percentage: 100.0% (809.00B/809.00B) Average Speed: 809.00B/S Elapsed Time: 00:00:00
+ Getting Storage blob information
info:    File saved as wasbwork/hive_from_python/stdout
info:    storage blob download command OK
8       18:54:20        en-US   Android Samsung SCH-i500        California      United States   13.9204007      0       0
23      19:19:44        en-US   Android HTC     Incredible      Pennsylvania    United States   NULL    0       0
23      19:19:46        en-US   Android HTC     Incredible      Pennsylvania    United States   1.4757422       0       1
23      19:19:47        en-US   Android HTC     Incredible      Pennsylvania    United States   0.245968        0       2
28      01:37:50        en-US   Android Motorola        Droid X Colorado        United States   20.3095339      1       1
28      00:53:31        en-US   Android Motorola        Droid X Colorado        United States   16.2981668      0       0
28      00:53:50        en-US   Android Motorola        Droid X Colorado        United States   1.7715228       0       1
28      16:44:21        en-US   Android Motorola        Droid X Utah    United States   11.6755987      2       1
28      16:43:41        en-US   Android Motorola        Droid X Utah    United States   36.9446892      2       0
28      01:37:19        en-US   Android Motorola        Droid X Colorado        United States   28.9811416      1       0
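If you want to process the downloaded file rather than just display it, a few lines of Python are enough. This is a minimal sketch, assuming the fields in stdout are separated by tab characters (the usual Hive text output, which is how the columns above appear aligned) and that missing values show up as the literal text NULL, as in the second row. The function names parse_hive_stdout and read_hive_stdout are hypothetical.

```python
import io


def parse_hive_stdout(text):
    """Split tab-separated Hive output into rows of fields.

    ASSUMPTION: one record per line, fields separated by tabs,
    and the literal string NULL marking a missing value.
    """
    rows = []
    for line in text.splitlines():
        if not line:
            continue  # skip blank lines
        rows.append([None if field == 'NULL' else field
                     for field in line.split('\t')])
    return rows


def read_hive_stdout(path='wasbwork/hive_from_python/stdout'):
    """Parse the file saved by the download script above."""
    with io.open(path, encoding='utf-8') as f:
        return parse_hive_stdout(f.read())
```

After running the download script, read_hive_stdout() returns one list per row, with every field kept as a string; converting the numeric columns is left to the caller.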

4. Remove the cluster

The azure CLI can also remove the cluster:

#!/bin/bash

clusterName='monclusterhadoop'

azure hdinsight cluster delete $clusterName

This produces the following sample result:

 benjguin@benjguinu2:~/dev/hdinsight_from_linux$ ./04_removeCluster.sh
info:    Executing command hdinsight cluster delete
+ Removing HDInsight Cluster
info:    hdinsight cluster delete command OK
benjguin@benjguinu2:~/dev/hdinsight_from_linux$

Conclusion

This post only shows a few simple examples. The goal is to show the principles that can be used. The azure CLI is used to manage the cluster itself, and may also be used to interact with Windows Azure Storage blobs. Submitting jobs can be done with WebHCat REST calls.

Benjamin (@benjguin)