Analyzing Azure Table Storage data with HDInsight

HDInsight was optimized from the start to quickly analyze data in Azure's blob storage service with Hadoop, using the WASB file system to expose that data as a native Hadoop file system. But the spirit of Hadoop has always been to analyze data wherever it is, so in this post I'll show you how to analyze data residing in the more structured Azure table storage service. This enables you to analyze tables with millions or billions of rows using Hadoop/Hive, distributing that analysis across many compute nodes in Azure.


To follow along in this post, you'll need to:

  1. Have a valid Azure subscription.
  2. Install Azure PowerShell and connect to your subscription - follow this article for details on how to do that.
  3. Have a storage account in your subscription that's selected as the default in your PowerShell session.
  4. Have git and Maven installed and in your PATH in your PowerShell session.

Creating a table to analyze

You can skip this step if you already have a table you want to analyze (and hey, if you're reading this, chances are good that's the case). But just to get a quick-and-dirty example going, here's a simple PowerShell script to create an example table for us to analyze:

function InsertRow($table, [String]$partitionKey, [String]$rowKey, [int]$intValue)
{
  $entity = New-Object "Microsoft.WindowsAzure.Storage.Table.DynamicTableEntity" $partitionKey, $rowKey
  $entity.Properties.Add("IntValue", $intValue)
  $insertResult = $table.CloudTable.Execute(`
    [Microsoft.WindowsAzure.Storage.Table.TableOperation]::Insert($entity))
}

$exampleTable = New-AzureStorageTable tobehadooped
for ($p = 1; $p -le 10; $p++)
{
  for ($r = 1; $r -le 100; $r++)
  {
    InsertRow $exampleTable "P$p" "R$r" $r
  }
}

Setting up the cluster

Now that we have data we want to analyze, let's set up an HDInsight cluster to analyze it. In this tutorial I'll be using the extensions in my GitHub project azure-tables-hadoop to access Azure tables from Hive, so we'll need to get it and package it:

Update (6/27/2014): Since writing this post HDInsight and Hive have advanced forward in a not very backwards-compatible way. I've updated my git project so now it works with HDInsight 3.1 (version 0.0.3 of my jar). If you want the HDInsight 2.1-compatible version please git checkout hdp1compat before the mvn package step below. And if you want the HDInsight 3.0-compatible version please git checkout Hive12 instead.

git clone
cd .\azure-tables-hadoop
mvn package

After that, we need to create a cluster with this JAR accessible to Hive. To do that, we'll use the AdditionalLibraries option documented here when creating the cluster. So we'll need to create a container for the JAR, put it there, then create the cluster pointing to it:

# Please customize the lines below to choose your own user name, password and cluster name
$creds = New-Object System.Management.Automation.PSCredential 'myUser',`
 ('Please ch00se a different password :)' | ConvertTo-SecureString -force -asplaintext)
$clusterName = "mycluster"

$hiveLibsContainer = New-AzureStorageContainer hiveontableslib
$azureJar = ".\target\microsoft-hadoop-azure-0.0.1.jar"
$azureJarBlob = Set-AzureStorageBlobContent -Container $hiveLibsContainer.Name `
  -Blob $(Split-Path $azureJar -Leaf) -File $azureJar -Force
$hiveConf = New-Object $('Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.' + `
  'DataObjects.AzureHDInsightHiveConfiguration')
$hiveConf.AdditionalLibraries = New-Object $('Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.' + `
  'DataObjects.AzureHDInsightDefaultStorageAccount')
$storageAccount = $(Get-AzureSubscription -Current).CurrentStorageAccountName
$hiveConf.AdditionalLibraries.StorageAccountName = "$storageAccount.blob.core.windows.net"
$hiveConf.AdditionalLibraries.StorageAccountKey = (Get-AzureStorageKey $storageAccount).Primary
$hiveConf.AdditionalLibraries.StorageContainerName = $hiveLibsContainer.Name
$clusterContainer = New-AzureStorageContainer $clusterName
$clusterConf = New-AzureHDInsightClusterConfig -ClusterSizeInNodes 1 |
  Set-AzureHDInsightDefaultStorage -StorageAccountName $storageAccount `
  -StorageAccountKey $(Get-AzureStorageKey $storageAccount).Primary -StorageContainerName $clusterContainer.Name
$clusterConf = $clusterConf | Add-AzureHDInsightConfigValues -Hive $hiveConf
$cluster = $clusterConf | New-AzureHDInsightCluster -Credential $creds `
  -Name $clusterName -Location $(Get-AzureStorageAccount $storageAccount).Location

If all goes well, this should take around 10 minutes, and then you'll have a newly minted single-node HDInsight cluster to play with.

Exploring the data using Hive

The easiest way to explore this data is with Hive and its SQL-like query language (you could also use MapReduce directly, or Pig or other Hadoop ecosystem components, but I'll ignore those for this post). The main trick for exposing Azure tables to Hive is a custom Storage Handler, so we use the STORED BY clause to specify that the table data resides in an Azure table, then the TBLPROPERTIES clause to specify the information about that table.

$hiveQuery = "CREATE EXTERNAL TABLE ExampleTable(IntValue int)
  STORED BY 'com.microsoft.hadoop.azure.hive.AzureTableHiveStorageHandler'
  TBLPROPERTIES(
    ""azure.table.name""=""tobehadooped"",
    ""azure.table.account.uri""=""http://$storageAccount.table.core.windows.net"",
    ""azure.table.storage.key""=""$((Get-AzureStorageKey $storageAccount).Primary)"");"
Out-File -FilePath .\HiveCreateTable.q -InputObject $hiveQuery -Encoding ascii
$hiveQueryBlob = Set-AzureStorageBlobContent -File .\HiveCreateTable.q -Blob "queries/HiveCreateTable.q" `
  -Container $clusterContainer.Name -Force
$createTableJobDefinition = New-AzureHDInsightHiveJobDefinition -QueryFile /queries/HiveCreateTable.q
$job = Start-AzureHDInsightJob -JobDefinition $createTableJobDefinition -Cluster $cluster.Name
Wait-AzureHDInsightJob -Job $job

This should create the Hive table for us, so now we can finally explore it! For example, to get the average of the IntValue column:

$avgJobDefinition = New-AzureHDInsightHiveJobDefinition -Query "SELECT AVG(IntValue) FROM ExampleTable"
$job = Start-AzureHDInsightJob -JobDefinition $avgJobDefinition -Cluster $cluster.Name
Wait-AzureHDInsightJob -Job $job
Get-AzureHDInsightJobOutput -JobId $job.JobId -Cluster $cluster.Name

Behind the scenes

Let me air out the dirty laundry and explain how this all works. The main way I expose Azure table data to Hadoop is by defining an InputFormat for it. An InputFormat in Hadoop has two main responsibilities:

  1. Figure out how to split the input into chunks that can be processed separately by individual mappers (thus getting the distributed processing power of Hadoop).
  2. For each split, figure out how to read the data.

The second responsibility is, for our purposes, just grunt work: we use the excellent Azure Storage SDK to read the data and pass it on as WritableEntity objects (in Hive, the StorageHandler understands those and maps them to columns). The first is a bit trickier: there is no universally good way to partition data to get the most out of your cluster. Ideally, if you have 100 nodes (each with 4 mapper slots) processing your data, you want at least 400 splits to keep all your mappers busy. You also want each mapper's share of the data to fit in memory, so you don't want too many rows per mapper (and if a mapper runs for an hour and then fails, all that work has to be redone). But you don't want each mapper to have too little to do either, or your job's run time will be dominated by the overhead of starting and stopping all those mappers.

So in short: tuning this partitioning step is an art, and my default certainly won't be right for most real-world situations. By default, I query for all the partition keys in the table (one by one, so it's a slow process if there are many keys), then split the table by assigning each mapper one of those keys. If you want to customize the partitioning for your data, set the configuration knob 'azure.table.partitioner.class' to a custom partitioner that you provide.
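To make the trade-off concrete, here's a minimal, self-contained Java sketch of the split-assignment logic described above. The class and method names here are illustrative, not the project's actual code, and real Hadoop splits carry more information than just a key range:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch (not the project's actual classes) of the split
// strategies discussed above: each inner list is the set of partition
// keys one mapper would scan.
public class PartitionKeySplitter {

    // Default scheme: one split per distinct partition key. Simple, but
    // can yield too many mappers (overhead-dominated) or too few (no
    // parallelism) depending on the table's key distribution.
    public static List<List<String>> onePerKey(List<String> keys) {
        List<List<String>> splits = new ArrayList<>();
        for (String key : keys) {
            List<String> split = new ArrayList<>();
            split.add(key);
            splits.add(split);
        }
        return splits;
    }

    // Alternative: group keys so we end up with at most targetSplits
    // splits, e.g. targetSplits = nodes * mapperSlotsPerNode.
    public static List<List<String>> grouped(List<String> keys, int targetSplits) {
        List<List<String>> splits = new ArrayList<>();
        if (keys.isEmpty()) {
            return splits; // avoid a zero-sized step below
        }
        int perSplit = (int) Math.ceil((double) keys.size() / targetSplits);
        for (int i = 0; i < keys.size(); i += perSplit) {
            splits.add(new ArrayList<>(
                keys.subList(i, Math.min(i + perSplit, keys.size()))));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("P1", "P2", "P3", "P4", "P5");
        System.out.println(onePerKey(keys).size() + " splits, one per key");
        System.out.println(grouped(keys, 2).size() + " splits when capped at 2");
    }
}
```

A custom partitioner plugged in via 'azure.table.partitioner.class' would implement this kind of policy against the real table, ideally picking a split count of at least (nodes × mapper slots per node) while keeping each split small enough to fit in a mapper's memory.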

Hope that helps! Please leave comments here and/or create issues/pull requests on the project in github if you have any suggestions/feedback.

Comments (28)
  1. Hi there,

    Interesting stuff. Haven't had a go at this yet (I will do soon) but in the meantime, do you know any way of INSERTing data into Azure table Storage using Hive?



  2. Hi Jamie – thanks for your interest. Hive extensibility does allow INSERT into externally stored data like this, but unfortunately it's still a TODO in the Azure tables code I wrote up so it won't work with that (and I don't personally know of others who implemented INSERT functionality). Hopefully I'll get around to implementing it soon, but if you feel up to it feel free to dig in, code it up and send a pull request.


  3. Cool, thank you Mostafa.

  4. Justin says:

    This is really an interesting article. Just wondering how HDInsight team adds the support of Azure Storage to HIVE. Is it also through Storage Handler?

  5. @Justin: Thank you! Azure Blob Storage is exposed as a file system within Hadoop, so Hive can just use it like it uses any other Hadoop file system. See this article for details on how it works:…/hdinsight-use-blob-storage

  6. Tom McCartan says:

    I have been having some difficulty setting this up to query my Azure Table.  I keep getting: HiveException: Error in loading storage

    I set up a new HDInsight Cluster (2.2), got the GIT code and Packaged it.  It created the: microsoft-hadoop-azure-0.0.2.jar

    I then copied this JAR file everywhere in my shared blob storage for Hive.  

    I then remote to the Azure HDInsight Cluster server.  Copied the JAR file to the local file server in Azure and modified the hive-site.xml file to add the file path property reference.  (But I don't know how to restart hive to pick up these settings?)

    Any help in figuring out why Hive can't find the JAR?


  7. Steve says:

    I haven't even touched this yet, but great article to get my head around HDInsight and its place with Azure tables (I actually understand this one).  

  8. huan li says:

    works well with hdinsight 3.0 but not working with hdinsight 3.1

    JVM throw method not found on entitypropertity inspector

  9. @Tom: this is probably because I've since updated the project to HDI 3.0 (and now to HDI 3.1, see my response to Huan below)…

    @Huan: thanks for pointing it out! It turns out there were some minor API breaking changes between Hive 0.12 (in HDI 3.0) and Hive 0.13 (in HDI 3.1). I've updated my code to comply with the latest changes, and created a tag in git, Hive12, for anyone who wants the HDI 3.0 version. I'll update the post to reflect all that…

  10. rob says:

    Thank you so much, this worked great. I was using 3.0 HDINsight, and I didn't notice a tag or a branch for it like you mentioned, but I was able to get the code at a specific version and compile it and go

  11. rob says:

how hard would it be to add support for having the PartitionKey and RowKey in the external table as well? currently only properties are able to be in the external table.


  12. @rob: glad it worked for you. Adding the PartitionKey and RowKey was pretty easy so I added it (commit d57064e). Please sync to the newer version and try it out – or if you want it on the older HDInsight versions just diff that commit and apply it to the older version.

  13. Vinit says:

    Nice article 🙂

    Just curious to know how much difference we see in terms of performance while reading data from blob storage(assuming its written in delimited compressed format) vs azure table storage? As in case of blob storage, we are reading a file system but with table storage handler, we are still reading 1000 records in one call and will be receiving complete data using continuation token. It may affect the data read performance?

  14. Fred says:

    Job fails when i calculate average.

    UI only shows following error msg.

    Description: An exception occurred while processing your request. Additionally, another exception occurred while executing the custom error page for the first exception. The request has been terminated.

  15. Balanc3 says:

    This is amazing man. However, after creating the table I get:

    In Powershell:

    Object reference not set to an instance of an object.

    In Hive:

    Number of reduce tasks is set to 0 since there's no reduce operator


    I think the table might be empty. It seems to not have gotten the data from my table. I'm not sure though. Any thoughts?


  16. Roger says:

    I am getting an error where hive cannot recognize entity property other than partitionKey and rowKey.

    Is any one getting the same error?

  17. Victor says:

    Is there any way to run this connector to the local Azure storage emulator from the local HdInsight emulator?  Thank You!

  18. Brian Lockwood says:

    Tom McCartan:

    did you try using the set command to set the config value for the query? then that way you might need to reboot the head node.…/LanguageManual+Cli

    I have not tried this, just an idea.

  19. Chris says:

    Hive has the "PARTITIONED BY" clause that allows one to partition the information hive caches according to the values in one or more "columns", making queries much faster.  When I try this with your (very nice) code, my table ends up with no rows (it has over 14k rows without that clause).  Do you think using a custom InputFormat/RecordReader is incompatible with "PARTITIONED BY"?  Am I missing something?

  20. Fred says:

    When creating an external table can I not select Timestamp?

  21. chopramanish says:

    I created a HDInsight cluster (version 3.2) and included microsoft-hadoop-azure-0.0.5.jar as additional library.

    I then created an external Hive table pointing to Azure table storage.

    This worked fine. But when I issued a Select statement against the Hive table, I got the error:

    Exception in thread "main" java.lang.NoClassDefFoundError: com/microsoft/windowsazure/storage/StorageException

    Has anyone  come across this error?

  22. chopramanish says:

    I haven't been able to figure out how to create an external table with Timestamp as one of the columns. Can someone please share an example.


  23. chopramanish says:

    I was able to create a table with timestamp by specifying timestamp as the column type:

    CREATE EXTERNAL TABLE TestTableHive(PartitionKey string, RowKey string, Timestamp timestamp)

  24. Amr says:

    is the   installed by default on newly created  Hdinsight nodes ?  or this is something we have to get and install ?  

  25. Cedric Legendre says:


    I've got an error too: ClassNotFoundException Class

    I run the cluster in HD 3.1.

    Has anyone  come across this error?

  26. lightsailpro says:

    I am using Hortonworks hdp 2.2.4 on linux with microsoft-hadoop-azure-0.0.5.jar. I can create the external table fine. But I got the following error when running select. Could you help?

    Exception in thread "main" java.lang.NoClassDefFoundError: com/microsoft/windowsazure/storage/StorageException

           at java.lang.Class.getDeclaredConstructors0(Native Method)
           at java.lang.Class.privateGetDeclaredConstructors(
           at java.lang.Class.getConstructor0(
           at java.lang.Class.getDeclaredConstructor(
           at org.apache.hadoop.util.ReflectionUtils.newInstance(
           at org.apache.hadoop.hive.ql.exec.FetchOperator.getInputFormatFromCache(
           at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(
           at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(
           at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(
           at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(
           at org.apache.hadoop.hive.ql.Driver.getResults(
           at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(

  27. Eduard says:

With the new ARM commandlets the above doesn't work anymore; any pointers on how to do it for ARM?

  28. Nick McCollum says:

    Looks like the syntax for Add-AzureHDInsightConfigValues has changed. It appears to take a hashtable now for the -HiveEnv or -HiveSite parameters (-Hive no longer exists). Cannot get the syntax correct for passing in the path to the .jar file now. Any help would be appreciated. Thanks.

