Analyzing Azure Table Storage data with HDInsight

HDInsight was optimized from the start to be able to quickly analyze data on Azure's blob storage service using Hadoop by using the WASB file system to expose the data there as a native Hadoop file system. But the spirit of Hadoop has always been to be able to analyze data wherever it is, so in this post I'll show you how to analyze data residing the more structured Azure table storage service. This would enable you to analyze tables that have millions/billions of rows in them using Hadoop/Hive to distribute that analysis to many compute nodes in Azure.

Prerequisites

To follow along in this post, you'll need to:

  1. Have a valid Azure subscription.
  2. Install Azure PowerShell and connect to your subscription - follow this article for details on how to do that.
  3. Have a storage account in your subscription that's selected as the default in your PowerShell session.
  4. Have git and Maven installed and in your PATH in your PowerShell session.

Creating a table to analyze

You can skip this step if you already have a table you want to analyze (and hey if you're reading this chances are good that this is the case). But just to get a quick-and-dirty example going, here's is a simple PowerShell script to create an example table for us to analyze:

 function InsertRow($table, [String]$partitionKey, [String]$rowKey, [int]$intValue)
{
  $entity = New-Object "Microsoft.WindowsAzure.Storage.Table.DynamicTableEntity" $partitionKey, $rowKey
  $entity.Properties.Add("IntValue", $intValue)
  $insertResult = $table.CloudTable.Execute(`
    [Microsoft.WindowsAzure.Storage.Table.TableOperation]::Insert($entity))
}
$exampleTable = New-AzureStorageTable tobehadooped
for ($p = 1; $p -le 10; $p++)
{
  for ($r = 1; $r -le 100; $r++)
  {
    InsertRow $exampleTable "P$p" "R$r" $r
  }
}
        

Setting up the cluster

Now that we have data we want to analyze, let's setup an HDInsight cluster to analyze it. In this tutorial I'll be using the extensions in my github project azure-tables-hadoop to be able to access Azure tables from Hive, so we'll need to get it and package it:

Update (6/27/2014): Since writing this post HDInsight and Hive have advanced forward in a not very backwards-compatible way. I've updated my git project so now it works with HDInsight 3.1 (version 0.0.3 of my jar). If you want the HDInsight 2.1-compatible version please git checkout hdp1compat before the mvn package step below. And if you want the HDInsight 3.0-compatible version please git checkout Hive12 instead.

 git clone https://github.com/mooso/azure-tables-hadoop.git
cd .\azure-tables-hadoop
mvn package
        

After that, we need to create a cluster with this JAR accessible to Hive. To do that, we'll use the AdditionalLibraries option when creating the cluster that's documented here. So we'll need to create a container to put this JAR in, put it there, then create the cluster that points to it:

 # Please customize the lines below to choose your own user name, password and cluster name
$creds = New-Object System.Management.Automation.PSCredential 'myUser',`
 ('Please ch00se a different password :)' | ConvertTo-SecureString -force -asplaintext)
$clusterName = "mycluster"

$hiveLibsContainer = New-AzureStorageContainer hiveontableslib
$azureJar = ".\target\microsoft-hadoop-azure-0.0.1.jar"
$azureJarBlob = Set-AzureStorageBlobContent -Container $hiveLibsContainer.Name `
  -Blob $(Split-Path $azureJar -Leaf) -File $azureJar -Force
$hiveConf = New-Object $('Microsoft.WindowsAzure.Management.' + `
  'HDInsight.Cmdlet.DataObjects.AzureHDInsightHiveConfiguration')
$hiveConf.AdditionalLibraries = New-Object $('Microsoft.WindowsAzure' + `
  '.Management.HDInsight.Cmdlet.DataObjects.AzureHDInsightDefaultStorageAccount')
$storageAccount = $(Get-AzureSubscription -Current).CurrentStorageAccountName
$hiveConf.AdditionalLibraries.StorageAccountName ="$storageAccount.blob.core.windows.net"
$hiveConf.AdditionalLibraries.StorageAccountKey = (Get-AzureStorageKey $storageAccount).Primary
$hiveConf.AdditionalLibraries.StorageContainerName = $hiveLibsContainer.Name
$clusterContainer = New-AzureStorageContainer $clusterName
$clusterConf = New-AzureHDInsightClusterConfig -ClusterSizeInNodes 1 |
  Set-AzureHDInsightDefaultStorage -StorageAccountName $storageAccount `
  -StorageAccountKey $(Get-AzureStorageKey $storageAccount).Primary -StorageContainerName $clusterContainer.Name
$clusterConf = $clusterConf | Add-AzureHDInsightConfigValues -Hive $hiveConf
$cluster = $clusterConf | New-AzureHDInsightCluster -Credential $creds `
  -Name $clusterName -Location $(Get-AzureStorageAccount $storageAccount).Location
        

If all goes well, this should take around 10 minutes then you'll have a newly minted single-node HDInsight cluster to play with.

Exploring the data using Hive

The easiest way to explore this data is to use Hive with its SQL-like query language (you can also use Map-Reduce directly or use Pig or other Hadoop ecosystem components, but I'll ignore those for this post). The main trick we use to expose Azure tables to Hive is to have a custom Storage Handler for it, so we need to use the STORED BY clause to specify that the table data resides in an Azure table. Then we use the TBLPROPERTIES clause to specify the information about the table.

 $hiveQuery = "CREATE EXTERNAL TABLE ExampleTable(IntValue int)
 STORED BY 'com.microsoft.hadoop.azure.hive.AzureTableHiveStorageHandler'
 TBLPROPERTIES(
  ""azure.table.name""=""$($exampleTable.Name)"",
  ""azure.table.account.uri""=""https://$storageAccount.table.core.windows.net"",
  ""azure.table.storage.key""=""$((Get-AzureStorageKey $storageAccount).Primary)"");"
Out-File -FilePath .\HiveCreateTable.q -InputObject $hiveQuery -Encoding ascii
$hiveQueryBlob = Set-AzureStorageBlobContent -File .\HiveCreateTable.q -Blob "queries/HiveCreateTable.q" `
  -Container $clusterContainer.Name -Force
$createTableJobDefinition = New-AzureHDInsightHiveJobDefinition -QueryFile /queries/HiveCreateTable.q
$job = Start-AzureHDInsightJob -JobDefinition $createTableJobDefinition -Cluster $cluster.Name
Wait-AzureHDInsightJob -Job $job
        

This should create the Hive table for us, so now we can finally explore it! For example, to get the average of the IntValue column:

 $avgJobDefinition = New-AzureHDInsightHiveJobDefinition -Query "SELECT AVG(IntValue) FROM ExampleTable"
$job = Start-AzureHDInsightJob -JobDefinition $avgJobDefinition -Cluster $cluster.Name
Wait-AzureHDInsightJob -Job $job
Get-AzureHDInsightJobOutput -JobId $job.JobId -Cluster $cluster.Name
        

Behind the scenes

Let me air out the dirty laundry and explain how this all works. The main way I expose Azure tables data to Hadoop is by having an InputFormat defined for it. An InputFormat for Hadoop has two main responsibilities:

  1. Figure out how to split the input into chunks that can be processed separately by individual mappers (thus getting the distributed processing power of Hadoop).
  2. For each split, figure out how to read the data.

The second responsibility for our purposes is just grunt work: we just use the excellent Azure Storage SDK to read the data and pass it on as WritableEntity objects (in Hive, the StorageHandler understands those and maps them to columns). The first one though is a bit trickier: there is no universally good way to partition data to get the most out of our cluster. Ideally, if you have a 100 nodes (each with 4 mapper slots) processing your data, you want at least 400 splits among your mappers. Also, you want each mapper's data to fit in memory, so you don't want too many rows per mapper (also because if a mapper runs for an hour then fails it'll have to redo that all over again). But you don't want each mapper to have too little to do either: then your job run time will be dominated by overhead from starting/stopping all these mappers. So in short: it's an art to tune this partitioning step, and I certainly don't have it right by default for most real-world situations. By default: I query for all the partition keys in the table (one-by-one, so it's a slow process if there are many keys), then split the table by assigning each mapper to one of those values. If you want to customize the partitioning for your data, you can do that by using the configuration knob 'azure.table.partitioner.class' to specify a custom partitioner that you can provide.

Hope that helps! Please leave comments here and/or create issues/pull requests on the project in github if you have any suggestions/feedback.