HDInsight working with different storage accounts

Storage accounts - configured and otherwise

When you create an HDInsight cluster, whether through the Azure portal or programmatically (e.g. through PowerShell), you get a chance to add extra storage accounts to the cluster. In PowerShell it looks something like this:

 $clusterConfig = New-AzureHDInsightClusterConfig -ClusterSizeInNodes 4
$allAccounts = @('myfirstaccount','mysecondaccount','mythirdaccount')
$clusterConfig = Set-AzureHDInsightDefaultStorage $clusterConfig `
  -StorageAccountName $allAccounts[0] `
  -StorageAccountKey $(Get-AzureStorageKey $allAccounts[0]).Primary `
  -StorageContainerName 'mycontainer'
$allAccounts |
  Select-Object -Skip 1 |
  %{$clusterConfig = Add-AzureHDInsightStorage $clusterConfig `
  -StorageAccountName $_ `
  -StorageAccountKey $(Get-AzureStorageKey $_).Primary}

That code snippet sets up the default storage account/container (where the cluster's data will go by default), but it also configures additional storage accounts that the cluster can access in its jobs later.

A common question we get, though, is: I already have a cluster configured, but I want to read some data from (or, less frequently, write some data to) a storage account that I didn't set up this way when I created my cluster - is that possible? The answer is: it's complicated, which is why I'm dedicating this post to a detailed explanation of that slippery answer.

So when you say "configured" - what do you mean exactly?

Before jumping into what's possible and what isn't, it really helps to explain what exactly happens when a storage account is configured. Within the WASB driver that accesses data in Azure blob storage, when we try to access data in a container in a storage account, we go through a series of decisions based on our available configuration to determine how we'll try to access it:

  1. If we have a Shared Access Signature for the container, we'll use that.
  2. Else if we have a key for the whole account, we'll use that.
  3. Else we'll try to get in with anonymous access, and hope the container is a public container.

The key phrase in the preceding part is "available configuration". When a cluster is created with some storage accounts as shown in the introduction, the HDInsight infrastructure does some plumbing work so that the account keys are placed (encrypted - more on that later) in the "core-site.xml" file on every node in the cluster. Since that's the configuration file loaded by default by almost all Hadoop ecosystem components, these keys become part of the available configuration for almost everything you do in HDInsight. But if you know how Hadoop configuration works, you'll know that this isn't the only way things can get into the available configuration. For every job you launch, you're allowed to define whatever you want in that job's configuration, which means you can sneak additional storage accounts' keys in there!
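
To make that decision chain concrete, here's an illustrative sketch (not something the cluster generates for you) of the configuration property names the WASB driver looks for, written as a PowerShell hashtable with placeholder values; "youraccount" stands in for your own account name, and the SAS property name follows the form discussed at the end of this post:

 # Illustrative sketch: the configuration properties the WASB driver consults,
# in the order of the decision chain above ('youraccount' is a placeholder).
$wasbCredentialConfig = @{
  # 1. A Shared Access Signature for the container
  'fs.azure.sas.youraccount.blob.core.windows.net' = '<container SAS token>';
  # 2. A key for the whole account
  'fs.azure.account.key.youraccount.blob.core.windows.net' = '<account key>';
  # 3. If neither is present, anonymous access is attempted
}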

Alright, show me that word count!

So let's try this out. The easiest way is through a Map-Reduce job, and instead of writing a new one, I'll ask you to bear with me through yet another use of the well-worn WordCount example. If you want to follow along with this example, you'll need an Azure subscription, the Azure SDK installed with that subscription selected as default in PowerShell, an HDInsight cluster, and a storage account that wasn't added to that cluster when it was created. I'll assume the following input variables are defined in your PowerShell session:

 # Inputs
$clusterName = 'yourclustername'
$wantedAccountName = 'youraccount'
$wantedContainerName = 'yourcontainer'
$localPlayDirectory = 'C:\LocalPlay'

where $localPlayDirectory is just a directory we'll use for files we upload to/download from the cluster, and the rest are self-explanatory. With that out of the way, we need to get the account key, create a storage context we'll use for it, and make sure the container is there:

 # Preliminaries: Get the key, create the context, create the container
$wantedAccountKey = $(Get-AzureStorageKey $wantedAccountName).Primary
$wantedAccountContext = New-AzureStorageContext `
  -StorageAccountName $wantedAccountName `
  -StorageAccountKey $wantedAccountKey
$container = Get-AzureStorageContainer $wantedContainerName `
 -Context $wantedAccountContext `
 -ErrorAction SilentlyContinue
If ($container -eq $null)
{
  $container = New-AzureStorageContainer -Context $wantedAccountContext `
    -Name $wantedContainerName
}

And we'll create a simple input blob in the container that we'll count the words for:

 # Preliminaries: Create an input blob
$localPlayDirectory = md $localPlayDirectory -Force
$inputFile = "$localPlayDirectory\simple.txt"
'This is an interesting word count demo. It even has two of This' |
  Out-File $inputFile -Encoding ascii -Force
$inputBlobName = 'input/simple.txt'
$inputBlob = Set-AzureStorageBlobContent -Context $wantedAccountContext `
  -Container $wantedContainerName -Blob $inputBlobName -File $inputFile

I'll also define a function that I'll use to run jobs later:

 function Run-PlayJob($jobDefinition, $outputDirectory = $null)
{
  If ($outputDirectory -ne $null)
  {
    # Delete the output directory if it exists from a previous run
    $outputBlobs = Get-AzureStorageBlob -Context $wantedAccountContext `
      -Container $wantedContainerName `
      -Prefix "$outputDirectory" | Remove-AzureStorageBlob
  }
  # Run the job and wait for it to finish
  $job = Start-AzureHDInsightJob -Cluster $clusterName `
    -JobDefinition $jobDefinition
  $job = Wait-AzureHDInsightJob -Job $job
  # Get the standard error stream to see the diagnostics output
  Get-AzureHDInsightJobOutput -Cluster $clusterName `
    -JobId $job.JobId -StandardError
  If ($outputDirectory -ne $null)
  {
    # Download all the output blobs
    $outputBlobs = Get-AzureStorageBlob -Context $wantedAccountContext `
      -Container $wantedContainerName `
      -Prefix "$outputDirectory/part-r" |
      Get-AzureStorageBlobContent -Context $wantedAccountContext `
      -Destination $localPlayDirectory -Force
    # Dump them into stdout
    Get-ChildItem "$localPlayDirectory\$outputDirectory" |
      Get-Content
  }
  Else
  {
    # No output directory, get the stdout of the job
    Get-AzureHDInsightJobOutput -Cluster $clusterName `
      -JobId $job.JobId -StandardOutput
  }
}

OK, with the preliminaries finally out of the way, we get to the crux of this. Here is how we define a WordCount job that reads its input from an alien (unconfigured) storage account, and writes its output there as well:

 # Map-Reduce example
$fullAccountName = "$wantedAccountName.blob.core.windows.net"
$wordCountDef = New-AzureHDInsightMapReduceJobDefinition `
  -JarFile '/example/jars/hadoop-mapreduce-examples.jar' `
  -ClassName 'wordcount' `
  -Arguments "wasb://$wantedContainerName@$fullAccountName/$inputBlobName", `
   "wasb://$wantedContainerName@$fullAccountName/mroutput" `
  -Defines @{"fs.azure.account.key.$fullAccountName"=$wantedAccountKey}
Run-PlayJob $wordCountDef 'mroutput'

If you've been following along, running the above should give you a nice word count for this file, even though it lives in an account the cluster wasn't configured with - which is awesome! A few things to note:

  1. Note the "-Defines" section, where we defined the key for our wanted account. We could've defined even more keys for more accounts, or we could've defined shared access signatures instead (see the sketch after this list).
  2. Note the use of fully qualified WASB URIs for the input/output files, so we can specify files outside the default account/container.
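
As a quick illustration of that first note, here's a hedged sketch - using hypothetical account names that aren't part of the example above - of building a -Defines hashtable with keys for several extra accounts:

 # Sketch: keys for several extra (hypothetical) accounts in one hashtable.
$extraAccounts = @('myotheraccount1','myotheraccount2')
$extraDefines = @{}
$extraAccounts | ForEach-Object {
  $key = $(Get-AzureStorageKey $_).Primary
  $extraDefines["fs.azure.account.key.$($_).blob.core.windows.net"] = $key
}
# Pass $extraDefines to New-AzureHDInsightMapReduceJobDefinition via -Defines,
# just like the single key above.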

OK, but can Pigs fly?

Well, if that's all there is to it, why did I say in the beginning that the answer is complicated? So far it seems to be a simple "yes", right? It would be, if HDInsight only exposed pure Map-Reduce, but of course there are other things in there that you can run. So would this work for everything? Well, let's try Pig!

Fortunately I can just reuse all the preliminaries I did in the Map-Reduce section, and jump right to the Pig job definition here:

 # Pig example
$pigScript = "SET fs.azure.account.key.$fullAccountName '$wantedAccountKey';
LOADED = LOAD 'wasb://$wantedContainerName@$fullAccountName/$inputBlobName';
TOKENIZED = foreach LOADED generate flatten(TOKENIZE((chararray)`$0)) as word;
GROUPED = group TOKENIZED by word;
COUNTED = foreach GROUPED generate COUNT(TOKENIZED), group;
store COUNTED into 'wasb://$wantedContainerName@$fullAccountName/pigoutput';"
$pigDef = New-AzureHDInsightPigJobDefinition `
 -Query $pigScript.Replace("`n", "").Replace("`r", "")
Run-PlayJob $pigDef 'pigoutput'

OK, first try it out if you're bravely following along hands-on. Back? You should've seen this succeed with flying colors as well, because Pig is indeed very amenable to this treatment. The code looks a bit different, but it's essentially the same idea: we use SET instead of -Defines to pass the configuration value we want, and we still use the same fully qualified URIs. So far so good.

Hive, where our lucky streak runs out

Still not seeing the "it's complicated" reasoning? Well, now we get to why it's not all butterflies and rainbows. Let's try this with Hive:

 # Hive example - doesn't work unless metastore has the key
$hiveScript = "SET fs.azure.account.key.$fullAccountName=$wantedAccountKey;
DROP TABLE IF EXISTS FromMyAccount;
CREATE EXTERNAL TABLE FromMyAccount
(AllOfIt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 'wasb://$wantedContainerName@$fullAccountName/input';
SELECT * FROM FromMyAccount;"
$hiveDef = New-AzureHDInsightHiveJobDefinition `
 -Query $hiveScript.Replace("`n", "").Replace("`r", "")
Run-PlayJob $hiveDef

If you try running this, you'll finally hit the dreaded error:

Failed with exception java.io.IOException:org.apache.hadoop.fs.azure.AzureException: org.apache.hadoop.fs.azure.AzureException: Container yourcontainer in account youraccount.blob.core.windows.net not found, and we can't create it using anoynomous credentials.

What this is really saying is that WASB didn't find the key for the account in its configuration (the reason it doesn't just say that becomes clearer if you follow the decision chain I outlined in the beginning: it's the last step of that chain, where it tries to access the data using anonymous credentials). But why? We used SET to provide this configuration - what's it complaining about? The reason lies in the stateful nature of Hive. Unlike Pig (which is stateless), Hive keeps metadata in a service called the metastore (backed by a SQL database). The problem is that the "CREATE TABLE" statement runs in the metastore service, which doesn't see the SET statement and so doesn't get the configuration from it. To the best of my knowledge there is no supported way to pass this configuration value to the metastore service, so unfortunately the only supported way to get this scenario going is to drop and recreate your cluster with that account configured from the start.

Now, you'll notice I qualified that statement with "supported". If you really want to live on the edge and get this to work temporarily, you can still do it. I won't show detailed step-by-step instructions, but roughly:

  1. Enable Remote Desktop on the cluster (this can only be done in the Azure portal).
  2. Log into the headnode. Make sure it's the active headnode, not the passive one (you can check which services are running, or look at the "Hadoop Service Availability" page on the desktop). If it's the passive one, you need to log out, modify the RDP file you downloaded to point to the other headnode instance, and log in there.
  3. Modify the core-site.xml under the "C:\apps\dist\hadoop-xxx\etc\hadoop" directory and add the key configuration in there (see the sketch after this list).
  4. Restart the metastore service (Restart-Service metastore).
  5. Rerun the script from above; it should work now.
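
For step 3, this is roughly what the addition looks like - an unsupported, illustrative sketch where "hadoop-xxx" stands for whatever Hadoop version directory your cluster actually has, and the account/key values are placeholders:

 # Unsupported, illustrative sketch for step 3: the <property> element to add
# inside <configuration> in core-site.xml on the active headnode.
$coreSitePath = 'C:\apps\dist\hadoop-xxx\etc\hadoop\core-site.xml'
$propertyToAdd = @"
<property>
  <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
  <value>youraccountkey</value>
</property>
"@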

This really isn't the best idea in the long run though: the minute this VM is reimaged by Azure, or HDInsight switches to the other headnode for whatever other reason, this will stop working.

Update (9/9/2014): Thanks to Ashit Gosalia from the HDInsight team for pointing me to this awesome article by Eric Hanson that describes an ingenious way around this problem: instead of hitting the metastore service that doesn't have this key configured, we can bypass it and just tell Hive to get the metadata from a SQL Azure instance directly! This does require you to set up your own SQL Azure DB and configure it as a metastore, so putting it all together it looks like this:

 # More inputs
$sqlAdminUser = '{fill in}'
$sqlAdminPassword = '{fill in but avoid exclamation marks}'
$metastoreDatabaseName = 'ExternalMetastore'

# Preliminaries
$location = $(Get-AzureHDInsightCluster $clusterName).Location
$sqlScriptLocation = `
 'https://raw.githubusercontent.com/apache/hive/branch-0.13/' + `
 'metastore/scripts/upgrade/mssql/hive-schema-0.12.0.mssql.sql'
$sqlScript = $(Invoke-WebRequest $sqlScriptLocation).Content

# Create SQL Azure metastore DB
$sqlServer = New-AzureSqlDatabaseServer -Location $location `
 -AdministratorLogin $sqlAdminUser -AdministratorLoginPassword $sqlAdminPassword
$sqlDatabase = New-AzureSqlDatabase -ServerName $sqlServer.ServerName `
 -DatabaseName $metastoreDatabaseName -Collation 'SQL_Latin1_General_CP1_CI_AS' `
 -Edition Web
$AllowEveryoneFirewallRule = New-AzureSqlDatabaseServerFirewallRule `
 -ServerName $sqlServer.ServerName -RuleName AllowEveryone `
 -StartIpAddress 0.0.0.0 -EndIpAddress 255.255.255.255
$AllowAzureFirewallRule = New-AzureSqlDatabaseServerFirewallRule `
 -ServerName $sqlServer.ServerName -RuleName AllowAzure -AllowAllAzureServices
Invoke-Sqlcmd -Query $sqlScript `
 -ServerInstance "$($sqlServer.ServerName).database.windows.net,1433" `
 -Username $sqlAdminUser -Password $sqlAdminPassword -Database $sqlDatabase.Name
             
# Hive script pointing to my blob
$hiveScript = "
DROP TABLE IF EXISTS FromMyAccount;
CREATE EXTERNAL TABLE FromMyAccount
(AllOfIt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 'wasb://$wantedContainerName@$fullAccountName/input';
SELECT * FROM FromMyAccount;"

# Defines for Metastore redirect and key
$connectionUrl = `
 "jdbc:sqlserver://$($sqlServer.ServerName).database.windows.net;" + `
 "database=$($sqlDatabase.Name);" + `
 "encrypt=true;trustServerCertificate=true;create=false"
$defines = @{
  "fs.azure.account.key.$fullAccountName" = $wantedAccountKey;
  "hive.metastore.uris" = "";
  "javax.jdo.option.ConnectionURL" = $connectionUrl;
  "javax.jdo.option.ConnectionUserName" = "$sqlAdminUser@$($sqlServer.ServerName)";
  "javax.jdo.option.ConnectionPassword" = "$sqlAdminPassword";
}
# Full Hive job definition
$hiveDef = New-AzureHDInsightHiveJobDefinition `
 -Query $hiveScript.Replace("`n", "").Replace("`r", "") `
 -Defines $defines

# Run the job
Run-PlayJob $hiveDef

Security implications

So in this last section I want to discuss some of the security implications of what I explained here. By default, HDInsight encrypts the storage keys for all the initial accounts using a certificate generated for the cluster. This is good, because these keys give wide access to everything in those accounts, so the more carefully guarded they are the better. Unfortunately, this certificate isn't available outside the cluster, so we couldn't do the same encryption ourselves, which means that whenever we passed those keys around we passed them in plain text (the REST API is over SSL, so it's not really plain text over the wire). Also, all the job configuration is stored in job history files in the default container for the cluster, so if you run any of the example jobs here and then look at your default container, you'll find a blob called something like "mapred/history/done/.../job_xxx_conf.xml" which contains those keys in plain sight (to anyone with access to that container).
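
If you want to see this for yourself, here's a small sketch; $defaultAccountName and $defaultContainerName are assumptions standing in for your cluster's default storage account and container (they weren't defined earlier in this post):

 # Sketch: list job history configuration blobs in the cluster's default container.
# $defaultAccountName and $defaultContainerName are assumed, not defined earlier.
$defaultContext = New-AzureStorageContext `
  -StorageAccountName $defaultAccountName `
  -StorageAccountKey $(Get-AzureStorageKey $defaultAccountName).Primary
Get-AzureStorageBlob -Context $defaultContext -Container $defaultContainerName `
  -Prefix 'mapred/history/done' |
  Where-Object { $_.Name -like '*_conf.xml' } |
  Select-Object Name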

So if you're running your job in someone else's cluster, you really need to assume they'll be able to access your keys if you use the techniques described here. And even if it's your own cluster, it's fairly sloppy key management to have your keys lying around like that in plain text (though I've been guilty of that myself for not-so-sensitive accounts).

A decent mitigation is to use shared access signatures instead of account keys. The basics are the same, except the configuration knob you want is "fs.azure.sas.youraccount.blob.core.windows.net" instead of "fs.azure.account.key.youraccount.blob.core.windows.net". This way, if the key is compromised, it's at least a granular credential scoped to a single container instead of the whole account. Of course, for really sensitive data you're probably better off just configuring the cluster with that account from the start so the keys are encrypted from the beginning.
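
To close, here's a hedged sketch of that approach, reusing the variables from the earlier sections; note the SAS includes write/delete permissions because the earlier examples write their output back to the same container:

 # Sketch: use a container SAS instead of the account key.
$sasToken = New-AzureStorageContainerSASToken -Name $wantedContainerName `
  -Context $wantedAccountContext -Permission rwdl `
  -ExpiryTime $(Get-Date).AddDays(1)
$sasDefines = @{"fs.azure.sas.$fullAccountName" = $sasToken}
# Pass $sasDefines (instead of the account key define) to
# New-AzureHDInsightMapReduceJobDefinition, or via SET in the Pig script.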