Problems When Using a Shared Default Storage Container with Multiple HDInsight Clusters

We have seen several Microsoft Support cases that turned out to be caused by multiple HDInsight clusters using the same Azure Blob Storage container for default storage. While we don't currently block you from creating clusters that use the same default storage container, we do know that this can cause some specific problems. Many folks have been asking whether this configuration is supported, and the short answer is that it is not.

When it comes to determining whether a particular setup is supportable, we typically look at whether the configuration is tested and proven to work reliably. Since HDInsight is based on Apache Hadoop, this is obviously a bit more complex. If you look out into the Hadoop ecosystem, there is not much precedent for primary storage being shared between multiple clusters. It just happens to be easy to manually configure HDInsight clusters in this way, and some customers have chosen to do so because it provides convenient access to shared data in the container. The problems may not manifest for many days or weeks, depending on specific timing conditions around job completion and background maintenance, so the configuration can appear to be working just fine for a while.

The types of problems that we have seen center around errors retrieving job status, which can cascade into unexpected errors, hangs, or delays in Hive, Pig, WebHCat/Templeton, and Oozie. Each of these frameworks has different error handling and retry logic, so the ways in which the problems surface vary widely.

What this means is that if you are using a shared default container between multiple HDInsight clusters and you call in to support, we will ask you to eliminate the shared default container configuration as a first troubleshooting step.
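If you manage several clusters, it can help to check for this configuration before you run into trouble. Here is a minimal sketch of such a check in Python. It assumes you have already collected each cluster's default storage account and container into a list of dictionaries; the field names (name, default_account, default_container) and the sample values are hypothetical and do not come from any HDInsight API.

```python
from collections import defaultdict


def find_shared_default_containers(clusters):
    """Return default containers that more than one cluster points at.

    `clusters` is a list of dicts with the hypothetical keys
    "name", "default_account", and "default_container".
    The result maps (account, container) to the sorted cluster names.
    """
    users = defaultdict(list)
    for cluster in clusters:
        # Compare case-insensitively so "Data" and "data" count as the same.
        key = (
            cluster["default_account"].lower(),
            cluster["default_container"].lower(),
        )
        users[key].append(cluster["name"])
    # Keep only the containers used as default storage by two or more clusters.
    return {key: sorted(names) for key, names in users.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    inventory = [
        {"name": "etl-cluster", "default_account": "contosodata", "default_container": "shared"},
        {"name": "adhoc-cluster", "default_account": "contosodata", "default_container": "shared"},
        {"name": "reporting-cluster", "default_account": "contosodata", "default_container": "reporting"},
    ]
    for (account, container), names in find_shared_default_containers(inventory).items():
        print(f"{account}/{container} is the default container for: {', '.join(names)}")
```

Any container this reports is a candidate for being moved out of the default storage role and attached as additional storage instead.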

If you need to use a shared container to provide access to data for multiple HDInsight clusters then you should add it as an Additional Storage Account in the cluster configuration. This option is available when using the Azure Portal, PowerShell (Add-AzureHDInsightStorage), or the SDK (AdditionalStorageAccounts) to provision clusters.
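Once the shared container is attached as additional storage, jobs typically refer to its data with a fully qualified wasb:// URI rather than a path relative to the default container. The small Python helper below sketches how such a URI is put together. It assumes the public Azure endpoint suffix blob.core.windows.net; the function name and the sample account and container names are made up for illustration.

```python
def wasb_uri(account, container, path="", secure=False):
    """Build a fully qualified WASB URI for a blob path.

    Assumes the public Azure cloud endpoint suffix
    (blob.core.windows.net). `account` and `container` are the
    storage account and container names; `path` is the blob path
    inside the container.
    """
    scheme = "wasbs" if secure else "wasb"
    # Drop leading slashes so the URI has exactly one after the host.
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


if __name__ == "__main__":
    # Hypothetical account and container names, for illustration only.
    print(wasb_uri("contososhared", "shareddata", "logs/2015/"))
```

Each cluster keeps its own default container for job status and other cluster-specific files, and every cluster that has the shared account attached can use the same URI to reach the shared data.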

Note: For detailed information about how HDInsight uses Blob storage check out: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-blob-storage/