Maximizing HDInsight throughput to Azure Blob Storage

The HDInsight service supports both HDFS and Windows Azure Storage (BLOB Service) for storing data. Using BLOB Storage with HDInsight gives you low-cost, redundant storage, and allows you to scale your storage needs independently of your compute needs. However, Windows Azure Storage allocates bandwidth to a storage account that can be exceeded by HDInsight clusters of sufficient size. If this occurs, Windows Azure Storage will throttle requests. This article describes when throttling may occur and how to maximize throughput to BLOB Storage by avoiding throttling.

Note: In HDInsight, HDFS is intended to be used as a cache or for intermediary storage. When a cluster is deleted, data in HDFS will be discarded. Data intended for long-term storage should be stored in Windows Azure Storage (BLOBS).

Overview

If you run a heavy I/O workload on an HDInsight cluster of sufficient size*, reads and/or writes may be throttled by Windows Azure Storage. Throttling can result in jobs running slowly, tasks failing, and (in rare cases) jobs failing. Throttling occurs when the aggregate load that a cluster puts on a storage account exceeds the allotted bandwidth for the storage account. To address this, HDInsight clusters have a tunable self-throttling mechanism that can slow read and/or write traffic to a storage account. The self-throttling mechanism exposes two parameters: fs.azure.selfthrottling.read.factor and fs.azure.selftthrottling.write.factor. These parameters govern the rate of read and write traffic from an HDInsight cluster to a storage account. Values for these parameters are set at job submission time. Values must be in the range (0, 1], where 1 corresponds to no self-throttling, 0.5 corresponds to roughly 1/2 the unrestricted throughput rate, and so on. Conservative default values for these parameters are set based on cluster size at cluster creation time ("conservative" here means that values are such that throttling is highly unlikely to occur at all, but bandwidth utilization may be below the allocated amount). To arrive at optimal values for the self-throttling parameters, you should turn on storage account logging prior to running a job, analyze the logs to understand if/when throttling occurred, and adjust the parameter values accordingly.

Note: We are currently working on ways for a cluster to self-tune its throughput rate to avoid throttling and maximize bandwidth utilization.

* The number of nodes required to trigger throttling by Windows Azure Storage depends on whether geo-replication is enabled for the storage account (because bandwidth allocation is different for each case). If geo-replication is enabled, clusters with more than 7 nodes may encounter throttling. If geo-replication is not enabled, clusters with more than 10 nodes may encounter throttling.

What is throttling?

Limits are placed on the bandwidth allocated to Windows Azure Storage accounts to guarantee high availability for all customers. Limiting bandwidth is done by rejecting requests to a storage account (HTTP response 500 or 503) when the request rate exceeds the allocated bandwidth. Windows Azure Storage imposes the following bandwidth limits on a single storage account:

  • Bandwidth for a Geo Redundant storage account (geo-replication on)
    • Ingress - up to 5 gigabits per second
    • Egress - up to 10 gigabits per second
  • Bandwidth for a Locally Redundant storage account (geo-replication off)
    • Ingress - up to 10 gigabits per second
    • Egress - up to 15 gigabits per second

Note that these limits are subject to change. For more information, see Windows Azure’s Flat Network Storage and 2012 Scalability Targets. For information about enabling or disabling geo-replication for a storage account, see How to manage storage accounts.

When will my cluster be throttled?

An HDInsight cluster will be throttled if/when its throughput rates to Windows Azure Storage exceed those stated above. Throughput, in turn, is dependent on the nature of the job being run. Perhaps the best way to understand in advance if a job will encounter throttling is by comparing it to a well-known workload, the Terasort benchmark. With the fs.azure.selfthrottling.read.factor and fs.azure.selftthrottling.write.factor  parameters each set to 1 (i.e. no self-throttling), HDInsight clusters generally encounter throttling during the Teragen and Teravalidate phases of the Terasort workload* under the following conditions:

  • Geo-replication for the storage account is on and the cluster has more than 15 nodes, or
  • Geo-replication for the storage account is off and the cluster has more than 31 nodes.

These numbers are for reference only. A cluster will only encounter throttling if the job that it is running produces throughput in excess of that allocated for the storage account.

* Run with 4 map slots and 2 reduce slots.

How do I know my cluster is being throttled?

Initial indications that a cluster workload is being throttled by Windows Azure Storage may include the following:

  • Longer-than-expected job completion times
  • A high number of task failures
  • Job failures (in rare cases). If this occurs, task-attempt error messages will be of the form “java.io.IOException … caused by com.microsoft.windowsazure.services.core.storage.StorageException: The server encountered an unknown failure: The server is busy.”

While the above are indications that your cluster is being throttled, the best way to understand if your workload is being throttled is by inspecting responses returned by Windows Azure Storage. Responses with response code (http status code) of 500 or 503 indicate that a request has been throttled. One way to collect WA Storage responses is to turn on storage logging (https://www.windowsazure.com/en-us/manage/services/storage/how-to-monitor-a-storage-account/#configurelogging).

How can throttling be avoided?

If you have a workload that encounters throttling, there are three ways avoid it:

  1. Reduce your cluster size
  2. Adjust the settings that control the cluster’s self-throttling mechanism
  3. Request an increase in bandwidth allocated for your storage account.

The sections below go into more detail.

Reduce your cluster size

The first question to answer in avoiding throttling by Windows Azure Storage is this: Do I need all the CPUs in my cluster? In many cases, the answer here might be yes (e.g. the Terasort benchmark), in which case you can skip this section. However, some workloads that are truly I/O dominant may not require the CPUs available in a large cluster. By reducing the number of nodes in your cluster, you can reduce the load on storage and (potentially) avoid throttling (in addition to saving money!).

Adjust settings that control self-throttling

The fs.azure.selfthrottling.read.factor and fs.azure.selftthrottling.write.factor settings control the rate at which an HDInsight cluster reads and writes to Windows Azure Storage. Values for these settings must be in the range (0, 1], where 1 corresponds to no self-throttling, 0.5 corresponds to roughly 1/2 the unrestricted throughput rate, and so on. Default values for these settings are determined at cluster creation time according to the following formulas (n = number of nodes in cluster):

fs.azure.selfthrottling.read/write.factor = 1, n <= 7

fs.azure.selfthrottling.read/write.factor = 32/(5n), n > 7

The formula for n > 7 is conservative, based on the “worst-case” scenario (for a storage account with geo-replication enabled) in which the throughput capacity for each node in the cluster is maximized. In practice, this is rare. You can override the default values for these settings at job submission time. Depending on your workload, you may find that increasing the value for either or both of these settings when you submit a job improves job performance. However, increasing the default value by too much may result in throttling by Windows Azure Storage.

How do high latencies affect the self-throttling mechanism?

One of the assumptions built into the self-throttling mechanism is that end-to-end request latencies are low (in the 500ms to 1000ms range). If this assumption does not apply, bandwidth utilization may be low and/or jobs may take longer-than-expected to complete. In this case, increasing the values for fs.azure.selfthrottling.read.factor and fs.azure.selfthrottling.write.factor (within the range of (0, 1]) may improve performance.

That’s it for today. I’d be interested in feedback on this feature, so please use the comments below. And, as I mentioned earlier, we are currently working on ways for a cluster to self-tune its throughput rate to avoid throttling and maximize bandwidth utilization without the need for any manual intervention.

Thanks.

-Brian