Developers often ask us: "What is an Azure Data Lake Analytics Unit? How does it affect my U-SQL job? How many do I need for my U-SQL job?" We will answer all of these questions in this blog post.
Introducing the Analytics Unit
An Azure Data Lake Analytics Unit (AU) is a unit of compute resources in Azure Data Lake. U-SQL jobs use compute resources in the form of AUs to execute their work. Each AU provides a fixed number of CPU cores and a fixed amount of memory. Currently, an AU is the equivalent of 2 CPU cores and 6 GB of RAM.
In the future, as we see how developers want to use Azure Data Lake, we may change the definition of an AU. Or, we may provide more options for controlling CPU and memory usage.
How AUs are used during U-SQL Query Execution
When you submit a U-SQL job, you specify three things:
- The U-SQL script
- The input data that the U-SQL script will use - this is identified in the U-SQL script
- The number of AUs to reserve for executing the job
The U-SQL compiler and optimizer look at the U-SQL script and the input data and create a "plan" to execute the work. The plan itself is divided into smaller tasks - each task is called a vertex. The smallest possible U-SQL job consists of one vertex. A large U-SQL job can consist of many thousands of vertices.
When a job starts, its AUs are allocated within seconds. To run a vertex, an AU is assigned to that vertex. When the vertex is finished, the AU is assigned to another vertex. In this manner, the job is completed vertex-by-vertex. Of course, multiple vertices can run at the same time, so multiple AUs may be in use at the same time. When the job is finished, all of its AUs are immediately released for use by other jobs.
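The vertex-by-vertex execution described above can be sketched with a small simulation. This is only an illustration under simplifying assumptions, not how the service actually schedules work: the function name `simulate_job` is hypothetical, every vertex is assumed to be independent of the others (real plans have dependencies between vertices), and the vertex durations are made-up inputs.

```python
import heapq


def simulate_job(vertex_durations, allocated_aus):
    """Return (wall-clock seconds, peak AUs in use) for a simplified job.

    Assumption: vertices are independent and are started in list order
    as soon as an AU is free. Real U-SQL plans have dependencies.
    """
    if allocated_aus < 1:
        raise ValueError("a job needs at least 1 AU")
    # Each heap entry is the time at which a busy AU becomes free again.
    busy_until = []
    peak_in_use = 0
    now = 0.0
    for duration in vertex_durations:
        if len(busy_until) == allocated_aus:
            # All AUs are busy: wait for the earliest vertex to finish,
            # then reuse that AU for the next vertex.
            now = heapq.heappop(busy_until)
        heapq.heappush(busy_until, now + duration)
        peak_in_use = max(peak_in_use, len(busy_until))
    finish = max(busy_until) if busy_until else 0.0
    return finish, peak_in_use


# Ten 10-second vertices on 1, 5, and 100 AUs.
for aus in (1, 5, 100):
    seconds, peak = simulate_job([10] * 10, aus)
    print(f"{aus} AUs: finished in {seconds:.0f}s, peak AUs in use: {peak}")
```

In this toy model, adding AUs shortens the job only until every vertex has its own AU; beyond that point the extra AUs are never used.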
AUs versus Vertices
When a vertex is run, it uses the compute resources of 1 AU. If a developer specifies N AUs to be allocated to the job, in effect this means that a maximum of N vertices can run at any given moment.
Consider a job that has many vertices but is allocated only 1 AU. This results in only one active vertex at any given point.
Now consider a job that has 10 vertices but is allocated 100 AUs. The job is considered "over-allocated" because it could, in theory, only ever use 10 AUs at any given moment - so 90 AUs are being wasted.
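The over-allocation arithmetic can be written down as a tiny helper. This is a sketch only: the name `allocation_summary` is hypothetical, and it assumes you already know the largest number of vertices the job can run at once.

```python
def allocation_summary(max_parallel_vertices, allocated_aus):
    """Report how many allocated AUs a job can actually keep busy.

    Assumption: each running vertex occupies exactly 1 AU, so the job
    can never use more AUs than it has vertices able to run at once.
    """
    usable = min(max_parallel_vertices, allocated_aus)
    idle = allocated_aus - usable
    return {"usable_aus": usable, "idle_aus": idle, "over_allocated": idle > 0}


# The 10-vertex job from the text, allocated 100 AUs.
print(allocation_summary(max_parallel_vertices=10, allocated_aus=100))
```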
What is an AU Second?
An AU Second is the unit used to measure the compute resources used for a job. For example:
- 1 AU allocated to a job that executes for 1 second is equivalent to 1 AU Second.
- 1 AU allocated to a job that executes for 100 seconds is equivalent to 100 AU Seconds.
- 10 AUs allocated to a job that executes for 5 minutes (300 seconds) is equivalent to 3,000 AU Seconds.
A U-SQL job is billed by the number of AU Seconds it consumes. The current price of an AU Second and related options are listed on the Azure Data Lake Analytics pricing page.
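As a quick sanity check, AU Seconds are simply allocated AUs multiplied by execution time in seconds. The sketch below assumes exactly that; the function names are hypothetical, and `price_per_au_second` is a placeholder you would fill in from the current pricing, not a real rate.

```python
def au_seconds(allocated_aus, duration_seconds):
    """AU Seconds consumed: allocated AUs multiplied by run time in seconds."""
    return allocated_aus * duration_seconds


def job_cost(allocated_aus, duration_seconds, price_per_au_second):
    """Estimated charge; price_per_au_second is a caller-supplied placeholder."""
    return au_seconds(allocated_aus, duration_seconds) * price_per_au_second


print(au_seconds(1, 1))        # 1 AU for 1 second
print(au_seconds(1, 100))      # 1 AU for 100 seconds
print(au_seconds(10, 5 * 60))  # 10 AUs for 5 minutes
```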
Question: Will my U-SQL job run faster if I allocate more AUs?
Increasing the number of AUs makes more compute resources available to a job and the job could run faster. However, depending on your job’s characteristics (e.g. how parallelizable it is, how much data it is processing etc.), you may not always see a proportional reduction in job execution time.
How should you decide the right number of AUs to assign to your job?
In order to decide upon the right number of AUs to assign to your U-SQL job, you need to consider the following:
- The characteristics of your job – Will your job benefit from the additional AUs? This may not be easy to determine when you run the job for the first time. For smaller data sets, we recommend starting by allocating 1 AU per 1 GB of input data. Besides the input data, the complexity of your computation also affects how many AUs the job can be parallelized across. We provide rich tools to help you understand and fine-tune the number of AUs. In the next blog post, we will walk you through how to use Azure Data Lake Tools for Visual Studio to choose an optimal number of AUs.
- Business requirements and budget - If your job can benefit from additional AUs then you need to consider your business scenarios and costs. Is your business willing to pay more for this job in order to have it run faster?
As an example, let’s consider the following scenarios:
- You run one job with 100 AUs and it lasts 1 hour. Your job will cost you the equivalent of 100*60 = 6,000 AU minutes.
- You run the same job with 1,000 AUs and, depending on its characteristics, it takes 6 minutes to complete (10X faster than before). In this case, you still pay for the same number of AU minutes, i.e. 1,000*6 = 6,000 AU minutes. This seems like a good deal and you should consider increasing the AUs for this job.
- It is also possible that when you run this job with 1,000 AUs, it takes 12 minutes to complete (5X faster than before). In that case, the cost of the job has doubled to 1,000*12 = 12,000 AU minutes. You now have to decide whether having this job run 5X faster is worth the 2X increase in cost.
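The trade-off in these scenarios can be computed directly. The following is a minimal sketch; the helper name `compare_runs` is hypothetical, and it assumes cost is proportional to allocated AUs multiplied by run time, as in the examples above.

```python
def compare_runs(baseline, candidate):
    """Compare two runs of the same job.

    Each run is a tuple of (allocated_aus, duration_minutes).
    Assumption: cost is proportional to AUs multiplied by minutes.
    """
    base_aus, base_minutes = baseline
    cand_aus, cand_minutes = candidate
    base_cost = base_aus * base_minutes
    cand_cost = cand_aus * cand_minutes
    return {
        "au_minutes": cand_cost,
        "speedup": base_minutes / cand_minutes,
        "cost_ratio": cand_cost / base_cost,
    }


baseline = (100, 60)  # 100 AUs for 1 hour
print(compare_runs(baseline, (1000, 6)))   # 10X faster at the same cost
print(compare_runs(baseline, (1000, 12)))  # 5X faster at twice the cost
```

A cost ratio close to 1.0 means the extra AUs are essentially free speed; a ratio well above 1.0 means you are paying a premium for the shorter run time.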
We hope that this blog post answers some of your basic questions about AUs, how they impact jobs, and how to decide on the number of AUs to assign to your job. In the very near future, we will post on more advanced topics such as job scheduling and resource usage diagnosis with ADL Tools for Visual Studio.
In the meantime, please reach out to us at @azuredatalake on Twitter.