SU Utilization Metric


To achieve low-latency stream processing, Azure Stream Analytics jobs perform all processing in memory. When a streaming job runs out of memory, it fails. For a production job, it’s therefore important to monitor the job’s resource usage and make sure enough resources are allocated to keep it running 24/7.
One important metric for monitoring resource usage is SU % Utilization. The metric is a percentage ranging from 0% to 100%. For a streaming job with a minimal footprint, the SU % Utilization metric is usually under 10%. It’s best to keep the metric below 80% to leave headroom for occasional spikes.


You can set an alert on the metric. For details, see
https://docs.microsoft.com/en-us/azure/monitoring-and-diagnostics/insights-alerts-portal
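
As a rough illustration of the 80% guideline, the following Python sketch flags utilization samples that cross the threshold. The function name and the sample values are hypothetical; in practice the samples would come from the SU % Utilization metric, and the alert itself is configured in the Azure portal rather than in code like this.

```python
# Minimal sketch: flag SU % Utilization samples above a threshold.
# The 80% threshold follows the guidance in this post; the names and
# sample values are illustrative assumptions, not part of any Azure API.

SU_UTILIZATION_THRESHOLD = 80.0  # percent


def samples_over_threshold(samples, threshold=SU_UTILIZATION_THRESHOLD):
    """Return the (index, value) pairs whose value exceeds the threshold."""
    for value in samples:
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"SU % Utilization must be within 0-100, got {value}")
    return [(i, v) for i, v in enumerate(samples) if v > threshold]


if __name__ == "__main__":
    hypothetical_samples = [8.0, 35.5, 79.9, 83.0, 91.2]
    for index, value in samples_over_threshold(hypothetical_samples):
        print(f"sample {index}: {value}% is above {SU_UTILIZATION_THRESHOLD}%")
```

A sample exactly at 80% is not flagged here; whether the boundary counts as a breach is a choice you would make when defining the alert.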

Several factors contribute to an increase in SU % Utilization.

    1. Stateful query logic
      One of the unique capabilities of an Azure Stream Analytics job is stateful processing, such as windowed aggregates, temporal joins, and temporal analytic functions. Each of these operators keeps some state.

      • The state size of a windowed aggregate is proportional to the number of groups (the cardinality) in the GROUP BY operator.
        For example, in the query
          SELECT count(*) FROM input GROUP BY clusterid, tumblingwindow(minutes, 5)
        the number of distinct clusterid values is the cardinality of the query.
        To mitigate issues caused by high cardinality, send events to Event Hub partitioned by clusterid, and scale out the query by letting the system process each input partition separately using PARTITION BY, as shown:
          SELECT count(*) FROM input PARTITION BY PartitionId GROUP BY PartitionId, clusterid, tumblingwindow(minutes, 5)
        Once the query is partitioned, it is spread over multiple nodes. As a result, the number of clusterid values coming into each node is reduced, which reduces the cardinality of the GROUP BY operator.
        Event Hub partitions should be partitioned by the grouping key to avoid the need for a reduce step. Additional details are covered at https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-overview.
      • The state size of a temporal join is proportional to the number of events in the temporal wiggle room of the join, which is the event input rate multiplied by the wiggle room size.
        The number of unmatched events in the join affects the memory utilization of the query. The following query looks for the ad impressions that generate clicks:
          SELECT clicks.id FROM clicks INNER JOIN impressions ON impressions.id = clicks.id AND DATEDIFF(hour, impressions, clicks) BETWEEN 0 AND 10
        It is possible that many ads are shown and few people click on them, yet all the events in the time window must be kept. Memory consumed is proportional to the window size and the event rate.
        To remediate this, send events to Event Hub partitioned by the join key (id in this case), and scale out the query by letting the system process each input partition separately using PARTITION BY, as shown:
          SELECT clicks.id FROM clicks PARTITION BY PartitionId INNER JOIN impressions PARTITION BY PartitionId ON impressions.PartitionId = clicks.PartitionId AND impressions.id = clicks.id AND DATEDIFF(hour, impressions, clicks) BETWEEN 0 AND 10

        Once the query is partitioned, it is spread over multiple nodes. As a result, the number of events coming into each node is reduced, which reduces the size of the state kept in the join window.

      • The state size of a temporal analytic function is proportional to the event rate multiplied by the duration.
        The remediation is similar to that for a temporal join: scale out the query using PARTITION BY.
    2. Out of order buffer
      You can configure the out-of-order buffer size in the Event Ordering configuration pane. The buffer holds inputs for the duration of the window and reorders them. The size of the buffer is proportional to the event input rate multiplied by the out-of-order window size. The default window size is 0.
      To remediate this, scale out the query using PARTITION BY. Once the query is partitioned, it is spread over multiple nodes. As a result, the number of events coming into each node is reduced, which reduces the number of events in each reorder buffer.
    3. Input partition count
      Each input partition of a job input has a buffer. The larger the number of input partitions, the more resources the job consumes. For each SU, Azure Stream Analytics can process roughly 1 MB/s of input, so you may want to match the number of ASA SUs with the number of partitions of your Event Hub. Typically, a 1 SU job is sufficient for an Event Hub with 2 partitions (the minimum for Event Hub). If the Event Hub has more partitions, your ASA job consumes more resources but does not necessarily use the extra throughput provided by Event Hub. For a 6 SU job, you may need 4 or 8 partitions from the Event Hub. Using an Event Hub with 16 or more partitions in a 1 SU job often contributes to excessive resource usage and should be avoided.
    4. Reference data
      Reference data in ASA are loaded into memory for fast lookup. With the current implementation, each join operation with reference data keeps a copy of the reference data in memory, even if you join with the same reference data multiple times. For queries with Partition By, each partition has a copy of the reference data, so the partitions are fully decoupled. With the multiplier effect, memory usage can quickly get very high if you join with reference data multiple times with multiple partitions.
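
The proportionalities in the list above can be turned into back-of-the-envelope estimates. The Python sketch below is only that: the function names, the even spread of events across partitions, and the example numbers are assumptions made for illustration, and none of it reflects how Azure Stream Analytics accounts for memory internally.

```python
# Rough estimates for the factors listed above. Assumptions (not from
# Azure documentation): events spread evenly across partitions, and the
# example rates and sizes are hypothetical.
import math


def buffered_events(events_per_second, window_seconds):
    """Events held in state: input rate multiplied by window size."""
    return events_per_second * window_seconds


def buffered_events_per_partition(events_per_second, window_seconds, partitions):
    """Per-partition state after PARTITION BY, assuming an even spread."""
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    total = buffered_events(events_per_second, window_seconds)
    return math.ceil(total / partitions)


def reference_data_copies(reference_joins, partitions):
    """Copies kept in memory: one per reference-data join per partition."""
    return reference_joins * partitions


if __name__ == "__main__":
    rate = 1000            # hypothetical events per second
    window = 10 * 3600     # the 10-hour join window from the example query
    print("events in join window:", buffered_events(rate, window))
    print("per partition (8 partitions):",
          buffered_events_per_partition(rate, window, 8))
    copies = reference_data_copies(reference_joins=3, partitions=8)
    print("reference data copies:", copies, "->", copies * 50, "MB at 50 MB each")
```

The same rate-times-window estimate applies to the out-of-order buffer and to temporal analytic functions, with the out-of-order window or the function's duration in place of the join window.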

When tuning your jobs to reduce SU % Utilization, or deciding how many SUs to use for a job, consider the factors above.
At this time, tuning is largely a trial-and-error process. You can run the job with typical input and examine the SU % Utilization metric to find out whether the number of SUs allocated is sufficient. If 6 SUs still don’t meet your needs, consider partitioning your query with PARTITION BY as illustrated above, so that the partitions can be distributed. For partitioned queries, it’s recommended to use a multiple of 6 SUs for the job, where the multiplier is the number of partitions in the Event Hub input. It’s possible to use a smaller multiplier, but the buffers for each partition may add memory pressure. You will need to experiment and see what works best for your job. For more information on how to scale out jobs, see
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-scale-jobs
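
The sizing rule for partitioned queries is simple arithmetic, sketched below in Python. The helper name and its parameters are hypothetical; the only inputs taken from this post are the 6 SU unit and the use of the Event Hub partition count as the default multiplier.

```python
# Sketch of the "multiple of 6 SUs" guidance for partitioned queries.
# recommended_su is a hypothetical helper, not an Azure API.

SU_UNIT = 6  # SUs per multiplier step, per the guidance above


def recommended_su(event_hub_partitions, multiplier=None):
    """Return SU_UNIT times the multiplier.

    The multiplier defaults to the number of Event Hub partitions. A
    smaller multiplier is allowed, but per-partition buffers may then
    add memory pressure.
    """
    if event_hub_partitions < 1:
        raise ValueError("event_hub_partitions must be at least 1")
    if multiplier is None:
        multiplier = event_hub_partitions
    if not 1 <= multiplier <= event_hub_partitions:
        raise ValueError("multiplier must be between 1 and the partition count")
    return SU_UNIT * multiplier


if __name__ == "__main__":
    print(recommended_su(4))                 # full multiplier for 4 partitions
    print(recommended_su(8, multiplier=2))   # smaller multiplier, to be validated by experiment
```

Treat the result as a starting point for the experimentation described above, and confirm it by watching SU % Utilization under typical input.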
