As Microsoft adds support for both proprietary and open source technologies for processing and analyzing streaming data, customers have been asking us how to choose between Spark Streaming and Azure Stream Analytics. Here is my perspective on this subject, given my close involvement in the development and adoption of both technologies.
Let’s first start with the technologies.
Most big data analytics systems consist of three layers:
- Programming model, or language
- On node runtime
- Distributed runtime
Programming model
Spark Streaming’s programming model is the programmatic construction of a physical processing plan, in the form of a DAG. It’s inspired by DryadLINQ (http://research.microsoft.com/en-us/projects/DryadLINQ/), a Microsoft Research project. Custom expressions are very natural in this programming model, because the code describing the DAG and the custom expressions are written in the same language, in the same program.
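To make that concrete, here is a minimal pure-Python sketch (hypothetical, not Spark's actual API) of what "building a processing plan as a DAG in the host language" means; the custom expressions are ordinary lambdas defined in the same program as the plan:

```python
# Hypothetical sketch (not Spark's API) of constructing a processing plan
# as a DAG in the host language. Custom expressions are plain functions
# defined inline, in the same program that builds the plan.

class Node:
    def __init__(self, fn, parent=None):
        self.fn = fn          # transformation applied to each batch
        self.parent = parent  # upstream node in the DAG

    def map(self, fn):
        # Lazily append a map stage; nothing runs until execute().
        return Node(lambda batch: [fn(x) for x in batch], parent=self)

    def filter(self, pred):
        return Node(lambda batch: [x for x in batch if pred(x)], parent=self)

    def execute(self, batch):
        # Walk the DAG from source to this node, applying each stage.
        if self.parent is not None:
            batch = self.parent.execute(batch)
        return self.fn(batch)

source = Node(lambda batch: batch)  # identity source node
# Custom expressions (the lambdas) live right next to the plan construction.
plan = source.map(lambda x: x * 2).filter(lambda x: x > 4)
print(plan.execute([1, 2, 3]))  # [6]
```

The point of the sketch is the lazy construction: `map` and `filter` only extend the DAG, and nothing runs until the plan is executed.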
Spark Streaming introduced support for application time very recently, in the 2.0 release (link), and it is restricted to windowed aggregates. More advanced operators such as temporal joins and analytic functions still need to be built by users on top of updateStateByKey or the new mapWithState operator. Both are very low-level operators that leave the burden of implementing the desired temporal logic on the user: they only deliver raw streaming events and maintain state on the user’s behalf. The Azure Stream Analytics team, in collaboration with the Office 365 Custom Fabric team, made an attempt to implement Azure Stream Analytics-like operators on Spark Streaming 1.4. Only reorder, windowed aggregates, and temporal joins are implemented at this time. The work was presented at Spark Summit 2016 (link). Our goal, in some ways, is to bring the simplicity of Azure Stream Analytics to Spark Streaming and to evangelize Azure Stream Analytics’ temporal processing abstractions.
Implementing these operators is not trivial, because it’s not just the functional aspects one needs to consider; memory usage and scalability must also be taken into account. In essence, you are programming at a very low level, but if you want to fully control the processing behavior, or the operator you need doesn’t exist (e.g. session window), that’s what you have to do. Azure Stream Analytics doesn’t give you that level of control today. During Spark Summit 2016, more details about Spark Structured Streaming were revealed. The new Spark Streaming programming model is even more unified with batch and interactive processing, so Spark SQL can be used to express stream processing as well. However, these new functionalities are still not considered production ready by Databricks. There are outstanding issues such as state cleanup (to prevent the state size from growing forever) and partial result generation/aggregation for windowed aggregates, where more controls have to be exposed to users to enable the desired stream processing behavior.
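To illustrate the burden these low-level operators leave on the user, here is a pure-Python sketch, in the spirit of updateStateByKey rather than actual Spark code, of a hand-rolled tumbling-window count. Window assignment and, crucially, state cleanup are entirely the user's responsibility:

```python
# Hypothetical sketch of what users must hand-roll on top of a low-level
# stateful operator (in the spirit of updateStateByKey): per-window state,
# window assignment, and explicit state cleanup are all user code.

WINDOW = 60  # tumbling window length, in seconds of application time

def update(state, events):
    """state: {window_start: count}; events: list of (timestamp, value)."""
    for ts, _ in events:
        win = ts - ts % WINDOW          # assign the event to its window
        state[win] = state.get(win, 0) + 1
    return state

def cleanup(state, watermark):
    # The user, not the engine, decides when a window's state can be freed.
    return {w: c for w, c in state.items() if w + WINDOW > watermark}

state = {}
state = update(state, [(5, "a"), (61, "b"), (70, "c")])
print(state)                        # {0: 1, 60: 2}
state = cleanup(state, watermark=100)
print(state)                        # {60: 2} -- window [0, 60) is dropped
```

Even this toy version has to answer the hard questions the prose mentions: how much state to retain, and when late data stops being accepted.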
On node runtime
Because Spark Streaming doesn’t provide many temporal operators out of the box, there is really no temporal runtime in use. In the 2.0 release, Spark is moving towards sharing the same runtime as the batch operators through Structured Streaming, so the same query optimizer (Catalyst) and the same processing engine (Tungsten) can be used.
Azure Stream Analytics, being a streaming-only solution, uses an on-node engine highly tuned for stream processing called Trill (link), which is the fruit of many years of Microsoft Research work and of lessons learned from a production CEP engine, StreamInsight (shipped with SQL Server). The design points of Spark Streaming and Azure Stream Analytics start to diverge here (you will see more divergence in the distributed runtime).
Spark Streaming and Azure Stream Analytics are not targeting exactly the same space. Spark covers the spectrum of batch, streaming, and interactive processing. One of the design decisions the Spark team made was to use the same underlying infrastructure to support all three usage patterns.
Azure Stream Analytics only targets stream processing and stream analytics scenarios at this time, so its design and the technology it is built on are very much optimized for stream processing. For example, to enable the use of application time, Azure Stream Analytics first sorts the incoming events by application time using a reorder window; subsequent temporal operators then run on the temporally sorted events. The implementation and semantics are much cleaner, and the amount of memory required for state is much smaller. For instance, if you are computing aggregates at a one-minute interval for one million groups, Azure Stream Analytics only needs to keep one million aggregates in memory; Spark Structured Streaming, on the other hand, keeps historical aggregates in memory in order to handle events arriving out of order. Today, because no state cleanup policy is allowed, all historical aggregates have to be kept in memory forever, which won’t work in a production system that keeps running for a long period of time. Another example is left outer join, which doesn’t exist in Spark Streaming yet; if it did, the NULL output of the left outer join would need to be either delayed or retracted when a late event arrives that creates a match for the join condition.
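The memory advantage of sorting first can be sketched in plain Python (illustrative semantics only, not Azure Stream Analytics' implementation): with a temporally sorted input, a window's aggregate can be emitted and its state freed the moment input time passes the window boundary, so memory stays proportional to the number of active groups rather than to history:

```python
# Sketch (assumed semantics, not ASA internals) of aggregating a temporally
# sorted stream: once input time passes a window boundary, the window's
# aggregate is emitted and its state freed immediately.

WINDOW = 60  # tumbling window length in application-time units

def aggregate_sorted(events):
    """events: (timestamp, group) pairs in non-decreasing timestamp order."""
    current_win, counts, out = None, {}, []
    for ts, group in events:
        win = ts - ts % WINDOW
        if current_win is not None and win > current_win:
            out.append((current_win, dict(counts)))  # window closed: emit...
            counts.clear()                           # ...and free its state
        current_win = win
        counts[group] = counts.get(group, 0) + 1
    if counts:
        out.append((current_win, dict(counts)))      # flush the open window
    return out

print(aggregate_sorted([(1, "a"), (2, "b"), (61, "a")]))
# [(0, {'a': 1, 'b': 1}), (60, {'a': 1})]
```

With out-of-order input, `counts.clear()` would be unsafe: a late event could still land in an already-emitted window, which is exactly why an engine without a cleanup policy ends up retaining every historical window.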
Upfront reordering in Azure Stream Analytics ensures that an operator has already seen all events from the past, so output can be emitted right away. When it comes to implementing temporal analytic functions (e.g. the Last function), out-of-order processing can quickly make the algorithm intractable, because the temporal logic depends strongly on the order of the events. This is especially true if you are searching for a temporal pattern in the events. Upfront reordering is often necessary even to ensure the correctness of the implementation. The downside of this approach is that the whole processing pipeline is delayed by the duration of the reorder window, and additional memory is needed to sort the events.
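A reorder window can be sketched roughly as a buffer that holds events until no earlier event can still arrive. This is a simplified model; the `DELAY` constant and the late-event policy here are assumptions for illustration, not Azure Stream Analytics' actual behavior:

```python
# Simplified sketch of a reorder window: buffer events for up to DELAY
# units of application time behind the newest event seen, then release
# them in timestamp order. Events later than that would have to be
# dropped or have their timestamps adjusted (policy not shown).

import heapq

DELAY = 5  # reorder window length (assumed application-time units)

class Reorderer:
    def __init__(self):
        self.heap = []    # min-heap keyed on event timestamp
        self.max_ts = 0   # highest timestamp seen so far

    def push(self, ts, payload):
        """Insert one event; return events now safe to emit, in order."""
        heapq.heappush(self.heap, (ts, payload))
        self.max_ts = max(self.max_ts, ts)
        out = []
        # An event is safe once no earlier event can still arrive.
        while self.heap and self.heap[0][0] <= self.max_ts - DELAY:
            out.append(heapq.heappop(self.heap))
        return out

r = Reorderer()
print(r.push(10, "a"))  # [] -- nothing old enough to release yet
print(r.push(3, "b"))   # [(3, 'b')] -- late event released, still in order
print(r.push(16, "c"))  # [(10, 'a')]
```

This makes both downsides from the paragraph visible: every event waits up to `DELAY` before it can be emitted, and the heap holds up to a reorder window's worth of events in memory.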
Personally, I think the RDD abstraction, being the cornerstone of Spark, has a weakness: it cannot be optimized for a specific usage pattern. At the same time, the RDD abstraction has driven tremendous growth for Spark, because of the promise that once code is written to operate on RDDs, it can be used for all three usage patterns. RDDs created the platform where all application-level logic converges. For example, MLlib can be used for both batch and streaming. This is unlike Azure Stream Analytics, which has to call out to the Azure Machine Learning service explicitly to perform scoring. It will be interesting to see how much more mileage they get out of it, as different optimizations need to be applied to different usage patterns.
Distributed runtime
At the distributed runtime layer, Spark’s core processing unit is the RDD. Everything boils down to processing one RDD after another. Stream processing is modeled the same way: events within a fixed wall-clock duration are captured in an RDD and processed by the pipeline. As a result, it’s difficult to apply back pressure when there aren’t enough compute resources; the input reader keeps producing RDDs even when the rest of the pipeline cannot keep up. Also, because the wall-clock duration of each RDD is fixed (set by the user), the end-to-end latency cannot be shorter than that duration. However, if you shorten the duration too much (e.g. under 1 second), the number of RDDs scheduled for processing will overwhelm Spark’s scheduler. In my testing, the system could keep up with a 10-second duration just fine, but anything under that caused the event processing backlog to grow, even with very minimal processing logic.
In order to maximize throughput, Azure Stream Analytics also processes events in batches. However, the batch size is variable: when more events are queued, events are processed in larger batches; when fewer events are queued, smaller batches are processed to achieve lower latency. There is no scheduler for such batches in Azure Stream Analytics; it’s just a GetBatch/Process loop, so the overhead is very small. At the distributed runtime layer, Azure Stream Analytics uses a generalized anchor protocol that is offset based instead of time based. This allows multiple timelines in the event streams being processed, which is a useful concept for IoT scenarios because the clocks on the devices can be significantly out of sync. Spark Streaming doesn’t have that capability at this time; you would have to drop down to the lower-level operators and do it with custom code, if it’s possible at all.
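The variable-batch loop can be sketched as follows (assumed behavior, not Azure Stream Analytics internals): each iteration simply drains whatever is queued, so batches grow under load and shrink toward single events when traffic is light, with no scheduler involved:

```python
# Sketch (assumed behavior, not ASA internals) of a GetBatch/Process loop
# with variable batch sizes: drain whatever is queued each iteration, up
# to a cap, so batch size adapts to load without any scheduler.

from collections import deque

queue = deque()  # stands in for the input event queue

def get_batch(max_batch=1000):
    """Take everything currently queued, up to a cap."""
    batch = []
    while queue and len(batch) < max_batch:
        batch.append(queue.popleft())
    return batch

def run_once():
    batch = get_batch()
    # A real engine would run the query pipeline over the whole batch here;
    # this sketch just reports the batch size chosen by the loop.
    return len(batch)

queue.extend(range(5000))   # heavy load: the loop picks large batches
print(run_once())           # 1000
queue.clear()
queue.append("one event")   # light load: tiny batch, hence low latency
print(run_once())           # 1
```

Compare this with the fixed micro-batch duration above: here the batch boundary is determined by queue depth, not by a wall-clock timer, so there is no fixed floor on latency.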
For processing resiliency against node failures, the newest Spark Structured Streaming uses a write-ahead log (WAL) to track all state changes. Azure Stream Analytics, on the other hand, writes a full state checkpoint at a fixed interval. Each approach works well for some scenarios, but not necessarily for every scenario. There is a Microsoft Research paper discussing the various recovery strategies for your reference (link).
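The two recovery strategies can be contrasted with a toy example (illustrative only): a write-ahead log recovers by replaying every logged change, while periodic full checkpoints recover from the latest snapshot plus only the events that followed it:

```python
# Toy comparison (illustrative only) of two recovery strategies for
# stateful stream processing: a write-ahead log vs. periodic full
# state checkpoints. Both recover the same state by different means.

import copy

def apply(state, event):
    key, n = event
    state[key] = state.get(key, 0) + n
    return state

events = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

# Strategy 1: WAL -- log each change, recover by replaying the whole log.
wal = list(events)
recovered_wal = {}
for e in wal:
    apply(recovered_wal, e)

# Strategy 2: full checkpoint -- snapshot the state every 2 events,
# recover from the last snapshot plus the events that followed it.
state, checkpoint, checkpoint_pos = {}, {}, 0
for i, e in enumerate(events):
    apply(state, e)
    if (i + 1) % 2 == 0:
        checkpoint, checkpoint_pos = copy.deepcopy(state), i + 1
recovered_ckpt = copy.deepcopy(checkpoint)
for e in events[checkpoint_pos:]:
    apply(recovered_ckpt, e)

print(recovered_wal == recovered_ckpt)  # True: both recover the same state
```

The trade-off the paragraph alludes to shows up even here: the WAL writes a small record per change but replays a long log on recovery, while checkpoints write large snapshots but replay only the tail.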
It is definitely possible for Spark Streaming to become more sophisticated over time and overcome many of the shortcomings I’ve mentioned above. With the proper level of investment, I can see that happening. At the same time, Azure Stream Analytics is also making significant strides as it grows into a highly mature managed service: improving reliability and the debugging experience, and enabling users to complete more end-to-end scenarios out of the box. For example, Power BI output is built in for dashboard scenarios, and geospatial support is in private preview for connected car scenarios. We are also going to add an interactive event exploration experience (currently in private preview) to help users understand their data in order to write meaningful Azure Stream Analytics queries.
Ultimately, the choice of technology depends on your scenario, but hopefully the analysis above helps you get started on the right foot!