Introduction to Apache Storm

The Apache Storm project delivers a platform for real-time, distributed (complex event) processing of extremely large-volume, high-velocity data sets. By providing a simple, easy-to-use abstraction, Storm enables real-time analytics, online machine learning and operational/ETL scenarios that were previously non-trivial to implement. In this post we will familiarize ourselves with the Storm platform, its architecture and the terminology you need to be successful. This post sets the stage for a new blog series that will walk through creating a real-time ETL project based on Storm.

History & Storm Use Cases

Apache Storm was originally developed by Nathan Marz and the BackType team. BackType was later acquired by Twitter, which open-sourced the project. Storm is designed to be fast (benchmarked at one million 100-byte messages per second per node on commodity hardware), scalable and reliable over unbounded streams of data. It is also easy to use, as it supports a number of different programming languages (Java, Python, Ruby, JavaScript and C#). These characteristics make it a perfect platform for handling distributed real-time event processing. Some of the most common real-time processing scenarios currently in place are:

  1. Financial fraud detection
  2. Network outage and/or intrusion detection
  3. Preventative maintenance in manufacturing and transportation
  4. Application monitoring
  5. Customization or personalization of web content

Architecture

Apache Storm follows a master/slave architecture in a manner similar to Hadoop. The architecture consists of three different types of nodes, each playing a distinct role in delivering at-least-once guaranteed message processing (exactly-once message processing is also supported through an abstraction called Trident). The node types are described below, followed by a sketch of the configuration that ties them together.

[Figure: Storm cluster architecture]

These nodes are:

  • Nimbus (the master node) – a single daemon process responsible for deploying, managing, coordinating and monitoring topologies running on the cluster. The master node assigns or reassigns all work when failures occur and is also responsible for distributing any required files to the supervisor nodes.
  • ZooKeeper nodes – in a distributed environment, centralized coordination of events and state is critical. Apache ZooKeeper fills this role within the architecture, storing state information such as work assignments and job status shared between the Nimbus and Supervisor nodes.
  • Supervisors (the worker nodes) – daemon processes that spawn new tasks and monitor the status of workers. When Nimbus assigns work to a Supervisor, the Supervisor spawns one or more worker JVM processes to execute it.
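For concreteness, a minimal storm.yaml along these lines is what ties the three node types together on each machine. This is only a sketch: the host names are placeholders, and the key names follow 0.9.x-era Storm releases (newer releases use nimbus.seeds in place of nimbus.host).

```yaml
# storm.yaml - minimal sketch; host names are placeholders
storm.zookeeper.servers:        # ZooKeeper ensemble used for coordination
  - "zk1.example.local"
  - "zk2.example.local"
nimbus.host: "nimbus.example.local"   # the master node
storm.local.dir: "/var/storm"         # local working directory for the daemons
supervisor.slots.ports:               # one worker JVM slot per port on each Supervisor
  - 6700
  - 6701
  - 6702
  - 6703
```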

Terminology

Getting started with Storm requires that you familiarize yourself with the terminology you’ll encounter as you begin building a distributed stream processing solution.

Topologies

A topology is analogous to a job in a more traditional batch processing system. Topologies are made up of data streams, spouts and bolts, are deployed to and run by an Apache Storm cluster, and run indefinitely until killed. Topologies can be defined as either transactional (a strictly defined ordering of events) or non-transactional.
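Although this series will eventually use C#, a rough sketch using Storm's native Java API helps make the moving parts concrete. The spout and bolt class names are placeholders (the spout and counting bolt are sketched in the sections that follow; SentimentBolt is purely hypothetical), and the package names follow current Apache Storm releases (0.9.x-era releases used backtype.storm instead of org.apache.storm).

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class TweetTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // A spout that reads raw tweets from a queue (sketched later in this post).
        builder.setSpout("tweet-spout", new TweetQueueSpout(), 1);

        // Bolts subscribe to streams from spouts or from other bolts.
        // SentimentBolt is hypothetical and assumed to emit a tuple containing a "hashtag" field.
        builder.setBolt("sentiment-bolt", new SentimentBolt(), 2)
               .shuffleGrouping("tweet-spout");
        builder.setBolt("hashtag-count-bolt", new HashtagCountBolt(), 2)
               .fieldsGrouping("sentiment-bolt", new Fields("hashtag"));

        Config conf = new Config();
        conf.setNumWorkers(2); // worker JVMs requested from the Supervisors

        // Deploy the topology to the cluster; it runs until it is explicitly killed.
        StormSubmitter.submitTopology("tweet-topology", conf, builder.createTopology());
    }
}
```

Once submitted, Nimbus distributes the packaged topology to the Supervisors, which keep it running continuously until it is killed.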

Data Streams

An unbounded stream of tuples (a collection of key/value pairs), which serve as the most basic data structure within Storm.
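As a small illustration (the field names are arbitrary), the schema of a stream is declared as a Fields object and each emitted tuple is a Values object whose entries line up positionally with that schema:

```java
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class TupleSketch {
    public static void main(String[] args) {
        // The field names a component declares for its output stream.
        Fields schema = new Fields("hashtag", "count");

        // The values of a single tuple, matching the declared fields by position.
        Values tuple = new Values("#storm", 1);

        System.out.println(schema.get(0) + " = " + tuple.get(0)); // hashtag = #storm
        System.out.println(schema.get(1) + " = " + tuple.get(1)); // count = 1
    }
}
```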

Spouts

The entry point into a Storm topology. A spout connects to a source of data, such as a database or a service bus event queue, transforms the data into tuples and then emits those tuples for consumption and processing by one or more bolts.
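Below is a bare-bones sketch of the tweet-reading spout referenced in the topology example above. The queue client is imaginary, and a production spout would also implement ack and fail so that failed tuples can be replayed.

```java
import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class TweetQueueSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        // Connect to the external source here, e.g. a Service Bus queue client.
    }

    @Override
    public void nextTuple() {
        // Pull the next raw message from the source (hypothetical helper).
        String tweetJson = pollQueue();
        if (tweetJson != null) {
            // Emit a one-field tuple for downstream bolts to consume.
            collector.emit(new Values(tweetJson));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("tweet"));
    }

    private String pollQueue() {
        return null; // placeholder; a real implementation would read from the queue
    }
}
```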

Bolts

Responsible for transforming and processing data within your topology. Bolts can receive one or more data streams as input, from spouts or even from other bolts, and can produce zero, one or more data streams as output. Common activities within bolts include calculations, filtering, aggregations, joins and writes to external stores such as a traditional relational or NoSQL database, or even triggering a connected application event through web sockets or SignalR.

When taken together, these pieces allow you to easily implement a range of real-time processing systems. For example, suppose we captured tweets (Twitter messages) and stored them on a Service Bus queue. A simple topology would start with a spout reading the tweets from the queue and emitting them into the data stream as tuples. One bolt could store the raw tweet to Azure Blob Storage for later consumption by a Hadoop cluster, while a second bolt could add value to the tweets by determining each tweet’s sentiment. Aggregation could then be done on some arbitrary field such as hashtag or location, and that aggregated data could be pushed to an HBase database as well as to a real-time dashboard built using SignalR. A visual representation of this topology is shown below.

[Figure: Example tweet-processing topology]
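Here is a sketch of the aggregation step from that topology, written as a simple counting bolt. BaseBasicBolt acknowledges tuples automatically, and the "hashtag" input field is assumed to be emitted by an upstream bolt (such as the hypothetical SentimentBolt above).

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class HashtagCountBolt extends BaseBasicBolt {
    // Running counts; the fields grouping routes a given hashtag to the same bolt instance.
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String hashtag = input.getStringByField("hashtag");
        int count = counts.merge(hashtag, 1, Integer::sum);

        // Emit the updated count; another bolt could push this to HBase or a dashboard.
        collector.emit(new Values(hashtag, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("hashtag", "count"));
    }
}
```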

Getting Started

So you’ve made it through the overview and you’re itching to get started, but you noticed that I never discussed what it takes to set up a Storm cluster. That was not an oversight; it was deliberate, since Storm clusters can be easily provisioned as a service within Microsoft Azure. If you are familiar with provisioning an HDInsight cluster, the process is nearly identical. For those who are not familiar with Hadoop, you will find the Storm option under Data Services and HDInsight within the add menu. Using the quick create option, you can specify a cluster name, size, subscription, storage account and an admin password, and be off and running.

Wrap-Up

This post laid the foundation by introducing Apache Storm. We reviewed the basic architecture and discussed the common terminology you need to know to get started. Over the next couple of posts we will look at the steps necessary to build and deploy an example topology using C#. Till next time!

Chris