Secundus Posting (of sorts)

It’s been a while since there’s been any new content on here (well, let’s be honest, a very long time).  My role at Microsoft has shifted a bit in the last year, going down a slightly different direction and focusing on a new technology area.  I shifted over to covering StreamInsight on the SQL Customer Advisory Team, focusing on helping customers and partners with building large scale event driven applications.

After working on some incredible projects, and with the launch of StreamInsight as part of SQL Server 2008 R2, finally carving out enough time to talk about some of the best practices we’ve learned through launching the product, and share some (hopefully) valuable information about how to roll out your own StreamInsight implementations.

So, now that I’ve rambled on about a role change, let’s set the stage for the things of things that will wend their way to the web through this humble blog:

What is StreamInsight?

Let’s think about the wild world of analytics, and the value of impatience with an analogy.  Anybody who has ever been subjected to a presentation from me is familiar with my love of a good analogy (well, love of a really bad analogy, let’s be honest here), so this blog won’t really be much different.  Note: this scenario is completely contrived; we’ll get to the real ones later.

Suppose I had to answer a question “how many red cars are in the parking lot behind your office”.  Let’s look at how we’d answer this question, both in a physical manner, and how this problem would be addressed in a relational database:

Physical Approach

Relational Database

  • Walk out to the parking lot, with a clipboard and a pen.
  • Walk through the parking lot, and look at each vehicle
  • If the vehicle is a passenger car and is red, tick off on my clipboard.
  • Count up all of the ticks, head back into the office.
  • Connect to the target database
  • Execute some SQL command such as SELECT COUNT(*) FROM ParkingLot WHERE type = ‘AUTO’ AND color = ‘RED’

Seems pretty straightforward.  We can think about optimizing these queries by doing things such as parking all of the cars on one side of the lot, then parking them by color (or, in the relational world, adding an index).  For this class of questions, we have a great platform for delivering answers.  However, there’s another class of problems that aren’t well addressed by relational data engines.

Suppose I were asked a slightly different questions – how many cars have passed your office on the highway in the last 20 minutes.  If we were to approach answering this problem in a pure relational sense our analogy would be:

Physical Approach

Relational Database

  • Pull all of the cars off of the highway into a holding lot.
  • Leave each car there for at least twenty minutes, and keep track of who is allowed to leave.
  • Wander around counting up all of the cars every time one enters or leaves.
  • Load the incoming car records into the database.
  • Continuously execute some SQL command such as SELECT COUNT(*) FROM ParkingLot WHERE type = ‘AUTO’ AND color = ‘RED’
  • Have a separate job that periodically evicts cars (records) from the database.

All of a sudden this doesn’t seem like the most efficient way to provide the answer in either realm.  This is where StreamInsight steps in to help fill the gap – by providing a stream processing platform with the capabilities to address delivering insight about streaming data at scale.  In the StreamInsight world our new analogy is:

Physical Approach

StreamInsight

  • Stand by the side of the road, counting cars as they go past.
  • Set up a continuously running standing query along the lines of the LINQ statement below.
  • Use the StreamInsight engine to connect this query to an incoming stream of Vehicle observation events.
 var x = from e in cars
    where e.Type == "AUTO" && e.Color == "RED"
    select e;
 
var count = from e in x.HoppingWindow(
    TimeSpan.FromMinutes(2), TimeSpan.FromSeconds(30),
    HoppingWindowOutputPolicy.ClipToWindowEnd)
select new
{
    count = e.Count()
};

The LINQ statement above is a StreamInsight query that asks the question:

  • Look at a stream of data containing observations about vehicles (the cars stream).
  • Filter out any events that aren’t AUTO (mobiles) and painted RED.
  • Create a window in time, look at the last 2 minutes, and advance the window every 30 seconds.
  • In each window, count the number of events (i.e. the number of red cars in that 2 minute window), and report the aggregate.

Err.. wait.. how does this LINQ stuff magically solve my problem of observing data in flight?  Well, that’s some of the magic of StreamInsight.  It provides a platform for:

  • In-memory stream processing engine.
  • Connecting streaming data sources (adapters) to continuously running standing queries (queries).
  • Asking questions about temporal and relational data (windows in time, relationships in time), including dynamically chaining and composing queries.
  • Extensibility for queries, operators, aggregates and data connectivity.

What is it good for?

StreamInsight is designed to enable you to develop powerful event processing applications at scale.  This lends itself to a range of scenarios across various industries, such as:

Manufacturing
  • Process analytics, from plant floor sensors.  React through device controllers upon detecting exceptions, trends, etc.
Financial Services
  • Stock and news feeds, algorithmic trading.
Operational Analytics
  • Clickstream data, behavior analytics, user experience quality.
Utilities
  • Smart grid analysis, millions of power meters.  Identify trends, usage patterns, potential error conditions.

Basically, anything where you have large volumes of streaming data and need to process results with low latency

Sounds cool, how do I get started?

To get started you’ll want to be familiar with C# and LINQ, and have a copy of Visual Studio (2008 or 2010).  Once you’ve got the baseline in place:

  1. Download and install StreamInsight from the StreamInsight product page (or MSDN/TechNet if you’re a subscriber)
    1. 64-bit
    2. 32-bit
  2. Download the reference samples from codeplex onto your local hard drive.
    1. Since codeplex downloads (along with most internet downloads) are not trusted by default, you’ll need to “Unblock” the .zip file (best to do this before unzipping, doing it file by file is grossly annoying).
    2. Right click on the StreamInsightProductTeamSamples zip file, then click on Properties.
      image
    3. Click on Unblock to “trust” the file.
  3. From Visual Studio, open up the Applications\TrafficJoinQuery\TrafficJoinQuery.sln solution.
  4. This application uses simulated traffic data (from two .csv files contained in the solution), with the query listed in the CreateQueryTemplate function (in Program.cs).
  5. Press F5 to debug the application, and execute the query against the two simulated data streams.  This will result in output similar to the following:

Creating CEP Server Creating CEP Application Registering LINQ query template Registering Adapter Factories Registering bound query Start query INSERT 6/25/2009 12:00:00 AM +00:00 1 1001 18 INSERT 6/25/2009 12:00:00 AM +00:00 1 1003 19   

-- snip --

NSERT 6/25/2009 12:18:20 AM +00:00 2 1004 21 INSERT 6/25/2009 12:18:20 AM +00:00 2 1005 19 INSERT 6/25/2009 12:18:20 AM +00:00 2 1006 21 INSERT 6/25/2009 12:18:20 AM +00:00 3 1007 22 CTI 12/31/9999 11:59:59 PM +00:00

Diagnostic View for 'cep:/Server/EventManager': AllEventsCount: 18 AllEventsMemory: 417792 Diagnostic View for 'cep:/Server/PlanManager':

-- snip --

QueryTotalIncomingEventCount: 497 QueryTotalConsumedEventCount: 497 QueryTotalProducedEventCount: 192 QueryTotalOutgoingEventCount: 192 -- snip --

Showing the StreamInsight engine starting up, activating the query, pumping some output, then dumping diagnostic information.

 

Where else can I learn about this stuff?

I’ll be pumping out a series of blog posts over the coming months addressing specific aspects of building StreamInsight applications, including adapter development, query authoring, hosting StreamInsight, connecting with other technology platforms (such as AppFabric Cache).  Just to kick things off in the interim here’s a few great links: