“Cloud Numerics” Example: Analyzing Air Traffic “On-Time” Data

 

You sit at the airport only to watch your departure time get delayed. You wait. Your flight gets delayed again, and you wonder “what’s happening?” Can you predict how long it will take to arrive at your destination? Are there many short delays in front of you or just a few long delays? This example demonstrates how you can use “Cloud Numerics” to sift through a cross section of air traffic data large enough to answer these questions. We will use on-time performance data from the U.S. Department of Transportation to analyze the distribution of delays. The data is available at:

https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time

This data set holds data for every scheduled flight in the U.S. from 1987 to 2011 and is, as one would expect, huge! For your convenience, we have uploaded a sample of 32 months (one file per month, with about 500,000 flights in each) to Windows Azure Blob Storage at container URI: https://cloudnumericslab.blob.core.windows.net/flightdata (note that you cannot access this URI directly in a browser; you must use the Storage Client C# APIs, as in Step 2, or a REST API query).

Before You Begin

  • You will need to have Microsoft Codename “Cloud Numerics” installed.
  • Create a Windows Azure subscription for deploying and running the application (if you do not have one already). You also need a Windows Azure storage account set up for saving results.
  • Install “Cloud Numerics” on your local computer where you develop your C# applications.
  • Run the example application as detailed in the “Cloud Numerics” Lab Getting Started wiki page to validate your installation.

You should use two-to-four compute nodes when deploying the application to Windows Azure. One node might not have enough memory, and for larger-sized deployments there are not enough files in the sample data set to assign to all distributed I/O processes. You should not attempt to run the application on a local system because of data transfer and memory requirements.

Note!
You specify how many compute nodes are allocated when you use the Cloud Numerics Deployment Utility to configure your Windows Azure cluster. For details, see this section in the Getting Started guide.

Step 1: Create a Cloud Numerics Project

First, let’s create a Cloud Numerics application using the Cloud Numerics Visual Studio project template.

1.  Start Microsoft Visual Studio and create a solution from the Cloud Numerics project template. Let’s call the solution “OnTimeStats.”

2.  Create a new application, and delete the sample code from within the MSCloudNumericsApp subproject’s Program.cs file. Replace it with the following skeleton of an application.

 

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using msnl = Microsoft.Numerics.Local;
using msnd = Microsoft.Numerics.Distributed;
using Microsoft.Numerics.Statistics;
using Microsoft.Numerics.Mathematics;
using Microsoft.Numerics.Distributed.IO;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

namespace FlightOnTime
{    
    [Serializable]
    public class FlightInfoReader : IParallelReader<double>
    {

    }

    class Program
    {
        static void WriteOutput(string output)
        {

        }

        static void Main()
        {
            // Initialize runtime
            Microsoft.Numerics.NumericsRuntime.Initialize();

            // Shut down runtime
            Microsoft.Numerics.NumericsRuntime.Shutdown();
        }
    }
}

 

3.  Add a .NET reference to Microsoft.WindowsAzure.StorageClient. This assembly is used for reading in data from blobs, and writing results.

Step 2: Add Methods for Reading Data

We’ll use the IParallelReader interface to read in the data from blob storage in the following manner:

  • In the ComputeAssignment method, we must first list the blobs in the container that holds the flight data (https://cloudnumericslab.blob.core.windows.net/flightdata). Then, we assign blobs to each worker node, in round-robin fashion.
  • In the ReadWorker method, we instantiate a list “arrivalDelays”. Each reader then opens the first blob on its list, downloads the text, breaks it into lines, breaks each line into columns, and selects column 44, which holds the arrival delay in minutes. Note that if a flight was canceled or diverted, this value is blank and we must skip the row; the code detects these rows through the flags in columns 49 and 51. We convert the value to double precision and append it to the arrivalDelays list. Then, we either move to the next blob or, if done, convert the list into a NumericDenseArray and return it.

For Step 2, add the following code to the FlightInfoReader class in your skeleton application.

[Serializable]
public class FlightInfoReader : IParallelReader<double>
{

    string _containerAddress;

    public FlightInfoReader(string containerAddress)
    {
        _containerAddress = containerAddress;
    }

    public int DistributedDimension
    {
        get {return 0;}
        set {}
    }

    public Object[] ComputeAssignment(int ranks)
    {
        // Get list of flight info files (blobs) from container
        var container = new CloudBlobContainer(_containerAddress);
        var blobs = container.ListBlobs().ToArray();

        // Allocate blobs to workers in round-robin fashion
        List<Uri> [] assignments = new List<Uri> [ranks];
        for (int i = 0; i < ranks; i++)
        {
            assignments[i] = new List<Uri>();
        }

        for (int i = 0; i < blobs.Count(); i++)
        {
            int currentRank = i % ranks;
            assignments[currentRank].Add(blobs[i].Uri);
        }
        return (Object[]) assignments;
    }

    public msnl.NumericDenseArray<double> ReadWorker(Object assignment)
    {
        
        List<Uri> assignmentUris = (List<Uri>) assignment;

        // If there are no blobs to read, return empty array
        if (assignmentUris.Count == 0)
        {
            return msnl.NumericDenseArrayFactory.Create<double>(new long[] { 0 });
        }

        List<double> arrivalDelays = new List<double>();

        for (int blobCount = 0; blobCount < assignmentUris.Count; blobCount++)
        {
            // Open blob and read text lines
            var blob = new CloudBlob(assignmentUris[blobCount].AbsoluteUri);
            var rows = blob.DownloadText().Split(new char[] {'\n'});
            int nrows = rows.Count();

            // Skip the first row (the header); also note that the last row is empty
            for (int i = 1; i < nrows - 1; i++)
            {
                // Remove quotation marks and split row
                var thisRow = rows[i].Replace("\"", String.Empty).Split(new char[] { ',' });
 
                // Filter out canceled and diverted flights
                if (!thisRow[49].Contains("1") && !thisRow[51].Contains("1"))
                {
                    // Add arrival delay from column 44 to list
                    arrivalDelays.Add(System.Convert.ToDouble(thisRow[44]));
                }
            }
        }
        // Convert list to numeric dense array and return it from reader
        return msnl.NumericDenseArrayFactory.CreateFromSystemArray<double>(arrivalDelays.ToArray());
    }
}

 

Note!
See the blog post titled “Cloud Numerics” Example: Using the IParallelReader Interface for more details on how to use the IParallelReader interface.
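The row handling inside ReadWorker can be tried out locally, without Azure or the “Cloud Numerics” runtime. The following is a minimal sketch, not part of the original application: the DelayParser class and its TryParseArrivalDelay method are hypothetical names introduced for illustration. It assumes the same column positions as the code above (44 for the arrival delay, 49 and 51 for the canceled and diverted flags) and the same simple comma split, and it adds two defensive checks the original does not have: a field-count check and culture-invariant number parsing.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration only; not part of the Cloud Numerics API.
public static class DelayParser
{
    // Column positions taken from the ReadWorker code above.
    const int ArrivalDelayColumn = 44;
    const int CanceledColumn = 49;
    const int DivertedColumn = 51;

    // Returns true and sets delayMinutes when the row describes a flight
    // that was neither canceled nor diverted and has a numeric arrival delay.
    public static bool TryParseArrivalDelay(string row, out double delayMinutes)
    {
        delayMinutes = 0;
        if (string.IsNullOrWhiteSpace(row))
        {
            return false;
        }

        // Same approach as ReadWorker: strip quotation marks, then split on commas.
        // Trim() also removes a trailing '\r' left over from Windows line endings.
        var fields = row.Trim().Replace("\"", string.Empty).Split(',');
        if (fields.Length <= DivertedColumn)
        {
            return false; // Row is too short to contain the expected columns
        }

        if (fields[CanceledColumn].Contains("1") || fields[DivertedColumn].Contains("1"))
        {
            return false; // Canceled or diverted flight: no arrival delay to record
        }

        // Parse with the invariant culture so "12.50" is read the same way on every machine
        return double.TryParse(fields[ArrivalDelayColumn], NumberStyles.Float,
                               CultureInfo.InvariantCulture, out delayMinutes);
    }
}
```

Inside the row loop, a call such as DelayParser.TryParseArrivalDelay(rows[i], out delay) could then replace the explicit column checks, with the delay appended to arrivalDelays only when the method returns true.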

Step 3: Implement the Algorithm

After reading in the delays, we compute the mean, center the data on the mean (so that worse-than-average delays become positive and better-than-average ones negative), and then compute the sample standard deviation. Then, to analyze the distribution of the data, we compute what percentage of flights are more than k standard deviations away from the mean. We also keep track of how many flights are above or below the mean, so as to detect any skew in the data.
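Written out as formulas, the quantities described above are as follows. The symbols are notation introduced here for illustration, not taken from the original: n is the number of flights, x_i the arrival delay of flight i, μ the mean, s the sample standard deviation, and p_k the percentages above and below k standard deviations.

```latex
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i ,
\qquad
s = \sqrt{\frac{n}{n-1}\cdot\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\mu\right)^2}
  = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\mu\right)^2}

p_k^{\text{above}} = \frac{100}{n}\,\Bigl|\{\, i : x_i-\mu > k\,s \,\}\Bigr| ,
\qquad
p_k^{\text{below}} = \frac{100}{n}\,\Bigl|\{\, i : x_i-\mu < -k\,s \,\}\Bigr| ,
\qquad k = 0, 1, \dots, 5
```

The first form of s matches how the Main method computes it: the mean of the squared centered data, rescaled by n/(n-1) before taking the square root.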

For example, if the data were normally distributed, one would expect it to be symmetric around the mean and to obey the three-sigma rule; equivalently, about 16% of flights would be delayed by more than 1 standard deviation, 2.3% by more than 2 standard deviations, and 0.13% by more than 3.
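These reference percentages can be checked numerically. The sketch below is an illustration rather than part of the original application: the NormalTail class is a hypothetical name, and because .NET has no built-in error function it integrates the standard normal density with Simpson's rule instead of calling a library routine.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the Cloud Numerics API.
public static class NormalTail
{
    // Standard normal probability density
    static double Pdf(double x)
    {
        return Math.Exp(-0.5 * x * x) / Math.Sqrt(2.0 * Math.PI);
    }

    // P(Z > k) for k >= 0, computed as 0.5 minus the integral of the density
    // over [0, k], where the integral is approximated with Simpson's rule.
    public static double UpperTail(double k, int intervals = 1000)
    {
        if (k < 0)
        {
            throw new ArgumentOutOfRangeException("k", "k must be non-negative");
        }
        if (intervals <= 0 || intervals % 2 != 0)
        {
            throw new ArgumentException("intervals must be a positive even number", "intervals");
        }
        if (k == 0)
        {
            return 0.5;
        }

        double h = k / intervals;
        double sum = Pdf(0) + Pdf(k);
        for (int i = 1; i < intervals; i++)
        {
            // Simpson weights: 4 for odd interior points, 2 for even ones
            sum += (i % 2 == 1 ? 4.0 : 2.0) * Pdf(i * h);
        }
        return 0.5 - sum * h / 3.0;
    }
}
```

Calling 100 * NormalTail.UpperTail(k) for k = 1, 2, 3 gives the percentages of a normal distribution that lie more than k standard deviations above the mean, which is the baseline the flight data is compared against.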

To implement the algorithm, add the following code as the Main method of the application.

static void Main()
{
    // Initialize runtime
    Microsoft.Numerics.NumericsRuntime.Initialize();

    // Instantiate StringBuilder for writing output
    StringBuilder output = new StringBuilder();

    // Read flight info
    string containerAddress = @"https://cloudnumericslab.blob.core.windows.net/flightdata/";
    var flightInfoReader = new FlightInfoReader(containerAddress);
    var flightData = Loader.LoadData<double>(flightInfoReader);

    // Compute mean and standard deviation
    var nSamples = flightData.Shape[0];
    var mean = Descriptive.Mean(flightData);
    flightData = flightData - mean;
    var stDev = BasicMath.Sqrt(Descriptive.Mean(flightData * flightData) * ((double)nSamples / (double)(nSamples - 1)));

    output.AppendLine("Mean (minutes), " + mean);
    output.AppendLine("Standard deviation (minutes), " + stDev);

    // Compute how much of the data is below or above 0, 1,...,5 standard deviations

    long nStDev = 6;
    for (long k = 0; k < nStDev; k++)
    {
        double aboveKStDev = 100d * Descriptive.Mean((flightData > k * stDev).ConvertTo<double>());
        double belowKStDev = 100d * Descriptive.Mean((flightData < -k * stDev).ConvertTo<double>());
        output.AppendLine("Samples below and above k standard deviations (percent), " + k + ", " + belowKStDev + ", " + aboveKStDev);
    }

    // Write output to a blob
    WriteOutput(output.ToString());

    // Shut down runtime
    Microsoft.Numerics.NumericsRuntime.Shutdown();
}

Step 4: Write Results to a Blob

We add the WriteOutput method, which writes the results to binary large object (blob) storage as .csv-formatted text. The WriteOutput method creates a container “flightdataresult” and a blob “flightdataresult.csv.” You can then view the results by opening this blob in Excel. For example: https://<myAccountName>.blob.core.windows.net/flightdataresult/flightdataresult.csv

where <myAccountName> is the name of your Windows Azure storage account.

Let’s fill in the last missing piece of the application, the WriteOutput method, with the following code.

Note that you’ll have to change "myAccountKey" and "myAccountName" to reflect your own storage account key and name.

static void WriteOutput(string output)
{
    // Write to blob storage
    // Replace "myAccountKey" and "myAccountName" by your own storage account key and name
    string accountKey = "myAccountKey";
    string accountName = "myAccountName";
    // Result blob and container name
    string containerName = "flightdataresult";
    string blobName = "flightdataresult.csv";

    // Create result container and blob
    var storageAccountCredential = new StorageCredentialsAccountAndKey(accountName, accountKey);
    var storageAccount = new CloudStorageAccount(storageAccountCredential, true);
    var blobClient = storageAccount.CreateCloudBlobClient();
    var resultContainer = blobClient.GetContainerReference(containerName);
    resultContainer.CreateIfNotExist();
    var resultBlob = resultContainer.GetBlobReference(blobName);

    // Make result blob publicly readable
    var resultPermissions = new BlobContainerPermissions();
    resultPermissions.PublicAccess = BlobContainerPublicAccessType.Blob;
    resultContainer.SetPermissions(resultPermissions);

    // Upload result to blob
    resultBlob.UploadText(output);
}

 

Step 5: Deploy the Application and Analyze Results

Now, you are ready to run the application. Set AppConfigure as the startup project, build the application, and use the Cloud Numerics Deployment Utility to deploy and run the application.

Let’s take a look at the results. We can immediately see they’re not normally distributed at all. First, there’s skew: about 70% of flights have a delay shorter than the average of 5 minutes. Second, the number of delays tails off much more gradually than a normal distribution would as one moves away from the mean towards longer delays. A step of one standard deviation (about 35 minutes) roughly halves the number of delays, as we can see in the sequence 8.5%, 4.0%, 2.1%, 1.1%, 0.6%. These findings suggest that the tail could be modeled by an exponential distribution.
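The “roughly halves” observation can be checked with a few lines of arithmetic on the percentages quoted above. The sketch below is illustrative only; the TailRatios class is a hypothetical name, and the input values are simply the five percentages from the text.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration only.
public static class TailRatios
{
    // Returns values[i] / values[i - 1] for each consecutive pair of entries.
    public static double[] SuccessiveRatios(double[] values)
    {
        if (values == null || values.Length < 2)
        {
            return new double[0]; // No pairs to compare
        }
        return Enumerable.Range(1, values.Length - 1)
                         .Select(i => values[i] / values[i - 1])
                         .ToArray();
    }
}
```

For the sequence 8.5, 4.0, 2.1, 1.1, 0.6 the ratios come out at approximately 0.47, 0.53, 0.52 and 0.55. A near-constant ratio from one step to the next is what an exponential tail would produce, whereas for a normal distribution the ratio would shrink rapidly with each step.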

[Figure: flightanalysisresult]

This result is both good news and bad news for you as a passenger. There is a good 70% chance you’ll arrive no more than five minutes late. However, the exponential nature of the tail means, based on conditional probability, that if you have already had to wait for 35 minutes there’s about a 50-50 chance you will have to wait for another 35 minutes.
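The 50-50 statement follows from the memoryless property of the exponential distribution. As a sketch, assume the delay beyond the mean, X, has an exponential tail with rate λ, so that P(X > t) = e^{-λt}. The symbols X, λ, s and t are notation introduced here, and the value of λ is an estimate derived from the observed halving per 35 minutes, not a fitted parameter.

```latex
P(X > s + t \mid X > s)
  = \frac{P(X > s + t)}{P(X > s)}
  = \frac{e^{-\lambda (s+t)}}{e^{-\lambda s}}
  = e^{-\lambda t}
  = P(X > t)

e^{-35\lambda} \approx \tfrac{1}{2}
\;\Longrightarrow\;
\lambda \approx \frac{\ln 2}{35} \approx 0.02 \text{ per minute},
\qquad
P(X > 70 \mid X > 35) = e^{-35\lambda} \approx \tfrac{1}{2}
```

In other words, under this model the time already spent waiting tells you nothing about how much longer the wait will be.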