You sit at the airport only to watch your departure time get delayed. You wait. Your flight gets delayed again, and you wonder, “What’s happening?” Can you predict how long it will take to arrive at your destination? Are there many short delays ahead of you or just a few long ones? This example demonstrates how you can use “Cloud Numerics” to sift through a cross section of air traffic data large enough to answer these questions. We will use on-time performance data from the U.S. Department of Transportation to analyze the distribution of delays. The data is available at:
This data set holds data for every scheduled flight in the U.S. from 1987 to 2011 and is, as one would expect, huge! For your convenience, we have uploaded a sample of 32 months (one file per month, with about 500,000 flights in each) to Windows Azure Blob Storage at container URI: http://cloudnumericslab.blob.core.windows.net/flightdata (note that you cannot access this URI directly in a browser; you must use the Storage Client C# APIs, as in Step 2, or a REST API query).
Before You Begin
- You will need to have Microsoft Codename “Cloud Numerics” installed.
- Create a Windows Azure subscription for deploying and running the application (if you do not have one already). You also need a Windows Azure storage account set up for saving results.
- Install “Cloud Numerics” on your local computer where you develop your C# applications.
- Run the example application as detailed in the “Cloud Numerics” Lab Getting Started wiki page to validate your installation.
You should use two to four compute nodes when deploying the application to Windows Azure. One node might not have enough memory, and for larger deployments there are not enough files in the sample data set to assign to all distributed I/O processes. You should not attempt to run the application on a local system because of data transfer and memory requirements.
Note: You specify how many compute nodes are allocated when you use the Cloud Numerics Deployment Utility to configure your Windows Azure cluster. For details, see the Getting Started guide.
Step 1: Create a Cloud Numerics Project
First, let’s create a Cloud Numerics application using the Cloud Numerics Visual Studio project template.
1. Start Microsoft Visual Studio and create a solution from the Cloud Numerics project template. Let’s call the solution “OnTimeStats.”
2. Create a new application, and delete the sample code from within the MSCloudNumericsApp subproject’s Program.cs file. Replace it with the following skeleton of an application.
3. Add a .NET reference to Microsoft.WindowsAzure.StorageClient. This assembly is used for reading in data from blobs, and writing results.
Step 2: Add Methods for Reading Data
We’ll use the IParallelReader interface to read in the data from blob storage in the following manner:
- In the ComputeAssignment method, we must first list the blobs in the container that holds the flight data (http://cloudnumericslab.blob.core.windows.net/flightdata). Then, we assign blobs to each worker node, in round-robin fashion.
- In the ReadWorker method, we instantiate a list, “arrivalDelays”. Each reader then opens the first blob on its list, downloads the text, splits it into lines, splits each line into columns, and selects column 44, which holds the arrival delay in minutes. Note that if a flight was canceled or diverted, this value is blank and we must skip the row. We convert the value to a double-precision type and append it to the arrivalDelays list. Then, we either move to the next blob or, if done, convert the list into a NumericDenseArray and return it.
For Step 2, add the following code to the FlightInforReader class in your skeleton application.
Note: See the blog post titled “Cloud Numerics” Example: Using the IParallelReader Interface for more details on how to use the IParallelReader interface.
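The two reading steps described above can be sketched independently of the Cloud Numerics API. The following is a minimal Python illustration, not the C# reader class itself: the function names, the blob names, and the three-column sample are all hypothetical, and the delay column index is a parameter because the sketch does not use the real file layout. It splits lines on plain commas, as the description above does, which assumes no field contains an embedded comma at a position that would shift the target column.

```python
from typing import Dict, List


def assign_round_robin(blob_names: List[str], n_workers: int) -> Dict[int, List[str]]:
    """Deal blob names out to workers one at a time, like dealing cards."""
    assignment: Dict[int, List[str]] = {w: [] for w in range(n_workers)}
    for i, name in enumerate(blob_names):
        assignment[i % n_workers].append(name)
    return assignment


def parse_arrival_delays(text: str, delay_column: int) -> List[float]:
    """Extract one numeric column from comma-separated text, skipping blanks."""
    delays: List[float] = []
    for line in text.splitlines():
        columns = line.split(",")
        if delay_column >= len(columns):
            continue  # short or malformed row
        field = columns[delay_column].strip()
        if not field:
            continue  # canceled or diverted flight: no arrival delay recorded
        try:
            delays.append(float(field))
        except ValueError:
            continue  # header row or other non-numeric text
    return delays


# Hypothetical blob names, and a tiny made-up sample with the delay in column 2.
blobs = [f"month_{i:02d}.csv" for i in range(1, 6)]
plan = assign_round_robin(blobs, 2)
sample = "Carrier,Flight,ArrDelay\nAA,100,12.00\nUA,200,\nDL,300,-5.00\n"
delays = parse_arrival_delays(sample, delay_column=2)
print(plan)
print(delays)
```

With 32 monthly files, round-robin assignment gives each of two to four workers a nearly equal share, which is consistent with the node counts recommended earlier.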
Step 3: Implement the Algorithm
After reading in the delays, we compute the mean, center the data on the mean (so that worse-than-average delays are positive and better-than-average ones are negative), and then compute the sample standard deviation. Then, to analyze the distribution of the data, we count what fraction of flights are more than k standard deviations away from the mean. We also keep track of how many flights are above or below the mean, so as to detect any skew in the data.
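As a rough illustration of these steps, here is a small single-machine Python sketch. It is not the distributed Cloud Numerics implementation; the function name `delay_statistics`, the `max_k` parameter, and the six sample delays are made up for the example.

```python
import math
from typing import Dict, List


def delay_statistics(delays: List[float], max_k: int = 5) -> Dict[str, object]:
    """Mean, sample standard deviation, and tail fractions on each side of the mean."""
    n = len(delays)
    mean = sum(delays) / n
    centered = [d - mean for d in delays]  # positive = worse than average
    # The sample standard deviation divides by n - 1 rather than n.
    std = math.sqrt(sum(c * c for c in centered) / (n - 1))
    # Fraction of flights more than k standard deviations above / below the mean.
    above = [sum(1 for c in centered if c > k * std) / n for k in range(max_k + 1)]
    below = [sum(1 for c in centered if c < -k * std) / n for k in range(max_k + 1)]
    return {"mean": mean, "std": std, "above": above, "below": below}


# Made-up delays in minutes: mostly early or on time, with one long delay.
stats = delay_statistics([-10.0, -5.0, 0.0, 0.0, 5.0, 70.0], max_k=2)
print(stats)
```

In this toy sample five of six flights are below the mean while one sits more than two standard deviations above it, which is the kind of skew the above/below counts are meant to expose.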
For example, if the data were normally distributed, one would expect it to be symmetric around the mean and to obey the three-sigma rule; equivalently, about 16% of flights would be delayed by more than 1 standard deviation beyond the mean, 2.3% by more than 2 standard deviations, and 0.13% by more than 3.
To implement the algorithm, we add the following code as the Main method to the application.
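The one-sided normal percentages can be reproduced from the complementary error function. This is a short Python check using only the standard library; the helper name `normal_upper_tail` is chosen for the example.

```python
import math


def normal_upper_tail(k: float) -> float:
    """P(Z > k) for a standard normal variable Z."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(f"more than {k} sd above the mean: {100 * normal_upper_tail(k):.2f}%")
# prints 15.87%, 2.28%, and 0.13% for k = 1, 2, 3
```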
Step 4: Write Results to a Blob
We add the WriteOutput method that writes results to binary large object (blob) storage as .csv-formatted text. The WriteOutput method will create a container “flightdataresult” and a blob “flightdataresult.csv.” You can then view this blob by opening your file in Excel. For example: http://<myAccountName>.blob.core.windows.net/flightdataresult/flightdataresult.csv
Here, <myAccountName> is the name of your Windows Azure storage account.
Let’s fill in the last missing piece of the application, the WriteOutput method, with the following code.
Note that you’ll have to change "myAccountKey" and "myAccountName" to reflect your own storage account key and name.
Step 5: Deploy the Application and Analyze Results
Now, you are ready to run the application. Set AppConfigure as the startup project, build the application, and use the Cloud Numerics Deployment Utility to deploy and run the application.
Let’s take a look at the results. We can immediately see they’re not normally distributed at all. First, there’s skew: about 70% of the flight delays are better than the average of 5 minutes. Second, the number of delays tails off much more gradually than a normal distribution would as one moves away from the mean towards longer delays. A step of one standard deviation (about 35 minutes) roughly halves the number of delays, as we can see in the sequence 8.5%, 4.0%, 2.1%, 1.1%, 0.6%. These findings suggest that the tail could be modeled by an exponential distribution.
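The halving claim can be checked directly from the percentages quoted above. The following Python sketch computes the ratio between successive tail fractions and their geometric mean; a constant ratio from step to step is what an exponential tail would produce. The variable names are illustrative, and only the five quoted figures are used.

```python
import math

# Tail percentages quoted in the text, one standard deviation apart.
tail_percent = [8.5, 4.0, 2.1, 1.1, 0.6]

# Ratio of each value to the previous one; an exponential tail gives a constant ratio.
ratios = [b / a for a, b in zip(tail_percent, tail_percent[1:])]

# Geometric mean of the ratios, i.e. the average decay factor per step.
decay_per_step = math.exp(sum(math.log(r) for r in ratios) / len(ratios))

print([round(r, 3) for r in ratios])
print(round(decay_per_step, 3))
```

The ratios all fall between roughly 0.47 and 0.55, so "roughly halves" is a fair summary of these five figures.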
This result is both good news and bad news for you as a passenger. There is a good 70% chance you’ll arrive no more than five minutes late. However, the exponential nature of the tail means, based on conditional probability, that if you have already had to wait for 35 minutes, there’s about a 50-50 chance you will have to wait for at least another 35 minutes.
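The 50-50 figure follows from the memoryless property of an exponential tail. As a sketch, assume that beyond some threshold the delay D satisfies P(D > t) = C 2^{-t/σ}, with σ of about 35 minutes (the halving per standard deviation noted above). The constant C and the exact functional form are modeling assumptions for illustration, not fitted values.

```latex
\[
P(D > t + \sigma \mid D > t)
  = \frac{P(D > t + \sigma)}{P(D > t)}
  = \frac{C\,2^{-(t+\sigma)/\sigma}}{C\,2^{-t/\sigma}}
  = 2^{-1}
  = \frac{1}{2}.
\]
```

The result does not depend on t: under this model, however long you have already waited, the chance of waiting at least one more σ stays at about one half.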