Commuting Stats with Azure Functions and Azure Data Lake


Like most people I have to get up in the morning, and get to work in some way. This usually means hopping on the metro, possibly followed by a bus afterwards, or a short stroll. (Depends on my end destination, which can vary.) The first part of the journey with the metro is fairly constant though.

And also like most people I have experienced annoyance when I get to the platform, and the signs say "there´s a delay" or something similar. While being late for the first cup of coffee in the morning is something one could live with, it is more worrying if this means I´ll be late for a meeting because I haven´t allocated a buffer. (I try not to have meetings too early, but sometimes it can´t be avoided. And I hate being late for meetings.)

What I have done to try to make me less surprised is to send myself a Slack notification, (TimerTriggered Azure Function), roughly a half hour before I usually leave telling me if there are any delays to be aware of. If the notification says one train is delayed I usually don´t worry. If it says five trains are delayed I look into it further to decide if I need to rush out the door or make alternative transportation plans.

My metro´s punctuality is not exactly in the Shinkansen range. (https://en.wikipedia.org/wiki/Shinkansen#Punctuality) To be fair this pretty much applies for all non-Shinkansen trains. But how bad is it really? While I can remember if there were a lot of issues one particular week, (at least for the following two weeks), I don´t know how it averages out over time. And I don´t care about the average for the entire city - I care about my stop specifically. As with most of my blog posts this leads to an opportunity to play around with Azure technologies to try and get closer to an answer 🙂

The public transportation provider I use has an open API which means two things:
- I don´t have to do a lot of clever things to acquire the data. The API tells me which metros are coming at my stop, with both the arrival/departure time listed on its schedule as well as the expected time (as in delays).
- You can follow along with the samples if you like even if you don´t live in my area 🙂

True, it´s probably more fun if you are able to locate a similar API for the place you live, but this is more about building a working use case of Azure tech than the low level details of implementation.

I know that the data I get from the API is in JSON, and I know that it´s fairly easy to just dump these responses as files into storage. It will however create a fairly large amount of files, and parsing through them individually afterwards might be cumbersome.

Putting on the "architect hat" this seems to be what is referred to as a datadriven solution. While not diving in to all the intricacies of different patterns, one often refers to there being four phases/components of a datadriven solution:
- Ingest: acquire the data points, and push them somewhere.
- Store: store the data points (could be as files, could be as database records).
- Process: analyze and process the data.
- Present: output a distilled presentation of the insight gained from the collection of data points.

Azure offers a range of services for all of these needs, and it would be a fairly verbose blog post if I went into all the details of when to choose which, so I´ll spare you the considerations taken for now.

I went with the following Microsoft services to try and solve this:
- Azure Functions: calling API on a schedule, and storing the responses as JSON files. (Ingest.)
- Azure Data Lake Store: storing the JSON files. (Store.)
- Azure Data Lake Analytics: extract data from the JSON files. (Process.)
- Power BI and Excel: for "gaining insight". (Present.)

Sure, I could use Blob Storage for the files, or I could have uploaded the JSON to DocumentDB. There are certainly other options, but this is the path I have chosen for this setup.

Let´s take a closer look at these phases.

Azure Functions - Ingest
The API is RESTful, and I want to call it on a regular basis, so Azure Functions with a TimerTrigger seems like a good choice for me. To be able to serialize/deserialize the response I needed to generate some classes - which is dead simple if you use something like http://json2csharp.com to help you out 🙂

While the JSON is perfectly understandable when reading it, I quickly saw that there were some things of little value to me, and I didn´t need the same hierachy either, so I created a custom class for the properties I care about, and that´s the format of the JSON I output.

As stated above I went with uploading the files to Azure Data Lake Store, instead of uploading it to Blob Storage (which perhaps would be the default for many of us). The code to do this isn´t more complicated than writing to blob, and it´s a fairly simple process to dump things there. (Remember to verify that the app credentials you use are assigned access on the folder it is trying to upload to.)

Azure Data Lake Store (ADLS) - Store
However, I think we might need to pause here for a moment. Well, if you happen to know all the ins and outs of big data already maybe you don´t need it, but running with the assumption that not all devs are there I´ll elaborate. (I am definitely not a big data expert.)

The "data lake" in Azure is a place where you can put all of your files whether they are in a binary or a text-based format. There´s no limit to the amount of files, or size restrictions. Basically one big happy hard drive. It also works similar to a hard drive in the sense that you create folders, ACL the folders, etc. (Blob Storage lets you "fake" folders, but work slightly different underneath.)

The access control is based on AAD, which is different from the shared access keys Blob Storage uses.

The real kicker though is that it is integrated with analytics tools letting you work directly on the files as opposed to when pushing and popping files between blob and in-memory processes. Which means that if you have a bunch of JSON files you can execute SQL over them without any extra hassle.

So, getting back to actual dev work...

You should go through the setup procedure for authentication first:
Service-to-service auth with Data Lake Store using AAD:
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-authenticate-using-active-directory

I also based myself on the instructions here for interacting with ADLS (this is just for reference, you don´t need to follow any tasks in that documentation piece):
ADLS - Using .NET SDK
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-get-started-net-sdk

The actual Azure Function code looks like this:

using System;
using System.IO;
using System.Threading;
using System.Net.Http.Headers;
using Newtonsoft.Json;

using Microsoft.Azure.Management.DataLake.Store;
using Microsoft.Azure.Management.DataLake.Store.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
using Microsoft.Rest.Azure.Authentication;

public static void Run(TimerInfo input, TraceWriter log)
{
    log.Info($"C# timer triggered function called.");

    DataLakeStoreAccountManagementClient _adlsClient;
    DataLakeStoreFileSystemManagementClient _adlsFileSystemClient;

    string _adlsAccountName;
    string _subId;

    _adlsAccountName = "contoso_adls"; 
    _subId = "guid";
    
    //auth stuff
    var domain = "contoso.onmicrosoft.com";
    var webApp_clientId = "adls_app_id";
    var clientSecret = "adls_app_secret";
    var clientCredential = new ClientCredential(webApp_clientId, clientSecret);
    
    var creds = ApplicationTokenProvider.LoginSilentAsync(domain, clientCredential).Result;
    
    // Create client objects and set the subscription ID
    _adlsClient = new DataLakeStoreAccountManagementClient(creds) { SubscriptionId = _subId };
    _adlsFileSystemClient = new DataLakeStoreFileSystemManagementClient(creds);
    
    //Look up Ruter API
    using (var client = new HttpClient())
    {
        client.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));
        client.DefaultRequestHeaders.Add("User-Agent", "AzureFunctions");
        
        //ID for "Jernbanetorget" Stop
        var uri = "http://reisapi.ruter.no/StopVisit/GetDepartures/3010011";
        HttpResponseMessage response = client.GetAsync(uri).Result;
        var responseString = response.Content.ReadAsStringAsync().Result;

        var timeStamp = System.DateTime.Now.ToString("yyyy-MM-dd-HH-mm-ss");
        log.Info($"Time Stamp: {timeStamp}");
        var fileName = "JERN_" + timeStamp + ".json";
        string srcFolderPath = @"D:\home\site\"; 
        string srcFilePath = Path.Combine(srcFolderPath, fileName); 
        string destFolderPath = "/Ruter/";
        string destFilePath = Path.Combine(destFolderPath, fileName);
        
        //Save to the file system of app service temporarily
        System.IO.File.WriteAllText(@"D:\home\site\" + fileName, responseString);
        //Delay to allow file to be written to disk before uploading
        System.Threading.Thread.Sleep(2000);

        log.Info($"src: {srcFilePath}, dest: {destFilePath}");

        try
        {
            _adlsFileSystemClient.FileSystem.UploadFile(_adlsAccountName, srcFilePath, destFilePath, overwrite:true);
            log.Info($"Uploaded file: {fileName}");
        }
        catch(Exception ex)
        {
            log.Info($"Upload went foo: {ex.Message}");
        }      

        var departures = JsonConvert.DeserializeObject<List<RootObject>>(responseString);       
        
        foreach (var departure in departures)
        {            
            var depJson = new Departure{DestinationName= departure.MonitoredVehicleJourney.DestinationName,
                AimedArrivalTime = departure.MonitoredVehicleJourney.MonitoredCall.AimedArrivalTime,
                ExpectedArrivalTime = departure.MonitoredVehicleJourney.MonitoredCall.ExpectedArrivalTime,
                Delay = departure.MonitoredVehicleJourney.Delay,
                VehicleJourneyName = departure.MonitoredVehicleJourney.VehicleJourneyName,
                InCongestion = departure.MonitoredVehicleJourney.InCongestion,
                DataFrameRef = departure.MonitoredVehicleJourney.FramedVehicleJourneyRef.DataFrameRef,
                DatedVehicleJourneyRef = departure.MonitoredVehicleJourney.FramedVehicleJourneyRef.DatedVehicleJourneyRef};

            DateTime aimed = Convert.ToDateTime(departure.MonitoredVehicleJourney.MonitoredCall.AimedArrivalTime);
            DateTime expected = Convert.ToDateTime(departure.MonitoredVehicleJourney.MonitoredCall.ExpectedArrivalTime);
            TimeSpan delay = expected - aimed;            
            depJson.Delay = delay.ToString();
            log.Info($"Delay (Ruter API): {departure.MonitoredVehicleJourney.Delay}, Delay (Calculated): {depJson.Delay}");
                
            var dep = JsonConvert.SerializeObject(depJson);                        
            fileName = $"{depJson.DatedVehicleJourneyRef}_{depJson.DataFrameRef}.json";
            
            //Save to the file system of app service temporarily
            System.IO.File.WriteAllText(@"D:\home\site\" + fileName, dep);
            //Delay to allow file to be written to disk before uploading
            System.Threading.Thread.Sleep(2000);

            destFolderPath = "/Ruter/departures/";
            srcFilePath = Path.Combine(srcFolderPath, fileName);         
            destFilePath = Path.Combine(destFolderPath, fileName);

            try
            {
                _adlsFileSystemClient.FileSystem.UploadFile(_adlsAccountName, srcFilePath, destFilePath, overwrite:true);
                log.Info($"Uploaded file: {fileName}");
            }
            catch(Exception ex)
            {
                log.Info($"Upload went foo: {ex.Message}");
            }
        }        
    }      
}

public class Departure
{
    public string DataFrameRef { get; set; }
    public string DatedVehicleJourneyRef { get; set; }
    public string DestinationName { get; set; }
    public string AimedArrivalTime { get; set; }
    public string ExpectedArrivalTime { get; set; }
    public bool InCongestion { get; set; }
    public string Delay { get; set; }
    public string VehicleJourneyName { get; set; }
}

public class FramedVehicleJourneyRef
{
    public string DataFrameRef { get; set; }
    public string DatedVehicleJourneyRef { get; set; }
}

public class TrainBlockPart
{
    public int NumberOfBlockParts { get; set; }
}

public class MonitoredCall
{
    public int VisitNumber { get; set; }
    public bool VehicleAtStop { get; set; }
    public string DestinationDisplay { get; set; }
    public string AimedArrivalTime { get; set; }
    public string ExpectedArrivalTime { get; set; }
    public string AimedDepartureTime { get; set; }
    public string ExpectedDepartureTime { get; set; }
    public string DeparturePlatformName { get; set; }
}

public class MonitoredVehicleJourney
{
    public string LineRef { get; set; }
    public string DirectionRef { get; set; }
    public FramedVehicleJourneyRef FramedVehicleJourneyRef { get; set; }
    public string PublishedLineName { get; set; }
    public string DirectionName { get; set; }
    public string OperatorRef { get; set; }
    public string OriginName { get; set; }
    public string OriginRef { get; set; }
    public string DestinationRef { get; set; }
    public string DestinationName { get; set; }
    public string OriginAimedDepartureTime { get; set; }
    public string DestinationAimedArrivalTime { get; set; }
    public bool Monitored { get; set; }
    public bool InCongestion { get; set; }
    public string Delay { get; set; }
    public TrainBlockPart TrainBlockPart { get; set; }
    public string BlockRef { get; set; }
    public string VehicleRef { get; set; }
    public int VehicleMode { get; set; }
    public string VehicleJourneyName { get; set; }
    public MonitoredCall MonitoredCall { get; set; }
    public string VehicleFeatureRef { get; set; }
}

public class OccupancyData
{
    public bool OccupancyAvailable { get; set; }
    public int OccupancyPercentage { get; set; }
}

public class Deviation
{
    public int ID { get; set; }
    public string Header { get; set; }
}

public class Extensions
{
    public bool IsHub { get; set; }
    public OccupancyData OccupancyData { get; set; }
    public List<Deviation> Deviations { get; set; }
    public string LineColour { get; set; }
}

public class RootObject
{
    public string RecordedAtTime { get; set; }
    public string MonitoringRef { get; set; }
    public MonitoredVehicleJourney MonitoredVehicleJourney { get; set; }
    public Extensions Extensions { get; set; }
}

And to not get any error on missing references you will want this in your project.json:

{
  "frameworks": {
    "net46":{
      "dependencies": {
        "Microsoft.Azure.Management.DataLake.Store": "2.2.0",
        "Microsoft.Rest.ClientRuntime.Azure.Authentication": "2.3.1"
      }
    }
   }
}

Azure Data Lake Analytics (ADLA) - Process
This leads us to the analytical part of the data lake. ADLA lets you crunch through the contents of ADLS using a language called U-SQL which is a mix between C# and SQL. You create jobs, allocate how much resources Azure should use to run the job, and wait for it to finish. The output of the job is in turn also placed in a folder in ADLS. (So, you should maintain a reasonable structure inside your ADLS account.)

I´m no SQL guru or dba, so for my analysis I went with a fairly simple approach. Loop through all departures, extract the ones that are not on time, and write the results to a csv file. (Since this file is not used in further programmatic tasks I don´t need the overhead of JSON.) I´m also being unrealistically strict about delays - even if it is just one second off it still counts as a delay 🙂 It is kind of silly anyways since it´s sort of hard to work with that precision level on such an object - it´s not like you will notice the number of seconds it uses when coming to a stop. There is also the possibility of a negative delay - in other words, it arrives ahead of time. Which can in turn also be a bad thing if it leaves the platform two minutes before I plan to be there.

You can write and execute the query directly in the Azure Portal, or you can run it through Visual Studio. A minor snag that I ran into when using the portal was the need for registering the assemblies I use, so you might want to check out these instructions first:
How to register U-SQL Assemblies in your U-SQL Catalog
https://blogs.msdn.com/azuredatalake/2016/08/26/how-to-register-u-sql-assemblies-in-your-u-sql-catalog

Anyways - the U-SQL works out like this:

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

DECLARE @out string="adl://contoso.azuredatalakestore.net/foo/delayedDepartures.csv";

@departure =
  EXTRACT DataFrameRef string,
  DatedVehicleJourneyRef string,
  DestinationName string,
  AimedArrivalTime string,
  ExpectedArrivalTime string,
  InCongestion string,
  Delay string,
  VehicleJourneyName string,
  filename string
  FROM "adl://contoso.azuredatalakestore.net/Ruter/departures/{filename:*}.json"
  USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();

@testthis = SELECT * 
  FROM @departure
  WHERE Delay != "00:00:00";

OUTPUT @testthis TO @out USING Outputters.Csv();

You will notice there are four parts to it. Declaring where the output goes, defining the format of the data (schema on read) as well as locating the source files, the actual query to run, and concluding by outputting a csv file.

Now, it is of course easy to say for our little use case that there´s not a big gain by using ADLA. Clearly I could have done things programmatically in my Function to handle things like this. Or I could have written a different app that parsed thing for me without involving anything SQL-like. That is true - for what I do here it is more of a Proof-of-Concept that you can use ADLA. But for more complex queries you don´t want to mess around with code dependencies, and have the "data people" talk to developers each time they want to get some work done. My use of Azure Functions is as a neutral collector of data. The code doesn´t have an opinion on the data; it simply makes sure the data gets to a point where they can be turned into something more useful than isolated data points. (Yes, I slightly tweak the data that I store, but I don´t analyze them in the Function.)

Do be aware though that as the number of files grow it might take some time to actually complete the job. Here´s a test run with roughly 1500 files and one work unit assigned.

I have no idea why it takes this long, but I´m guessing there are things I could optimize. Different queries, multiple queries and/or jobs. If I was using this for a "production use case" I would probably run the job during the night anyways, and have fresh analytics served in the morning on how the previous day went by so I might not be bothered by it taking an extra hour. Just don´t expect real time results this way 🙂

You can of course throw more power at it by allocating more work units. It comes at a higher cost pr minute, but if the query scales perfectly it will balance itself out by taking less time to complete. (I´m guessing not all jobs actually are that perfect.)

Excel & Power BI - Present
Ok, so now we have a nice little csv file with the numbers for a selected number of departures. "Let me see - the 11:14 train arrived at 11:14:30, and the 12:13 in the other direction arrived 10 seconds early..." What you´re probably interested in knowing are the highlights - what is the percentage of delayed versus not-delayed? What is the average delay? In which direction are the trains most likely to be delayed?

The query I ran above does not count the number of departures, so the percentage can´t easily be extracted. I ran a separate query to get all departures, and manually verified by counting the number of files to 315. For some reason the API doubles up the number of departures as it works with both the theoretical (scheduled), and the actual departure. So, you can have one entry having an AimedArrivalTime of 10:00:00, and another one saying 10:00:05. To complicate matters further - if the AimedArrivalTime is 10:00:05, and the ExpectedArrivalTime is 10:00:05 it doesn´t count as delayed. This means that technically you need to match up the theoretical, and the actual departure, and calculate delays accordingly. This would require me to do more "crunching" in the Function to increase the accuracy level. For the purpose of this exercise I will skip this part, and allow for some rougher calculations. The raw numbers are there, so they can be revisited if I want to.

Accepting the numbers aren´t full precision I´m seeing 89 delays out of 157 departures; meaning roughly 57% of them are delayed in some way.

The average delay is easy enough to do in Excel:

Note that I´m averaging the delayed departures, I´m not taking into account the undelayed departures when doing the average.

Excel is powerful, but if you´re into playing with the data in a more fancy manner you could look into Power BI. It enables you to generate reports, and graphs for dashboard type views in an easy manner. If I want to know whether there are specific departures during a day which are consistently delayed I could just ask and get a nice graph:

(The "DatedVehicleJourneyRef" id attribute on an individual departure seems to be consistent regardless of which day it is. There are not enough data points to actually see a pattern here.)

Taking steps further one could hook into APIs for weather reports, and match up this with the departures. If I wake up in the winter, and see that there´s suddenly a waist-high amount of snow outside I figure there might be the odd delay without assistance from a computer to tell me so. But what about things like rain? The trains are of course waterproof, but is there a correlation between water dripping from the skies and inability to arrive on time? I don´t know, but combining the data sources might indicate a pattern.

Or if you´re a better data wrangler than me you can hook these two streams, (and possibly other sources as well), into Machine Learning. Before bedtime you can have a notificiation telling you there´s a 90% chance your train will be 10 minutes off the next morning 🙂 Get up earlier, or auto-send a mail the next morning that you will be late are of course both valid options.

Wrapping up
Did I get any closer to figuring things out? Do I have a better idea of the reliability of the metro? Well, hard to say. I don´t really have enough historical data at the moment to pass judgment. There are many delays, but most of them are small so it´s not a big issue.

I am able to see that there is a skew with regards to which direction has the most delays. My stop is only a few stops from the end/start of the line, and unsurprisingly it´s the train nearing the end of the line that has the most delays. As long as the delay is short it seems it only leads to less break time for the driver when he arrives at the end destination. He can then "turn the train around" and reset things accordingly.

Regardless of that it was a fun tech demo to get more familiar with Azure Data Lake, and with some imagination you can probably turn it into something more useful for your needs.

Comments (1)

  1. Sjoukje Zaal says:

    Nice one, Andreas! Great combination of different technologies.

Skip to main content