Francesco Cogno's blog

Where data hit programming languages. Hard. In the shin.

Azure Data Lake BI sample using Open Data Lombardia

Azure Data Lake storage and Analytics compose the brand new Big Data solution provided by Microsoft. These services are fully managed (PaaS) services built on top of Hadoop, Yarn and many other Apache technologies. This diagram shows the high-level architecture:

https://acom.azurecomcdn.net/80C57D/cdn/cvt-82db65c8a8fbe1144a2cc781c7854fd6b693ac4ed7f359892ce36e4a73c75335/images/page/solutions/data-lake/data-lake-diagram.png.

Unique in this diagram is the U-SQL language. U-SQL is a Microsoft new language. It allows you to interact with semi-structured and structured data with a SQL-like syntax. As with every yarn client the U-SQL tasks can be parallelized easily, achieving unparalleled query speeds. Don’t be fooled by SQL part of U-SQL though: underneath everything is C#. This means that you can – if necessary – use the expressiveness of C# to handle your data. This also means your U-SQL scripts will be compiled to native code before execution. Last but not least you can use Visual Studio for the authoring, execution and debugging phase.

Getting started with Azure Data Lake store

Azure data lake store, once created, is a file system accessible as URL and the new URI scheme adl:

The Azure portal shows the file system like this:

You can create folders and upload files directly from the portal. Notice that we don’t have to specify awkward Hadoop parameters. Everything is handled behind the scenes. PaaS magic.

Open Data Lombardia

In order to show you some U-SQL we need a dataset. I choose to use the Open Data Regione Lombardia datasets accessible here: https://www.dati.lombardia.it/. The site gives access to very interesting datasets, including sensor data. I’ve downloaded some weather data (stored in a file called sensori_arpa_2015.csv, ~250 MB) and related dimension dataset (called Stazioni_Meteorologiche.csv ~90KB) for this demo. The data is provided under the Italian Open Data License v2.0 available here: http://www.dati.gov.it/iodl/2.0/.

To upload the files to Azure Data lake you can use the portal. You can, among other methods, use PowerShell too:

AzureRmDataLakeStoreItem -Account <account> -Path <local_path> -Destination <remote_path>

One cool feature available in the portal is the file preview. Azure tries to interpret your file and, if successful, will show you the tabular data. This is handy because we get to see the columns in our datasets and we can inspect the data type.

As you can see, our file comes with the first row as header. While explanatory it does pose a problem while parsing the actual data. At the moment of writing this post there is no way of programmatically skip the header. The feature will come in the future (see https://feedback.azure.com/forums/327234-data-lake/suggestions/11500386-skip-header-in-extractor). For the time being we just re-upload the files without the header (in a following blog post I’ll show you how to handle headers in Spark).

First job

Azure Data lake jobs can be authored in the Azure portal directly or using the free Visual Studio extension. I suggest you the latter as it comes with a local simulator (the free Visual Studio extension works on the Visual Studio Community edition, which is also free). Either way we want to perform the simplest possible task: read the data and copy it as it is in another file.

This can be accomplished with this very simple U-SQL script:

Here worth noting are:

  1. The Csv and Tsv SerDe. In U-SQL terminology we call them Extractors and Outputters. So we extract from extractors and output to outputters.
  2. We can specify the data type directly in the extractor. This is great: we don’t have to parse the type ourselves as long as its string representation is parsable in the specified type. In case of problems we can instruct Azure Data Lake to either throw an error – which we can debug directly in VS – or to ignore the offending row. Sometimes, especially when munching trillions of rows for statistical analysis, handling a mis-formatted row is simply not worth the effort.

We can submit the job and it will give us this trivial execution plan (click on the image to see a larger version):

This is just an image but if you submit it with Visual Studio you will see the data flowing real-time. Notice that the first operator has read 233 MB (that’s unsurprising since the file size if 233 MB) but written 137 MB. This means the data is serialized in a different, more efficient, fashion. We copied 7,2 million rows in our destination file that will be present in Data Lake under the output subfolder (U-SQL jobs create the folders for you as long as you have the rights to do so). 

Aggregate, enter BI

One typical task is to count the occurrences of a specific entry (being dollars, tweets, or provolone slices). The SQL syntax makes this super-easy:

Please note we are ordering the resultset in the OUTPUT task and not in the SELECT. Attempting to do so would cause this error:

An ORDER BY clause used outside of an OUTPUT statement must have a FETCH subclause..

Basically if we sort an intermediate resultset we must limit its cardinality. This makes sense as the sort is a very expensive operation. Of course we can sort the final resultset as done above.

Notice we have read 233 MB again but this time we have serialized only 28 MB. This is good (great actually) because it means Azure Data lake is pushing the cardinality-minimizing operators down the execution plan (well, up in this case as the execution plan is from top to bottom :)).

Join to denormalize

As stated previously there are two datasets. One for the dimensions and the other for the measures. This is common in DWH scenarios but for querying purposes sometimes we have denormalize these two datasets. Fortunately a simple join handles the issue:

Note one little thing here. Since the expressions are C# evaluated we must use the double equal sign to check for equality, like this:

e.SensorId == s.SensorId

 If we were to specify a single equal sign, as we do in T-SQL, the parser would interpret it as an assignment and the join would error out.

The execution plan is more interesting. Note that our denormalized data is almost 1 GB. Also note that we are storing the same number of rows read from the event dataset. This is expected since we used a LEFT OUTER JOIN.

Insight!

Our dataset is ready to be analysed. We have sensor data and now we know what sensors mean. All we have to do is project this dataset in a meaningful way. Let’s answer this business requirement:

show me the temperature evolution of one specific sensor: the one in Lambrate, Milan.

Here I have read the starting datasets again but we could have used the pre-calculated denormalized one instead.

Once the job completes we can render it using the PowerBI connector in Excel:

Not surprisingly, the temperature drops from october to dicember (our whole time span). Also it’s interesting to note that in the 2015 the temperature has gone below 0°C only a few days (global warming and such).

The execution plan is this one:

Going on

Two other info before closing this long post.
1. The publish to PowerBI feature is coming. With that feature we will be able to generate graphs like the one above directly in the web. Also, the graphs will update automatically.
2. Did I mention you can create tables in Azure Data Lake? It’s very easy and useful if you plan to reuse the data in later jobs. Here is how to store the denormalized data in a table:

 

Rember, the types are C# here, not ANSI-SQL. So in order to specify a nullable column ve must use the question mark C# syntax (see here https://msdn.microsoft.com/en-us/library/1t3y8s4s.aspx for more details). In short, whenever we see <type>? we should interpret as <type> NULL. Also remember: unless explicitly NULLable, all fields are NOT NULLable!

Happy coding,
Francesco Cogno