An Introduction to U-SQL in Azure Data Lake

Azure Data Lake Analytics is a Big Data analytics service. It differs from HDInsight principally in that you do not need to spin up a cluster to start submitting jobs. Essentially it runs as a service that is ready and waiting to process jobs.

While in HDInsight you might use Pig or Hive to query or transform data, in Azure Data Lake Analytics jobs are submitted using U-SQL

U-SQL is a powerful Big Data language that combines the use of SQL and C# to perform operations in a scalable manor.

This post show some of the capabilities of U-SQL to give a basic understanding of functionality and syntax.

Setup

An Azure subscription will be needed as well as Azure Data Lake Storage and Azure Data Lake Analytics (both currently in public preview)

All the examples use data that is freely available from https://data.police.uk/

For the examples two folders have been created /Demofolder and /DemoFolderResult. Demofolder holds the crime data from the police website and DemoFolderResult is the destination folder for transformations.

The tooling for Visual Studio may be found here

These examples are reading from the default root which is Azure Data Lake Storage (an HDFS instance running in Azure). U-SQL may also read from Blob storage with the use of the wasb:// syntax

Example 1 - Simple Copy - CSV to CSV

A very simple example to start. Load data into a variable and copy out to the result location a CSV

 @policefiles =
    EXTRACT CrimeID String,
Month String,
Reportedby String,
Fallswithin String,
Longitude String,
Latitude String,
Location String,
LSOAcode String,
LSOAname String,
Crimetype String,
Lastoutcomecategory String,
Context String
    FROM "/DemoFolder/2014-05-cumbria-street.csv"
    USING Extractors.Csv();

OUTPUT @policefiles 
    TO "/DemoFolderResult/output.csv"
USING Outputters.Csv();

Example 2 - Use of SQL and C# Functions

The following example demonstrates some standard functions. We are using aggregates with a GROUP BY clause and ROW_NUMBER() to give us an arbitrary ID column. We are also mixing SQL with C# by using ToUpper() and ToLower() and taking advantage of C#s ability to use regular expressions to search for the keyword 'Theft'.

This ability for U-SQL to mix and match SQL and C# makes it an incredibly powerful language for investigatory or transformation scenarios.

 DECLARE @From string = "/DemoFolder/2014-05-cumbria-street.csv";
DECLARE @To string = "/DemoFolderResult/outputtrans.csv";

@policefiles =
EXTRACT CrimeID String,
Month String,
Reportedby String,
Fallswithin String,
Longitude String,
Latitude String,
Location String,
LSOAcode String,
LSOAname String,
Crimetype String,
Lastoutcomecategory String,
Context String
    FROM @From
    USING Extractors.Csv();

@count =
    SELECT ROW_NUMBER() OVER () AS id,
           Crimetype.ToUpper() AS Crimetype,
           Lastoutcomecategory.ToLower() AS Lastoutcomecategory,
           System.Text.RegularExpressions.Regex.IsMatch(Crimetype, "Theft", System.Text.RegularExpressions.RegexOptions.IgnoreCase) AS RegExResult,
           countresult
    FROM
    (
        SELECT Crimetype,
               Lastoutcomecategory,
               COUNT(Crimetype) AS countresult
        FROM @policefiles AS A
        GROUP BY Crimetype,
                 Lastoutcomecategory
        ORDER BY Crimetype,
                 Lastoutcomecategory
        FETCH FIRST 10000 ROWS
    ) AS A;

OUTPUT @count
    TO @To
USING Outputters.Csv();

Example 3 - Scaling out to multiple files

The first two examples have only been looking at performing operations on one file. Azure Data Lake Analytics is scalable as you would expect any Big Data  solution to be. Operations are split up into tasks that can be performed across multiple containers. The number of containers to use in parallel may be chosen on job submission to scale out the computation.

This example demonstrates selecting multiple CSV files in a folder and counting the number of lines in each file.

Please note if you are running these examples in Visual Studio local mode this example currently will not yet run. It must be run against Azure Data Lake Analytics to succeed

 @policefiles =
    EXTRACT CrimeID String,
    Month String,
    Reportedby String,
    Fallswithin String,
    Longitude String,
    Latitude String,
    Location String,
    LSOAcode String,
    LSOAname String,
    Crimetype String,
    Lastoutcomecategory String,
    Context String,
    filename string     
    FROM "/DemoFolder/{filename:*}.csv"
    USING Extractors.Csv();

@count =
    SELECT filename, COUNT(A.CrimeID) AS countresult
    FROM @policefiles AS A
    GROUP BY filename;

OUTPUT @count
    TO "/DemoFolderResult/outputcount5.csv"
USING Outputters.Csv();

Example 4 - Custom C# Functions

This example demonstrates the use of custom C# functions in code. It uses C# to get the last word in the string and returns the result.

U-SQL
 @policefiles =
    EXTRACT CrimeID String,
    Month String,
    Reportedby String,
    Fallswithin String,
    Longitude String,
    Latitude String,
    Location String,
    LSOAcode String,
    LSOAname String,
    Crimetype String,
    Lastoutcomecategory String,
    Context String,
    filename string     
    FROM "/DemoFolder/{filename:*}.csv"
    USING Extractors.Csv();

@result =
    SELECT filename, CrimeID, Location, USQLApplication2.Udfs.splitLocations(Location) AS lsplit
    FROM @policefiles;

OUTPUT @result
    TO "/DemoFolderResult/outputresultCSharp.csv"
USING Outputters.Csv();
C#
 using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace USQLApplication2
{
    public class Udfs
    {
        public static string splitLocations(string location)
        {
            string res;
            if (location.Trim().LastIndexOf(" ") > 0)
            {
                res = location.Substring(location.Trim().LastIndexOf(" "));
            }
            else
            {
                res = "no last word";
            }
            return res;
        }
    }
}

Example 5 - Custom Extractor

This example demonstrates using a custom extractor to show the extensibility of U-SQL and its ability to deal with virtually any style of data.
The extractor is simply performing a split on each row using a comma and filtering out the data. In a real world scenario a custom extractor may use a regular expression to perform more complicated extraction of key data from unstructured sources.

U-SQL
 DECLARE @FileLocation = "/DemoFolder/2014-05-cumbria-street.csv";

@policefiles =
    EXTRACT CrimeID String,
    Crimetype String  
    FROM @FileLocation
    USING new USQLApplication2.MyExtractor();

OUTPUT @policefiles
    TO "/DemoFolderResult/outputresultCSharp11.csv"
USING Outputters.Csv();
C#
 using Microsoft.Analytics.Interfaces;
using Microsoft.Analytics.Types.Sql;
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace USQLApplication2
{
    [SqlUserDefinedExtractor]
    public class MyExtractor : IExtractor
    {
        public override IEnumerable Extract(IUnstructuredReader input, IUpdatableRow outputrow)
        {
            char col_delim = ',';
            string line;
            var reader = new StreamReader(input.BaseStream);
            while ((line = reader.ReadLine()) != null)
            {
                var tokens = line.Split(col_delim);
                outputrow.Set("CrimeID", tokens[0]);
                outputrow.Set("Crimetype", tokens[9]);
                yield return outputrow.AsReadOnly();
            }
        }
    }
}

These basic examples should help get started with U-SQL.

Following on from this consider encapsulating the code by placing C# code into its own project, and place U-SQL code into functions.

Look at using Databases, Views and Tables in Azure Data Lake Analytics to store data in a different format from the raw data, for later reuse.

Consider compression on file types to save on storage and IO.