U-SQL Advanced Analytics: Introducing Python Extensions for U-SQL


Last week at Microsoft's Connect 2016 conference, we announced the General Availability of Azure Data Lake Analytics. As part of the announcement we revealed that U-SQL now includes built-in support for Advanced Analytics scenarios. This includes:

  • The ability to perform massively distributed analytics using Python
  • The ability to perform massively distributed analytics using R
  • Built-in Cognitive capabilities (such as image object detection, sentiment analysis, etc.)

In this post we'll give a brief overview of the Python support. We'll publish additional blog posts that cover R and the Cognitive scenarios later this week. Below is a simple "Hello World" script that shows how to run Python code on vertices using a special built-in Python Reducer. The script shows the key steps:

  1. using REFERENCE ASSEMBLY to bring in the needed Python support
  2. using REDUCE to partition the input data on a key
  3. using a built-in reducer (Extension.Python.Reducer) that runs Python code on each vertex assigned to the reducer
  4. embedding Python code in the U-SQL script: a function named usqlml_main that accepts a pandas DataFrame as input and returns a pandas DataFrame as output

To learn more about our support for U-SQL Advanced Analytics and how to enable it in your Data Lake Analytics Accounts, see our Getting Started guide.

REFERENCE ASSEMBLY [ExtPython];

DECLARE @myScript = @"
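# Return the @-mentions in a tweet, without the leading '@', joined by ';'
def get_mentions(tweet):
    return ';'.join( ( w[1:] for w in tweet.split() if w[0]=='@' ) )

# Entry point: receives the input rows as a pandas DataFrame
def usqlml_main(df):
    del df['time']
    del df['author']
    df['mentions'] = df.tweet.apply(get_mentions)
    del df['tweet']
    # Only date and mentions remain, matching the PRODUCE clause
    return df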
";

@t =
    SELECT * FROM
        (VALUES
            ("D1","T1","A1","@foo Hello World @bar"),
            ("D2","T2","A2","@baz Hello World @beer")
        ) AS D( date, time, author, tweet );

@m  =
    REDUCE @t ON date
    PRODUCE date string, mentions string
    USING new Extension.Python.Reducer(pyScript:@myScript);


OUTPUT @m
  TO "/tweetmentions.csv"
  USING Outputters.Csv();
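
If you want to check the embedded Python before submitting a job, you can run the same two functions locally with pandas. The sketch below is only an approximation: it assumes pandas is installed on your machine, and it uses a pandas groupby on date as a rough stand-in for REDUCE @t ON date. How the service actually distributes the groups across vertices is not modeled here.

```python
import pandas as pd


def get_mentions(tweet):
    # Collect every @-prefixed word, drop the "@", and join with ";"
    return ';'.join(w[1:] for w in tweet.split() if w[0] == '@')


def usqlml_main(df):
    # Same body as the script embedded in the U-SQL job
    del df['time']
    del df['author']
    df['mentions'] = df.tweet.apply(get_mentions)
    del df['tweet']
    return df


# Same sample rows as the VALUES clause in the U-SQL script
rows = pd.DataFrame(
    [("D1", "T1", "A1", "@foo Hello World @bar"),
     ("D2", "T2", "A2", "@baz Hello World @beer")],
    columns=["date", "time", "author", "tweet"],
)

# Local stand-in (an assumption, not the service's implementation):
# call usqlml_main once per distinct value of the reduce key
parts = [usqlml_main(group.copy()) for _, group in rows.groupby("date")]
result = pd.concat(parts, ignore_index=True)

print(result)
```

The result has the two columns named in the PRODUCE clause, date and mentions, with "foo;bar" for D1 and "baz;beer" for D2.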

 


Comments (5)

  1. Jonathan says:

    Hi,
    how do I use numpy in this context in terms of loading modules?
    Do I have to load numpy with an import statement?

    1. Saveen Reddy says:

      Yes, your script has to explicitly "import numpy"

      1. Robert Alexander says:

        I guess I would be correct in assuming that "gensim" and "nltk" would have to be similarly imported.

  2. Uli says:

    Hi,
    I only get the message "Assembly master.ExtPython" does not exist.
    regards,
    Uli

    1. Uli says:

      Hi,
      I just found that the Python assemblies are copied as part of the U-SQL extensions. See the "Getting Started guide" above.

      Uli
