Last week at Microsoft’s Connect 2016 conference, we announced the General Availability of Azure Data Lake Analytics. As part of the announcement we revealed that U-SQL now includes built-in support for Advanced Analytics scenarios. This includes:
- The ability to perform massively distributed analytics using Python
- The ability to perform massively distributed analytics using R
- Built-in Cognitive capabilities (such as image object detection, sentiment analysis, etc.)
In this post we’ll give a brief overview of the Python support. We’ll publish additional blog posts that cover R and the Cognitive scenarios later this week.
Below is a simple “Hello World” that illustrates how to use Python with U-SQL. It is the simplest script that demonstrates running Python on vertices using a special built-in Python Reducer. The script shows the key steps:
- using REFERENCE ASSEMBLY to bring in the needed Python support
- using REDUCE to partition the input data on a key
- using a built-in reducer (Extension.Python.Reducer) that runs Python code on each vertex assigned to the reducer
- embedding Python code in the U-SQL script that accepts a pandas DataFrame as input and returns a pandas DataFrame as output
To learn more about our support for U-SQL Advanced Analytics and how to enable it in your Data Lake Analytics Accounts, see our Getting Started guide.
REFERENCE ASSEMBLY [ExtPython];

DECLARE @myScript = @"
def get_mentions(tweet):
    # Keep each token that starts with '@' and drop the leading '@'
    return ';'.join( ( w[1:] for w in tweet.split() if w.startswith('@') ) )

def usqlml_main(df):
    del df['time']
    del df['author']
    df['mentions'] = df.tweet.apply(get_mentions)
    del df['tweet']
    return df
";

@t =
    SELECT * FROM
        (VALUES
            ("D1","T1","A1","@foo Hello World @bar"),
            ("D2","T2","A2","@baz Hello World @beer")
        ) AS
            D( date, time, author, tweet );

@m =
    REDUCE @t ON date
    PRODUCE date string, mentions string
    USING new Extension.Python.Reducer(pyScript:@myScript);

OUTPUT @m
    TO "/tweetmentions.csv"
    USING Outputters.Csv();
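To try the embedded Python logic before submitting a job, it can help to run it locally with plain pandas. The sketch below is only an approximation under stated assumptions: `simulate_reduce` is a hypothetical helper, not part of U-SQL, and it assumes the function is called once per key group, which simplifies how the real reducer distributes rows across vertices. It selects mention tokens with a `startswith('@')` check.

```python
import pandas as pd


def get_mentions(tweet):
    # Keep every whitespace-separated token that starts with '@', minus the '@'.
    return ';'.join(w[1:] for w in tweet.split() if w.startswith('@'))


def usqlml_main(df):
    # Same contract as the embedded script: pandas DataFrame in, DataFrame out.
    del df['time']
    del df['author']
    df['mentions'] = df.tweet.apply(get_mentions)
    del df['tweet']
    return df


def simulate_reduce(df, key):
    # Hypothetical local stand-in for REDUCE ... ON key (an assumption, not the
    # real engine): call usqlml_main once per key group, then stitch the results.
    parts = [usqlml_main(group.copy()) for _, group in df.groupby(key, sort=True)]
    return pd.concat(parts, ignore_index=True)


# The same two rows that the U-SQL script builds with VALUES.
rows = [
    ("D1", "T1", "A1", "@foo Hello World @bar"),
    ("D2", "T2", "A2", "@baz Hello World @beer"),
]
t = pd.DataFrame(rows, columns=["date", "time", "author", "tweet"])

m = simulate_reduce(t, "date")
print(m.to_csv(index=False, header=False))
```

The resulting frame has only the `date` and `mentions` columns, matching the `PRODUCE date string, mentions string` clause, so a mismatch between the returned columns and the declared schema shows up locally rather than at job run time.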