Generate Unique Identifiers (UID) in U-SQL on Azure Data Lake Analytics with Python extension scripts

U-SQL does not provide a construct for generating unique identifiers for the rows of a text file. The script below generates a unique identifier for every row in the input file.

The steps are:

  1. Extract the data file with the EXTRACT statement.
  2. Reducers are spun up based on the customer code (REDUCE ... ON CustomerCode). Too few or too many reducers can both cause performance issues. Pick a column that splits the data fairly evenly, but do not use a unique column, since that would create one group per row.
  3. For every reduced data set, the Python script is invoked with that data as a data frame. The script adds a column named "sguid" to the data frame and fills it with a newly generated, encoded UID for each row.
  4. The output of the reducer contains the new sguid column.
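Whether a column splits the data fairly, as step 2 asks, can be checked on a sample before running the job. The following is a minimal sketch using only the Python standard library. The `key_skew` helper and the sample values are hypothetical and stand in for the real CustomerCode column.

```python
from collections import Counter


def key_skew(values):
    # Count rows per key, then compare the largest group with the average.
    counts = Counter(values)
    largest = max(counts.values())
    average = len(values) / len(counts)
    return len(counts), largest, largest / average


# Hypothetical CustomerCode values standing in for the real column.
customer_codes = ['C01', 'C01', 'C01', 'C02', 'C02', 'C03']

groups, largest, ratio = key_skew(customer_codes)
print(f'{groups} groups, largest has {largest} rows, {ratio:.1f}x the average')
```

A ratio close to 1 suggests the key splits the rows evenly. If the number of groups equals the number of rows, the column is unique and is a poor choice for the reducer key.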

 

REFERENCE ASSEMBLY [ExtPython];

    

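DECLARE @ReduceScript = @"
import uuid
import base64

def usqlml_main(df):
    # Add an sguid column holding one URL-safe, base64-encoded UUID per row.
    # decode('ascii') yields plain text instead of the b'...' form that str() gives on Python 3.
    df['sguid'] = ''
    df['sguid'] = df.sguid.apply(lambda _: base64.urlsafe_b64encode(uuid.uuid1().bytes).decode('ascii'))
    return df
";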

@AllData =
    EXTRACT OrderNo      string,
            Date         string,
            CustomerCode string,
            ProductCode  string,
            SalesArea    string,
            OrderValue   string
    FROM "/DataLoads/Input/TempFile.csv"
    USING Extractors.Text(delimiter: ',', skipFirstNRows: 1);

@ReducedData =
    REDUCE @AllData
    ON CustomerCode
    PRODUCE sguid        string,
            OrderNo      string,
            Date         string,
            CustomerCode string,
            ProductCode  string,
            SalesArea    string,
            OrderValue   string
    USING new Extension.Python.Reducer(pyScript:@ReduceScript);

OUTPUT @ReducedData
TO "/DataLoads/CSVOutputwithGUID.txt"
USING Outputters.Text();
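
The reducer logic can be dry-run on a workstation before a job is submitted. The sketch below is only an approximation, assuming pandas is installed: it calls a `usqlml_main`-style function once per CustomerCode group, roughly as the reducer would, and then checks that each sguid decodes back to a standard UUID. The sample rows and the `run_reducer_locally` and `decode_uid` helpers are hypothetical and not part of the U-SQL Python extension. The function here stores the identifier as plain text via `.decode('ascii')`, so the value carries no `b'...'` wrapper.

```python
import base64
import uuid

import pandas as pd


def usqlml_main(df):
    # One time-based UUID per row, stored as URL-safe base64 text.
    df = df.copy()
    df['sguid'] = [
        base64.urlsafe_b64encode(uuid.uuid1().bytes).decode('ascii')
        for _ in range(len(df))
    ]
    return df


def run_reducer_locally(data, key):
    # Hypothetical helper: call the function once per group, as
    # REDUCE ... ON key would, and stitch the results back together.
    parts = [usqlml_main(group) for _, group in data.groupby(key)]
    return pd.concat(parts, ignore_index=True)


def decode_uid(sguid):
    # Hypothetical helper: reverse the encoding to recover the UUID.
    return uuid.UUID(bytes=base64.urlsafe_b64decode(sguid))


# Made-up rows using three of the columns from the EXTRACT statement.
sample = pd.DataFrame({
    'OrderNo': ['1001', '1002', '1003', '1004'],
    'CustomerCode': ['C01', 'C01', 'C02', 'C03'],
    'OrderValue': ['250.00', '99.90', '12.50', '480.00'],
})

result = run_reducer_locally(sample, 'CustomerCode')
decoded = [decode_uid(s) for s in result['sguid']]
print(result[['sguid', 'OrderNo', 'CustomerCode']])
```

Each encoded value is 24 characters long, because the 16 bytes of a UUID become 24 base64 characters including padding.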

   
 

Note: The script requires the U-SQL extensions to be enabled on your ADL-A account.