Connecting Azure Data Lake Analytics/Storage with Azure HDInsight Clusters

Security enhancements to Azure HDInsight were recently announced that secure cluster access to Azure Data Lake Store (ADLS) and Azure Data Lake Analytics (ADLA). This option lets data produced by an Azure Data Lake Analytics job be accessed and processed further by Hadoop batch jobs on HDInsight.

In this Team Data Science Process sample, we analyzed around 20 GB of NYC taxi trip data with an Azure Data Lake Analytics job, stored the results in ADLS (Data Lake Store), and managed the output through an HDInsight Hive activity before building a multiclass classification model with Azure Machine Learning.

More details on provisioning a secure Azure HDInsight cluster with Azure Data Lake can be found in this blog.

The ADLA job after processing looks like this.

[Image: adla_job]

The output of the job is stored in Azure Data Lake Store and is then processed further by a Hive activity in HDInsight. While provisioning the Azure HDInsight cluster, make sure to enable 'Cluster AAD identity' using the ADLS account details.

[Image: adls-identity]

When configuring 'Cluster AAD identity' during HDInsight cluster provisioning, you need to provide the ADLS access details along with the AD service principal details, as in the following screenshot.

[Image: provision_hdi]
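
Once the cluster identity has access to the store, data in ADLS is addressed from HDInsight through an adl:// URI. Here is a minimal sketch in Python of building such a URI for the ADLA job output. The helper name, the account name and the folder path are hypothetical placeholders of ours, not values taken from this sample.

```python
def adl_uri(account: str, path: str) -> str:
    """Build an adl:// URI for a path inside an Azure Data Lake Store account."""
    if not account:
        raise ValueError("account name must not be empty")
    # Strip leading slashes so there is exactly one slash between host and path.
    return f"adl://{account}.azuredatalakestore.net/{path.lstrip('/')}"


# Hypothetical account and output folder, for illustration only.
output_location = adl_uri("myadlsaccount", "/nyctaxi/output/")
print(output_location)
```

Keeping the URI construction in one small function makes it easy to reuse the same location string wherever the job output is referenced.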

Once the external Hive table was created over the underlying NYC taxi trip and fare data in Azure Data Lake Store, we built Azure ML models with classification, multiclass classification, and regression algorithms.

[Image: hive]
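
As a rough illustration of the external table step, the Python sketch below assembles a HiveQL CREATE EXTERNAL TABLE statement whose LOCATION points at a folder in the Data Lake Store. The table name, the column list, the account name and the path are all assumptions made for illustration; the actual schema used in this sample is not reproduced here.

```python
from typing import Sequence, Tuple


def external_table_ddl(
    table: str,
    columns: Sequence[Tuple[str, str]],
    location: str,
    delimiter: str = ",",
) -> str:
    """Return a HiveQL statement for an external, delimited text table."""
    if not columns:
        raise ValueError("at least one column is required")
    # One "name TYPE" entry per line, separated by commas.
    cols = ",\n".join(f"  {name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{cols}\n"
        f")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table, columns and ADLS location, for illustration only.
ddl = external_table_ddl(
    "nyctaxi_trip",
    [
        ("medallion", "STRING"),
        ("pickup_datetime", "STRING"),
        ("passenger_count", "INT"),
        ("trip_distance", "DOUBLE"),
    ],
    "adl://myadlsaccount.azuredatalakestore.net/nyctaxi/output/",
)
print(ddl)
```

Because the table is external, dropping it in Hive removes only the table metadata and leaves the underlying files in the store untouched.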

The final Azure ML dashboard for the NYC taxi trip tip calculation looks like this.

[Image: nyctaxi_trip]