Disclaimer: The views expressed on this website/blog are mine alone and do not reflect the views of my company. All postings on this blog are provided “AS IS” with no warranties and confer no rights.

Azure Data Factory V2 – Pipeline Functional Testing

November 18, 2018

This article helps you to understand functional testing with pipelines in Azure Data Factory.

Note:

This article applies to Azure Data Factory V2.

This article assumes that you have gone through Introduction to Azure Data Factory and the QuickStart Tutorial.

Credits – Thanks to the ADFv2 Product Group (Anudeep Sharma, Hermine Vermishyan & Abhishek Narain) for reviewing this article and for all the assistance with ADFv2, and to my partner in awesomeness, David Kuo, for testing the script.

Overview

During the development phase of an Azure Data Factory pipeline, the UX provides debugging functionality that you can use to run a pipeline and find out whether anything is failing for reasons such as an incorrect connection string, an object missing from the database, a wrong query, and so on.

Keeping the same idea in mind, I have extended this debugging functionality from the pipelines' development stage to the release stage: a PowerShell script performs functional testing of pipelines in a Test, Integration, or UAT environment, ensuring that issues are caught there instead of in the production environment. In this article series I will share the process of achieving this with a PowerShell script, and the good news is that you can use the same script in your environment without making any code changes.

Concept

Functional testing with ADF validates that a pipeline works; if anything related to functionality fails, you stop the release, fix the issue, and iterate. It is closely aligned with build verification, ensuring that objects required by a pipeline, such as linked services, tables, and storage, are successfully validated before releasing to the production environment.

Scope

Functional testing verifies that pipelines work: that databases, servers, storage, and similar resources are configured correctly and that a pipeline does not fail due to misconfiguration, for example a database that does not exist. Errors due to data quality, performance, or anything else related to data sanctity are outside the scope of functional testing.

How it Works

Let's go into the details of how functional testing works with ADFv2.

First of all, you need to understand how pipeline debugging works, and the good news is that there is a very informative video by Gaurav Malhotra from the ADF team (highly recommended if you want insight into pipeline debugging): Iterative development and debugging using Data Factory.

As the video shows, debugging requires running a pipeline: you supply the parameters defined for the pipeline, run it in debug mode, and get an output showing whether it succeeded or failed. Keeping the same approach in mind, I implemented this functionality with the following flow:

Functional Testing Flow: –

[Flow diagram: AAD authentication → get pipeline list → check exclusions → load test file → run pipeline → poll run status → pass/fail against timeout]

Now let's look at each step in this flow:

1. Azure Active Directory (AAD) Authentication – AAD is required for silent authentication of the PowerShell script used to automate the testing. Because the script needs to access Azure Data Factory to get the list of pipelines, we need to ensure the Application ID also has access to the data factory. There are a few handy articles you can follow for creating an Application ID:

a. How to create an AAD application and service principal

b. How to give the Application ID access to Azure Data Factory

i. Go to Azure Data Factory in the Azure portal

ii. Under All settings, select Users and click Add.


iii. Clicking Add opens another window that lets you add the Application ID you created in step 1(a). Since we only need to get the list of pipelines, we make the application (named FunctionalTesting in this example) a member of the Reader role.


iv. In PowerShell we leverage the Application ID for silent authentication, and it is recommended to use Key Vault for pulling secrets; in this case we need the Application ID secret (password). I have parameterized the secret field in the script, so you can either pass it as plain text or make a call to Key Vault during the release process and pass the secret along. The authentication is done silently using the Application ID, which is also known as a service principal; you can refer to this article for how to get the Tenant ID.

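The full code is in the Git repository linked under Code Location; a minimal sketch of this silent-authentication step, assuming the AzureRM module and using the parameter names from the script invocation shown in step i below, looks like this:

# Build a credential object from the Application ID (service principal) and its secret
$securePassword = ConvertTo-SecureString $applicationSecret -AsPlainText -Force
$credential = New-Object System.Management.Automation.PSCredential ($applicationID, $securePassword)

# Silent (non-interactive) login as the service principal, then select the subscription
Login-AzureRmAccount -ServicePrincipal -Credential $credential -TenantId $tenantID
Select-AzureRmSubscription -SubscriptionId $azureSubscription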

2. Functional Testing Logic –

a. Get the list of pipelines from ADF

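A minimal sketch of this step, assuming the AzureRM.DataFactoryV2 module:

# Get all pipelines in the target data factory
$pipelines = Get-AzureRmDataFactoryV2Pipeline -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName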

b. Check for Pipeline Exclusion


Example of an exclusion file:

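A minimal sketch of the exclusion check, assuming the exclusion file is a simple JSON array of pipeline names (the file name and format shown here are illustrative; the actual script may differ):

# ExcludedPipelines.json (assumed format): ["pl-archive", "pl-adhocLoad"]
$excludedPipelines = Get-Content -Path (Join-Path $excludedPipelinesFilePath "ExcludedPipelines.json") -Raw | ConvertFrom-Json

foreach ($pipeline in $pipelines)
{
    if ($excludedPipelines -contains $pipeline.Name)
    {
        Write-Host "Skipping excluded pipeline $($pipeline.Name)"
        continue
    }
    # ... run the functional test for this pipeline ...
}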

c. Pipeline Test file – The following is an example of a test file. In my case I have a pipeline named pl-testValidation in the data factory, and that pipeline expects a parameter called SubFolder, so I need to create a parameter file for that pipeline, which the testing logic uses as the functional test file. The test file name must match the pipeline name and must contain all parameters declared in the pipeline; if this file is missing, the pipeline is marked as failed with a missing test file.

Because the pipeline name is pl-testValidation, I created the test file as pl-testValidation.json (PipelineName.json) with the parameter consumed by the pipeline and a TestTimeout value, since I want this test to run for up to 80 seconds; if this value is missing, it defaults to 30 seconds.

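A sketch of what pl-testValidation.json could contain, written here as a PowerShell here-string that creates the file (the exact schema is an assumption; only SubFolder and TestTimeout come from the description above):

# Create pl-testValidation.json: the pipeline parameter plus the TestTimeout value in seconds
@'
{
    "SubFolder": "input/2018/11",
    "TestTimeout": 80
}
'@ | Set-Content -Path ".\pl-testValidation.json"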

d. In the following step we iterate through the pipelines, check whether PipelineName.json exists, and check the TestTimeout value, defaulting it to 30 seconds if it is missing.

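A minimal sketch of that iteration (variable names are illustrative; $functionalTestJsonFilePath is the script parameter shown in step i):

foreach ($pipeline in $pipelines)
{
    $testFile = Join-Path $functionalTestJsonFilePath "$($pipeline.Name).json"

    if (-not (Test-Path $testFile))
    {
        Write-Host "FAILED: $($pipeline.Name) - missing test file $testFile"
        continue
    }

    $testDefinition = Get-Content -Path $testFile -Raw | ConvertFrom-Json

    # Default the test timeout to 30 seconds when the file does not supply one
    $testTimeout = 30
    if ($testDefinition.TestTimeout) { $testTimeout = $testDefinition.TestTimeout }

    # ... run the pipeline with the parameters from the test file (next step) ...
}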

e. Run the pipeline test using the following command

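A minimal sketch of this step, assuming the pipeline parameters are passed as a hashtable built from the test file (the original script may pass them differently):

# Build the parameter hashtable from the test file, leaving out the TestTimeout entry
$parameters = @{}
$testDefinition.PSObject.Properties |
    Where-Object { $_.Name -ne 'TestTimeout' } |
    ForEach-Object { $parameters[$_.Name] = $_.Value }

# Trigger the pipeline run and capture the run ID
$runId = Invoke-AzureRmDataFactoryV2Pipeline -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -PipelineName $pipeline.Name -Parameter $parameters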

f. Use the value of $runId to get the current status (refer to the PowerShell script on Git for the full code)


g. Use the value of $runId to get the final status (refer to the PowerShell script on Git for the full code)

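A minimal sketch of the status check for steps f and g, using Get-AzureRmDataFactoryV2PipelineRun:

# Retrieve the current status of the run that was just triggered
$run = Get-AzureRmDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -PipelineRunId $runId
Write-Host "Pipeline $($pipeline.Name) run status: $($run.Status)"   # e.g. InProgress, Succeeded or Failed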

h. Fail the test if it exceeds the timeout setting

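One way this timeout logic could look (a sketch; the exact loop in the script may differ):

# Poll the run until it finishes or the per-pipeline TestTimeout is exceeded
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()

while ($run.Status -eq 'InProgress')
{
    if ($stopwatch.Elapsed.TotalSeconds -gt $testTimeout)
    {
        Write-Host "FAILED: $($pipeline.Name) did not complete within $testTimeout seconds"
        break
    }
    Start-Sleep -Seconds 5
    $run = Get-AzureRmDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -PipelineRunId $runId
}

if ($run.Status -eq 'Succeeded') { Write-Host "PASSED: $($pipeline.Name)" }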

i. If you are integrated with Release Manager, the script generates an exception in case of failure, resulting in a failed release and an output listing the failed pipelines that need to be fixed before deploying to the production environment. But don't be disheartened if you are not using Release Manager; you can still run the script with the following command:

.\AzureDatafactoryFunctionalTesting.ps1 -dataFactoryName "Your Factory Name" -resourceGroupName "Factory Resource Group Name" -azureSubscription "Subscription ID" -applicationSecret "Application Secret from Point 1" -applicationID "Application ID from Point 1" -tenantID "Tenant ID from Point 1" -excludedPipelinesFilePath ".\" -functionalTestJsonFilePath ".\"

Conclusion

As you have seen in this article, I have gone step by step through achieving functional testing with ADFv2. In a subsequent article I will discuss how to integrate this with Release Manager, so that you can make this testing framework a part of the pipeline release process to production.

Code Location

Git – feel free to download and contribute


ADF – Azure Data Factory Learning Path

If you are looking to learn Azure Data Factory (ADF), this is a good path to start with, as it covers the various components of ADF, including all kinds of supported flows, with code samples.

I would suggest starting with the Portal path, as it requires no coding experience. If you come from a DevOps or DBA background you can follow the PowerShell path, and if you come from a pure development background you can use the APIs provided by ADF through NuGet.

Here is the link to the learning path:
https://azure.microsoft.com/en-us/documentation/learning-paths/data-factory/

The following link shows how the Microsoft Studios (Xbox) team uses Azure Data Factory:
https://channel9.msdn.com/Blogs/raw-tech/Halo-Minecraft-and-More-How-Microsoft-Studios-Processes-Gaming-Event-Data

Happy Learning


HDInsight: How to find Long Running Queries

January 25, 2017

Coming from a SQL background, you depend on SQL Profiler traces to understand what's going on in SQL Server: who is running a non-optimized long query, where it is scanning all partitions, and who is filtering on non-indexed columns. So, coming to the HDInsight world, you may wonder whether there is something like SQL Profiler for tracking bad jobs. Yes, there is, as part of the Apache Ambari project. As per the wiki, the whole objective of Ambari is to make Hadoop management easy by providing a mechanism for managing and monitoring Hadoop clusters through its easy-to-use UI. In this post I will show how to use Ambari to find long-running or resource-intensive jobs.

Objective: how do you find a long-running or resource-intensive Hive query?
Steps:-


  • Click YARN > Quick Links > Active Head Node > Resource Manager UI


Or, if you don't want to go through the whole hassle, just browse to https://<HDInsightClusterName>.azurehdinsight.net/yarnui/hn/cluster and provide the admin user name and password in the popup. This opens the Resource Manager UI, which gives you information about all jobs (running, pending, finished, etc.) on the HDInsight cluster.
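If you prefer to script this instead of clicking through the UI, the standard YARN Resource Manager REST API (/ws/v1/cluster/apps) exposes the same job information. A rough PowerShell sketch follows; the gateway path used below is an assumption (adjust it to however your cluster exposes the Resource Manager endpoint):

# List running applications, longest running first, with their memory usage
$cluster = "<HDInsightClusterName>"
$credential = Get-Credential -Message "HDInsight cluster admin login"

$apps = Invoke-RestMethod -Uri "https://$cluster.azurehdinsight.net/yarnui/ws/v1/cluster/apps?states=RUNNING" -Credential $credential

$apps.apps.app |
    Sort-Object elapsedTime -Descending |
    Select-Object id, name, queue, allocatedMB, @{n='ElapsedMinutes'; e={[math]::Round($_.elapsedTime / 60000)}}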

  • The Resource Manager UI gives you multiple options for inspecting jobs; in this case I have a specific job that is consuming:
    • A – Memory: 2.2 TB
    • B – % of Queue: 64%
    • C – % of Cluster: 60%


  • In this case the filtered job may be the problematic query that is causing other jobs to wait in the queue, as it is consuming 60% of the cluster resources and 2.2 TB of the total 3.75 TB allocated to this cluster. I can look further into the job by clicking its job ID (Job ID > Logs, then pull the query from the DAG) and check what the query looks like and what it is doing under the hood: is it scanning TBs of partitions, filtering on non-partitioned fields, or something else?


 

  • As in SQL, you can kill this job from the Resource Manager itself. Click the job, and at the top you will find a button for killing the query; for example, this job has been running for the last 7 hours, and you can kill it by clicking Kill Application.


 

Hope this comes in handy.

Feel free to add comments; maybe in the next post I can take a deep dive into each log and action.

Happy Learning.

(Published on my blogs https://karanspeaks.com/ & https://blogs.msdn.microsoft.com/)


SSAS Monitoring Tool–SQL 2014 ASTrace

November 30, 2016

 

Refer to SSAS Monitoring Tool – ASTrace to learn more about ASTrace. This post was pending for a long time, as I got tons of emails and comments about ASTrace for SQL 2014 and 2016. With this blog post I'd like to mention that I have shared the source code on Git, so anyone can contribute to it.

 

The code for ASTrace 2014 is tested with the latest build of SQL Server 2014. For 2016 I tested with the SQL 2016 Preview and didn't get a chance to test with RTM, so in the coming few days I will test and update the code if required; feel free to update it yourself if you like, as it's on Git and anyone can contribute.

 

Here is Git Link – https://github.com/karanspeaks/SQL-Server-Analysis-Services-Samples



Halo, Minecraft, and More! How Microsoft Studios Processes Gaming Event Data

February 18, 2016

My first video on Channel 9: it talks about how we handle the various events generated from games.

https://channel9.msdn.com/Blogs/raw-tech/Halo-Minecraft-and-More-How-Microsoft-Studios-Processes-Gaming-Event-Data/player


Azure Data Factory: Detecting and Re-Running failed ADF Slices

November 13, 2015

 

Recently I came across a scenario where I needed to detect failed slices of all datasets in an Azure Data Factory. In my case I needed to check the last 3 months, and the number of slices was around 600+; they had failed with validation errors because the source data wasn't present, and after a number of retries the slices were marked as failed.

In such cases it's difficult to perform a re-run from the portal, as you need to right-click each slice and run it explicitly.

Solution: I wrote the following PowerShell script, which detects all failed slices in a given Azure Data Factory and re-runs them with your consent.

You can use the same script not only for failed slices but for any status; you just need to change the dataset status in the slice filtering, shown in the following script.

I am also planning to write a solution that will run as a service in a worker role, automatically detect failed slices within a given time window, and re-run them.

You might ask: ADF already has re-run logic, so why go through the hassle of writing and running a script?

Yes, it does, but after x number of retries a slice is marked as failed, and the only way to run it again is through the portal or programmatically.

So, here is my contribution to ADF Community.

Pre-requisite – Azure Resource Manager PowerShell (https://azure.microsoft.com/en-us/blog/azps-1-0-pre/)

Copy the following code into a text file and save it as file.ps1

You can also download the script and save it as a .ps1 file: https://karanspeaks.com/wp-content/uploads/2015/11/re-run-failed-slices-ps11.doc

#Begin Script

Login-AzureRmAccount
$slices= @()
$tableName=@()
$failedSlices= @()
$failedSlicesCount= @()
$tableNames=@()

$Subscription="Provide Subscription ID" 
 
  Select-AzureRMSubscription -SubscriptionId  $Subscription   
$DataFactoryName="Provide Data Factory Name"
$resourceGroupName ="Porvide Resource Group Name for Data Factory"
 
$startDateTime ="2015-05-01" #Start Date for Slices
$endDateTime="2015-08-01" # End Date for Slices

#Get dataset names in the data factory. You can explicitly set the $tableName variable if you want to run for an individual dataset only.
$tableNames = Get-AzureRMDataFactoryDataset -DataFactoryName $DataFactoryName -ResourceGroupName $resourceGroupName | ForEach {$_.DatasetName}

$tableNames #lists tablenames

foreach ($tableName in $tableNames)
{
    $slices += Get-AzureRMDataFactorySlice -DataFactoryName $DataFactoryName -DatasetName $tableName -StartDateTime $startDateTime -EndDateTime $endDateTime -ResourceGroupName $resourceGroupName -ErrorAction Stop
}

$failedSlices = $slices | Where {$_.Status -eq 'Failed'}

$failedSlicesCount = @($failedSlices).Count

if ( $failedSlicesCount -gt 0 )
{

    write-host "Total number of slices Failed:$failedSlicesCount"
    $Prompt = Read-host "Do you want to Rerun these failed slices? (Y | N)"
    if ( $Prompt -eq "Y" -Or $Prompt -eq "y" )
    {

        foreach ($failed in $failedSlices)
        {
               write-host "Rerunning slice of Dataset "$($failed.DatasetName)" with StartDateTime "$($failed.Start)" and EndDateTime "$($failed.End)""
            Set-AzureRMDataFactorySliceStatus -UpdateType UpstreamInPipeline -Status Waiting -DataFactoryName $($failed.DataFactoryName) -DatasetName $($failed.DatasetName) -ResourceGroupName $resourceGroupName -StartDateTime "$($failed.Start)" -EndDateTime "$($failed.End)"
            $failed.DatasetName

        }
    }
           
}
else
{
    write-host "There are no Failed slices in the given time period."
}

#End Script


Microsoft Azure Essentials: Azure Machine Learning

April 16, 2015

 

The free book Microsoft Azure Essentials: Azure Machine Learning is pretty informative if you want to learn about ML.

The link is as follows:

Download the PDF (5.84 MB)

Additionally, I recommend going through the free course from the University of Washington, available through Coursera.

It is a step-by-step course that goes deep into machine learning.

The link is as follows:

https://www.coursera.org/course/machlearning


Informative HDInsight Links

January 20, 2015

PDF document download link: click here

I have tried to collate informative articles related to HDInsight.

In the comments section, please add URLs of articles or videos that you came across and would like me to add to this list.

The links are as follows:

Get started with HDInsight

Get started using Hadoop with Hive in HDInsight to analyze mobile handset use

Get started with HDinsight Emulator

HDInsight Storm overview

Getting started using Storm with HDInsight (Hadoop)

Here are the Hadoop technologies in HDInsight:

Ambari: Cluster provisioning, management, and monitoring

Avro (Microsoft .NET Library for Avro): Data serialization for the Microsoft .NET environment

HBase: Non-relational database for very large tables

HDFS: Hadoop Distributed File System

Hive: SQL-like querying

Mahout: Machine learning

MapReduce and YARN: Distributed processing and resource management

Oozie: Workflow management

Pig: Simpler scripting for MapReduce transformations

Sqoop: Data import and export

Storm: Real-time processing of fast, large data streams

Zookeeper: Coordinates processes in distributed systems

Other popular Videos

Channel9 video: Scenarios for HDInsight

Channel9 video: Get started with Azure HDInsight with Matt Winkler

Books

Introducing Microsoft Azure HDInsight
In Introducing Microsoft Azure HDInsight, we cover what big data really means, how you can use it to your advantage in your company or organization, and one of the services you can use to do that quickly—specifically, Microsoft’s HDInsight service. We start with an overview of big data and Hadoop, but we don’t emphasize only concepts in this book—we want you to jump in and get your hands dirty working with HDInsight in a practical way.

Download the PDF (6.37 MB)
