The boundaries between on-premise and cloud-born data continue to blur as more and more organizations move to hybrid data landscapes. The blurring of these lines introduces a number of complexities, from data locality to scale to the diversity of the data we must consume and process. To meet these challenges head-on, Microsoft introduced a new managed cloud service known as Azure Data Factory (ADF).
Four Pillars of Azure Data Factory
So what is Azure Data Factory? That question is best answered by describing the four (current) pillars of ADF.
- Moving on-premise data to the cloud (and cloud data to on-premise)
One of the primary use cases of Azure Data Factory is as a pipeline to move or copy data across the on-premise/cloud boundary. Whether data is moving into or out of the cloud, ADF is built to facilitate not only the orchestration of data movement but also the transformations necessary to move your data to and from a variety of sources and destinations. The currently supported sources and destinations are: SQL Server (on-premise), Azure Blobs, Azure Tables and Azure SQL databases. Please note also that ADF can be used to move data from one cloud source to another.
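To give you a feel for what this looks like, ADF pipelines are authored as JSON documents. The sketch below is purely illustrative: the pipeline, dataset and table names are made up, and the exact property names and activity types may differ from the current preview schema, which is still evolving.

```json
{
  "name": "CopyOnPremToBlobPipeline",
  "properties": {
    "description": "Copies rows from an on-premise SQL Server table to Azure blob storage",
    "activities": [
      {
        "name": "CopySqlToBlob",
        "type": "CopyActivity",
        "inputs": [ { "name": "OnPremSqlOrdersTable" } ],
        "outputs": [ { "name": "OrdersBlobOutput" } ],
        "transformation": {
          "source": { "type": "SqlSource", "sqlReaderQuery": "SELECT * FROM Orders" },
          "sink": { "type": "BlobSink" }
        }
      }
    ]
  }
}
```

The datasets referenced by name here (and the linked services that describe the underlying SQL Server and storage account connections) are defined in their own separate JSON documents, which keeps the pipeline definition focused purely on the movement and transformation logic.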
- Processing data at scale (HDInsight)
While moving data from point A to point B is useful in a number of situations, often it is necessary to transform and enrich your data along the way. Handling this processing can be non-trivial, especially when scale is an issue. Azure Data Factory can handle processing data of any size by leveraging HDInsight. Using either a cluster of your own or one that Azure Data Factory creates on demand, you can process, transform and enrich your data using C#, Pig, Hive and even MapReduce.
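An HDInsight-backed transformation is expressed the same way, as a JSON pipeline whose activity points at a script rather than a copy source and sink. Again, treat this as a hedged sketch rather than the definitive preview schema; the cluster name, dataset names and script path are all hypothetical.

```json
{
  "name": "TransformLogsPipeline",
  "properties": {
    "description": "Runs a Hive script on an HDInsight cluster to clean and enrich staged blob data",
    "activities": [
      {
        "name": "RunHiveScript",
        "type": "HDInsightActivity",
        "inputs": [ { "name": "RawLogsBlob" } ],
        "outputs": [ { "name": "CleanLogsBlob" } ],
        "linkedServiceName": "MyHDInsightCluster",
        "transformation": {
          "type": "Hive",
          "scriptPath": "scripts/clean-logs.hql"
        }
      }
    ]
  }
}
```

Swapping `MyHDInsightCluster` for an on-demand linked service is what lets ADF spin a cluster up just for the duration of the run and tear it down afterwards.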
- Integrate and enrich disparate data
The diversity of data and how you bring it all together presents interesting and unique challenges. How do you process that JSON data from your social media platform to enrich the more traditional, relational data from your on-premise CRM system? What about all that XML data provided by your partner, or that old-school order management system? With Azure Data Factory, you can build both complex (and trivial) workflows, orchestrating the outputs from one pipeline into the inputs of another. For example, use one pipeline to shred that JSON data with a Pig script, while another loads your transactional CRM data from an on-premise SQL Server into Azure blob storage. A third pipeline could then take both outputs, integrating them together using Hive or a Custom Activity written in C#.
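The chaining in that example is expressed simply by reusing dataset names: the third pipeline lists the output datasets of the first two as its inputs, and ADF works out the dependency ordering from there. As before, this is an illustrative sketch with invented names (`ShreddedJsonBlob`, `CrmDataBlob`, `MyHDInsightCluster` and the Hive script path are all assumptions), not the exact preview schema.

```json
{
  "name": "IntegratePipeline",
  "properties": {
    "description": "Joins the shredded social JSON with the copied CRM data using Hive",
    "activities": [
      {
        "name": "JoinSocialAndCrm",
        "type": "HDInsightActivity",
        "inputs": [
          { "name": "ShreddedJsonBlob" },
          { "name": "CrmDataBlob" }
        ],
        "outputs": [ { "name": "IntegratedOutput" } ],
        "linkedServiceName": "MyHDInsightCluster",
        "transformation": {
          "type": "Hive",
          "scriptPath": "scripts/join-social-crm.hql"
        }
      }
    ]
  }
}
```

Because the dependency is declared through shared datasets rather than hard-coded scheduling, each pipeline stays independently deployable and testable.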
- Orchestrate & Monitor data processing workflows
If you've done any amount of ETL work, you know firsthand that it's one thing to build a workflow and quite another to fully operationalize it. Too often in ETL work, things like logging and monitoring are afterthoughts. Not so with Azure Data Factory. Whether you use the Azure Portal dashboard or PowerShell, you not only have robust access to the details of every process or workflow in your Data Factory but also out-of-the-box alerting in the event that something does go awry.
So now that you have an idea of what Azure Data Factory is, I suspect that some of you, like me, are itching to start using it. Over the next couple of posts, we will dig into the Data Factory, explore specific use cases and walk through functional examples. If you would like to get a jump start, I'd invite you to check out the documentation and examples over at: http://azure.microsoft.com/en-us/documentation/services/data-factory/.
Lastly, keep in mind that Azure Data Factory is currently in preview. If you haven't previously worked with any of the Azure services in preview, that basically means it will be subject to change with little to no notice as it evolves toward a more formal GA release.