Incremental data load from Azure Table Storage to Azure SQL using Azure Data Factory

 

Azure Data Factory is a cloud-based data integration service. The service not only helps to move data between cloud services but also helps to move data to and from on-premises stores, for example moving data from Azure Blob storage to Azure SQL. You can find the supported data stores here.

Many business scenarios start with an initial data load, followed by continual incremental loads on an hourly, daily, or monthly basis. The focus of this post is to explain how to create the ADF solution and how incremental data can be loaded.

Prerequisites

  • An Azure subscription
  • An Azure Table storage account which hosts the source data
  • An Azure SQL Database which holds the destination table
  • PowerShell or the .NET SDK if you want to create the ADF solution from those platforms. In this blog we'll be using the Azure portal, so neither is required.

 

Setting up the initial scenario

Every day, new data lands in an Azure Table and needs to be copied into an Azure SQL Database on some periodic basis, daily, hourly, or monthly depending on the schedule desired. We don't want to upload the entire table each time; rather, we just want to add the new data at the destination. In this example I will focus on the Employee data defined in the next section.

Azure Table storage

Here is the sample data that will be our Copy Activity source. In this example we will be using Azure Storage Explorer.

image

Here is the schema defined for records.

image

PartitionKey, RowKey and Timestamp are automatically included for every entity. The PartitionKey and RowKey values are the user's/developer's responsibility to fill in, whereas the Timestamp value is managed by the server and can't be changed or modified. So if the schema doesn't have a property that records when an entity was added, the Timestamp property can be used. For more information about the Table service data model please click here.

As shown in the above screenshot, Azure Storage Explorer 6.0.0 doesn't show the Timestamp property.

Azure SQL Database

Below is the table defined in the Azure SQL Database that will be our Copy Activity sink (destination).

image

Azure Data Factory Solution

In this section, we will be creating an Azure Data Factory.

Creating ADF Service

image

  • Enter Azure Data Factory Name
  • Select Subscription
  • Select or create a new resource group name

image

  • Select Pin to Dashboard
  • Click Create
  • Once the ADF is created, a tile will be added to the home screen.

image

  • Click Author and deploy

image

 

Creating Data Sources

Azure Storage

    • Click New data store

             image

    • Select Azure storage
    • Provide storage account name and account key

             image

    • Click Deploy
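
For reference, the Azure Storage linked service that Author and deploy generates looks roughly like the sketch below. The linked service name and the placeholder account name/key are illustrative, not values from the original post:

{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<storage account name>;AccountKey=<account key>"
        }
    }
}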

Azure SQL

    • Click New data store
    • Click Azure SQL

             image

    • Provide Server name
    • Provide Initial Catalog (database name)
    • Provide user id and password

            image

    • Click Deploy
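
Similarly, the Azure SQL linked service JSON will look roughly like the following sketch; the server, database, and credential values are placeholders:

{
    "name": "AzureSqlLinkedService",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:<server name>.database.windows.net,1433;Database=<database name>;User ID=<user id>@<server name>;Password=<password>;Encrypt=True;Connection Timeout=30"
        }
    }
}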

Creating Dataset

Azure Table

    • Click New dataset

             image

    • Select Azure table
    • Refer to the Datasets article to understand properties like structure, published, type, typeProperties, etc.

            image

    • Click Deploy
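
The dataset screenshot is not reproduced here, so as a rough sketch, an Azure Table dataset for this scenario could look like the JSON below. The dataset name, the column names in structure, and the table name are assumptions based on the Employee example, not copied from the original screenshots:

{
    "name": "EmployeeTableDataset",
    "properties": {
        "type": "AzureTable",
        "linkedServiceName": "AzureStorageLinkedService",
        "structure": [
            { "name": "FirstName", "type": "String" },
            { "name": "LastName", "type": "String" },
            { "name": "RecordAddedDate", "type": "Datetime" }
        ],
        "typeProperties": { "tableName": "Employee" },
        "external": true,
        "availability": { "frequency": "Day", "interval": 1 }
    }
}

Here external: true marks the dataset as produced outside the data factory (it is only consumed by the pipeline), and availability defines how often a slice is available, Day/1 to match the daily schedule used later.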

Azure SQL

    • Click New dataset

             image

    • Select Azure SQL

             image

    • Click Deploy
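
Again as a sketch (the dataset and table names are illustrative assumptions), the Azure SQL dataset could look like:

{
    "name": "EmployeeSqlDataset",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": "AzureSqlLinkedService",
        "typeProperties": { "tableName": "dbo.Employee" },
        "availability": { "frequency": "Day", "interval": 1 }
    }
}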

Creating Pipeline

  • Click New pipeline

image

  • Click Add activity and select Copy activity

image

  • Add the JSON for the copy activity. It will look like the screenshot below once done (a sketch of this pipeline JSON is included at the end of this section).

image

  • Click Deploy. Once deployed, the diagram view will look like the one below.

image

Once this pipeline is deployed, it will start processing the data based on the start and end times. In this blog we define the scheduler frequency as Day and the interval as 1, which means it will run once daily. We also define start and end times for this pipeline, and as per that definition it will run for only one day. For more information on pipeline execution please refer to Scheduling and Execution with Data Factory.
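
Since the pipeline screenshot is not reproduced here, below is a rough sketch of what the deployed pipeline JSON could look like. The pipeline, activity, and dataset names, the sink settings, and the start/end dates are illustrative assumptions, not values from the original post:

{
    "name": "EmployeeCopyPipeline",
    "properties": {
        "description": "Copy Employee data from Azure Table storage to Azure SQL",
        "activities": [
            {
                "name": "CopyEmployeeData",
                "type": "Copy",
                "inputs": [ { "name": "EmployeeTableDataset" } ],
                "outputs": [ { "name": "EmployeeSqlDataset" } ],
                "typeProperties": {
                    "source": { "type": "AzureTableSource" },
                    "sink": { "type": "SqlSink" }
                },
                "scheduler": { "frequency": "Day", "interval": 1 },
                "policy": { "concurrency": 1, "retry": 3 }
            }
        ],
        "start": "2016-01-08T00:00:00Z",
        "end": "2016-01-09T00:00:00Z"
    }
}

With frequency Day and interval 1 the activity produces one slice per day, and a start/end window like the one above covers a single day, which matches the single execution shown next.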

Go back to the Data Factory we created in the web portal. Click Pipelines and select the pipeline we just created.

 

image

On the summary page, click Datasets and then Consumed. Notice that the Pipeline Activities detail shows it executed once.

image

Go back and click Datasets, then Produced. Notice the slice is in the "Ready" state, meaning data has been transferred. Sometimes it will show Pending Execution, which means it is waiting to execute; In Progress means it is still running. You may notice an Error state if there is any error while designing or executing the pipeline.

image

Checking Execution

You can use SQL Server Management Studio to confirm whether the data arrived. Run a SELECT query on the destination table to confirm.

image

Incremental load

Now that we have all the data loaded into the destination, the next step is to move incremental data on a daily basis rather than deleting and re-inserting the whole set. In this example, we will add some additional records with different dates and will insert data only for a specific date. Below is the sample data added in Azure Table storage.

image

As an example, for the next date we'll move data only for 9-January-2016. The first change we will make in the pipeline is to fetch only the specific data from the source. To do this, we'll add the azureTableSourceQuery property below to the copy activity's source in the data factory pipeline JSON.

"azureTableSourceQuery": "$$Text.Format('RecordAddedDate eq datetime\\'{0:yyyy-MM-ddTHH:mm:ss}\\'', SliceStart)"

The above query filters records on the RecordAddedDate property based on the slice start time. $$ is used to invoke Data Factory macro functions. Since RecordAddedDate is a date-time property, we need to add the datetime prefix to cast the value. In this example we filter records where RecordAddedDate equals the SliceStart date. We can also pass SliceStart, SliceEnd, WindowStart, or WindowEnd as the parameter.

Note: if the RecordAddedDate property is not defined on the table entity, we can use the Timestamp property instead. Please refer to the Azure Table storage section at the beginning of this blog.
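
For context, here is roughly where this property sits in the copy activity once the change is made (only the source section changes; the rest of the pipeline stays as deployed earlier, and the sink settings shown are the same illustrative ones used above):

"typeProperties": {
    "source": {
        "type": "AzureTableSource",
        "azureTableSourceQuery": "$$Text.Format('RecordAddedDate eq datetime\\'{0:yyyy-MM-ddTHH:mm:ss}\\'', SliceStart)"
    },
    "sink": { "type": "SqlSink" }
}

With this in place, the slice for 09-Jan-2016 picks up only the rows whose RecordAddedDate falls on that date.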

Add changes and deploy

image

Once the JSON is deployed, and the Produced dataset state is in Ready status, you can query the SQL Server table to see the second copy activity’s output. Notice it has only transferred data for 09-Jan-2016.

image

Generally, with the scheduler defined this way, the ADF pipeline executes at 12:00:00 AM daily. If you need to run the pipeline at a different time instead of the default (12:00:00 AM), say 6:00:00 AM, add the "offset": "06:00:00" property.
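
As a sketch, the scheduler section of the copy activity would then look like the following; in ADF v1 the matching offset typically also needs to be set in the output dataset's availability section:

"scheduler": {
    "frequency": "Day",
    "interval": 1,
    "offset": "06:00:00"
}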

Thanks to my colleague Jason for reviewing this post.

Hope it’s helpful.