Different approaches to inject data into Azure SQL Data Warehouse


Different approaches to inject data into Azure SQL Data Warehouse

For the SQL Data Warehouse customers, the primary objective when migrating data is to populate their SQL Data Warehouse database. This process can be achieved in a number of ways, SQL Server Integration Services(SSIS) supports 3 different approaches to inject data into SQL Data Warehouse as below:

Data flow with ADO NET Destination

Bulk insert data into SQL Data Warehouse (Data flow with OLE DB Destination)

Execute SQL Task to trigger Polybase

The setting of Data flow with ADO NET Destination for the SQL Data Warehouse is same with the normal SQL Server Database(https://msdn.microsoft.com/en-us/library/bb895291.aspx), and the performance of Data flow with OLE DB Destination is better than the Data flow with ADO NET Destination, so Data flow with ADO NET Destination will not be discussed in this article. This article will focus on the Data flow with OLE DB Destination and Execute SQL Task to trigger Polybase.

Prerequisites

A SQL Data Warehouse database

SQL Server Data Tools

Loading data into SQL Data Warehouse via Polybase

Polybase is the fastest solution to load data into SQL Data Warehouse. We can use Execute SQL Task to trigger it in SSIS. The limitation of the Polybase is it only supports the flat file on Azure blob as data source, so we need to move our data to Azure blob first.

  1. Create an Azure blob account with container if you don’t have one. (You can get more detail about how to create the Azure blob storage account from here and get more detail about how to create container from here)
  2. There are two approaches to upload your data to the Azure blob via SSIS. Both approaches are depending on the Azure feature pack for integration service, so you need to install the it first.
  • You can use a data flow as below to transfer your data to Azure Blob Destination. It is straightforward. The throughput of using data flow to upload the flat file is about 10M/S.

1

  • You can also use the Azure Blob Upload Task to upload your flat file to Azure blob, the speed of this solution is much faster than using data flow to upload your data to Azure blob destination. If the data source is not flat file, we recommend you to use data flow convert it into flat file first, then use Azure blob upload task to upload it to Azure blob, in order to get better performance. The throughput of Azure blob upload task can reach 200M/S.
    Untitled
  1. Create an external table according to your data source. (more detail about how to create external table can be found here)
  2. Create an Execute SQL Task to load data into SQL Data Warehouse, you can drag an Execute SQL Task to the package and set the SQL statement as below:
    CREATE TABLE [dbo].[Destination] WITH
    (
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = ROUND_ROBIN
    )
    AS SELECT * FROM [dbo].[ DestinationExternal];

Now, you can run the package to upload data into the SQL Data Warehouse. The throughput of the loading data from Azure blob to SQL Data Warehouse via Polybase is more than 200M/S.

Loading data into SQL Data Warehouse via SSIS OLEDB component

 

User also can load data into SQL data warehouse by OLEDB which will use BULK INSERT way to do staging, it just like what you do for SQL SERVER, so you do not need to change your existing packages.

In our test, we run 1 SSIS instances to load data into column store table  and the speed is 13 MB/s.

Meanwhile, you can use multiple partitions to load data into data warehouse and merge partitions into a big table, in our test, we use 32 table partitions with column store index to load data, and the speed is 59.5 MB/s.   Here are steps we did for parallel data loading

  1. Prepare your source data into 32 partitions , we use DBGEN tool to create 32 LINELITEM table data flat file.
  2. Prepare your table in Azure data warehouse, we create a main table and 32 partition tables, here are script for creating these tables
    i.Main table

CREATE TABLE  LINEITEM_SCC  (

[L_ORDERKEY]      BIGINT          NOT NULL,

[L_PARTKEY]       INT             NOT NULL,

[L_SUPPKEY]       INT             NOT NULL,

[L_LINENUMBER]    INT             NOT NULL,

[L_QUANTITY]      DECIMAL (15, 2) NOT NULL,

[L_EXTENDEDPRICE] DECIMAL (15, 2) NOT NULL,

[L_DISCOUNT]      DECIMAL (15, 2) NOT NULL,

[L_TAX]           DECIMAL (15, 2) NOT NULL,

[L_RETURNFLAG]    CHAR (1)        NOT NULL,

[L_LINESTATUS]    CHAR (1)        NOT NULL,

[L_SHIPDATE]      DATE            NOT NULL,

[L_COMMITDATE]    DATE            NOT NULL,

[L_RECEIPTDATE]   DATE            NOT NULL,

[L_SHIPINSTRUCT]  CHAR (25)       NOT NULL,

[L_SHIPMODE]      CHAR (10)       NOT NULL,

[L_COMMENT]       VARCHAR (44)    NOT NULL

)

WITH

(

DISTRIBUTION = ROUND_ROBIN

,   PARTITION   (   [L_ORDERKEY] RANGE LEFT FOR VALUES

(18750000, 37500000, 56250000, 75000000,

93750000, 112500000, 131250000, 150000000,

168750000, 187500000, 206250000, 225000000,

243750000, 262500000, 281250000, 300000000,

318750000, 337500000, 356250000, 375000000,

393750000, 412500000, 431250000, 450000000,

468750000, 487500000, 506250000, 525000000,

543750000, 562500000, 581250000

)

)

)

ii.Partition tables

declare @tblStr varchar(8000)

declare @intpartition int

set @intpartition =1

while @intpartition <= 32

begin

set @tblStr = ‘CREATE TABLE  LINEITEM_SCC’+cast(@intpartition as varchar(10))+’ (

[L_ORDERKEY]      BIGINT          NOT NULL,

[L_PARTKEY]       INT             NOT NULL,

[L_SUPPKEY]       INT             NOT NULL,

[L_LINENUMBER]    INT             NOT NULL,

[L_QUANTITY]      DECIMAL (15, 2) NOT NULL,

[L_EXTENDEDPRICE] DECIMAL (15, 2) NOT NULL,

[L_DISCOUNT]      DECIMAL (15, 2) NOT NULL,

[L_TAX]           DECIMAL (15, 2) NOT NULL,

[L_RETURNFLAG]    CHAR (1)        NOT NULL,

[L_LINESTATUS]    CHAR (1)        NOT NULL,

[L_SHIPDATE]      DATE            NOT NULL,

[L_COMMITDATE]    DATE            NOT NULL,

[L_RECEIPTDATE]   DATE            NOT NULL,

[L_SHIPINSTRUCT]  CHAR (25)       NOT NULL,

[L_SHIPMODE]      CHAR (10)       NOT NULL,

[L_COMMENT]       VARCHAR (44)    NOT NULL

)

WITH

(

DISTRIBUTION = ROUND_ROBIN

,   PARTITION   (   [L_ORDERKEY] RANGE LEFT FOR VALUES

(18750000, 37500000, 56250000, 75000000,

93750000, 112500000, 131250000, 150000000,

168750000, 187500000, 206250000, 225000000,

243750000, 262500000, 281250000, 300000000,

318750000, 337500000, 356250000, 375000000,

393750000, 412500000, 431250000, 450000000,

468750000, 487500000, 506250000, 525000000,

543750000, 562500000, 581250000

)

)

)’

set @intpartition =  @intpartition + 1

exec(@tblStr)

end

  1. Design package and every package running instance is dedicated to ingest one source flat file to one partition table in Azure data warehouse.  For example,  package running instance 1 is used to ingest flat file 1 to partition 1. Here is the package we used to do ingestion:

Data Flow:

3

Data Flow settings:

4

OLEDB component settings:

5

DB connection settings:

“Data Source=***.database.windows.net;User ID=***;Password=***;Initial Catalog=***;Provider=SQLNCLI11.1;Auto Translate=False;Packet Size=32767;”

  1. After source partitions are ingested into partition tables, we merge partition tables into main table, for our test, here are command we used to do merge

declare @tblStr varchar(8000)

declare @intpartition int

set @intpartition =1

while @intpartition <= 32

begin

set @tblStr = ‘ALTER TABLE LINEITEM_SCC’+cast(@intpartition as varchar(10))+

‘ SWITCH PARTITION ‘+cast(@intpartition as varchar(10))+

‘ TO  LINEITEM_SCC PARTITION ‘ +cast(@intpartition as varchar(10)) +’;’

set @intpartition =  @intpartition + 1

exec(@tblStr)

end

  1. Double check all you data are merged into main table.

There are several parameters you want to tune in OLEDB components for different scenarios, for example, you might use heap table or you may want to use multiple process to stage data into same table, please refer to https://blogs.msdn.microsoft.com/sqlcat/2015/03/11/data-loading-performance-considerations-with-clustered-columnstore-indexes/ for more details.

Summary

If the data size is small (such as less than 100G), the performance will not be a problem, we suggest you to use the bulk insert to load data into SQL Data Warehouse, because it is easy to configure. Otherwise, we suggest you to use the Polybase solution. Per our test, without any partition, the throughput of loading flat file into SQL Data Warehouse is about 130M/S. You can also partition the source data into several files to make the data loading more fast. We used 8 SSIS processes running on 4 virtual machines to load data into SQL Data Warehouse, the throughput can reach 300M/S.


Comments (1)

Skip to main content