APS PolyBase to Azure Blob Store

With the continued growth of Azure and the array of tools available in Azure for advanced data analytics, the need to move data to Azure has become ever more common.

One of the simplest methods to do so from the APS appliance is through PolyBase queries, where you can move data to the Azure blob store via a CETAS (CREATE EXTERNAL TABLE AS SELECT) operation.

Prerequisites

Before you start configuring the appliance and creating objects, ensure that the following prerequisites have been completed.

  • Network connectivity to *.blob.core.windows.net over ports 80 and 443 from the Control Node
  • An Azure Storage account has been configured

Configuring Azure Storage Account

To successfully write to the Azure Storage account, the following configuration guidelines should be followed.

Classic Storage Account Deployment Model

The only requirement when making use of a Classic storage account is to ensure that one of the following replication models is used:

  • Locally Redundant Storage (LRS)
  • Geo-Redundant Storage (GRS)

When completing the Classic storage account configuration, I used pdwtest as my storage account name, which creates pdwtest.blob.core.windows.net. The account specified can be any valid storage account name. I opted for LRS and used a new resource group; you can use existing resource groups within your subscription.

Once the storage account has been created, complete the following actions to prepare your storage account for the CETAS operation.

Open the storage account from the Storage Accounts list and select Blobs under Services.

It will state that you don’t have any containers yet. Create a container by selecting +Container, provide a valid name for it, and leave the access type set to Private.

Once created, you will have a container within the blob store to which your files will be written.

The last step is to obtain the storage account key, which you will need when updating core-site.xml to authenticate to the blob store. It can be found on the Access Keys tab under Settings on your storage account. Azure provides a primary and a secondary access key; copy either one and paste it into a temporary notepad file, as you will need it later.

Resource Manager Deployment Model

The Resource Manager deployment is identical to the classic storage account with only one exception: when creating the storage account, it is important to select “General purpose” and NOT “Blob storage” from the Account kind drop-down. (The Account kind selection is only available with Resource Manager based storage accounts.)

Complete the storage account creation as specified for the classic storage account, then proceed with creating the container and obtaining the storage account key from the Access Keys section.

PolyBase Configuration

Completing the PolyBase configuration is detailed in the APS help file; however, for completeness I will provide the steps within this post as well.

Connect to the appliance using a query tool such as SQLCMD or SQL Server Data Tools, with elevated permissions, to ensure that you can perform the following configuration changes:

EXEC sp_configure 'hadoop connectivity', 4;
RECONFIGURE;

Option 4 enables connectivity to Microsoft Azure blob storage.

For detailed information on the various options, refer to the APS help file ‘Configure Polybase Connectivity to External Data’.
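Before restarting, you can ask sp_configure to display the option and confirm the value took; a minimal check, assuming sp_configure reports configured and run values on the appliance the way it does in SQL Server:

-- Display the configured and run values for the option
EXEC sp_configure 'hadoop connectivity';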

Once you have made the configuration change, modify the following file on the PDW Control Node:

C:\Program Files\Microsoft SQL Server Parallel Data Warehouse\100\Hadoop\conf\core-site.xml

Add the following property, with its name and value elements, to the core-site.xml file:

<property>
  <name>fs.azure.account.key.<your storage account name>.blob.core.windows.net</name>
  <value>your storage account access key</value>
</property>

As per the exercise, my settings would be:

<property>
  <name>fs.azure.account.key.pdwtest.blob.core.windows.net</name>
  <value>the access key copied during the creation process</value>
</property>

Save the changes to core-site.xml.

Once you have completed the configuration changes, the PDW region must be restarted for the changes to take effect. Do so during a maintenance window to prevent any unexpected outage.

Performing the PolyBase Query

Once the region has been restarted and the appliance is configured to perform the CETAS query, we must create the external data source and file format that will be used to write the file.

Prerequisites

There are two prerequisites for creating an external table once the appliance has been configured:

  • Creating an external data source
  • Creating an external file format

Create External Data Source

From the user database from which I will be running my CETAS statement, I create the external data source pointing to my polytest container in my pdwtest storage account. (I used the AdventureWorksPDW2012 database as my data source.)

CREATE EXTERNAL DATA SOURCE MyAzureDS
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://polytest@pdwtest.blob.core.windows.net'
);

Once I have created my external data source, I need to define the file format of my target file on the Azure blob store.

For detailed information on external data sources, refer to the APS help file “Create External Data Source” or the MSDN article https://msdn.microsoft.com/en-us/library/dn935022.aspx
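You can also confirm that the data source was created by querying the catalog; a quick sanity check, assuming the sys.external_data_sources catalog view is available on your appliance version as it is in SQL Server 2016:

-- List the external data sources defined in the current database
SELECT name, type_desc, location
FROM sys.external_data_sources;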

 

Create External File Format

The external file format defines how the data will be stored on the Azure blob store: the format type, field terminator, string delimiter, date format, and data compression, if required.

For detailed options on the various file formats, refer to the APS help file “Create External File Format (SQL Server PDW)” or the MSDN article https://msdn.microsoft.com/en-us/library/dn935026.aspx
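As an illustration of those options, a more fully specified delimited format might look like the sketch below. PipeFormatExplicit is a hypothetical name, and the exercise that follows uses a simpler format relying on the defaults; consult the article above for the authoritative option list.

CREATE EXTERNAL FILE FORMAT PipeFormatExplicit
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',      -- pipe-delimited fields
        STRING_DELIMITER = '"',      -- strings wrapped in double quotes
        DATE_FORMAT = 'yyyy-MM-dd',  -- format used for date columns
        USE_TYPE_DEFAULT = TRUE      -- use type defaults instead of NULL for missing values
    ),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'  -- optional gzip compression
);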

For exercise purposes, I have created a pipe-delimited format. With DELIMITEDTEXT, the pipe character is the default field terminator, so it does not need to be specified explicitly.

CREATE EXTERNAL FILE FORMAT PipeFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT
);

Performing CETAS Operation

For details on the CETAS operation and syntax, refer to https://msdn.microsoft.com/en-us/library/mt631610.aspx

Now that all the requirements have been met, I am ready to execute my CETAS statement.

From my source user database, I execute the following statement:

CREATE EXTERNAL TABLE MyWasbTest
WITH (
    LOCATION = '/dimcust',     -- target folder to write to
    DATA_SOURCE = MyAzureDS,   -- the external data source created earlier
    FILE_FORMAT = PipeFormat,  -- the file format created earlier
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 10
)
AS
SELECT TOP 10 *
FROM dim.Customer;

Once the statement has completed successfully, the file and its contents will be visible from the Azure Dashboard in the target container.

The actual file can be found within the dimcust folder, named with the QueryID of the query that performed the CETAS operation.
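The external table also remains available on the appliance, which makes for a quick sanity check; a minimal sketch, reading the exported rows back and cleaning up afterwards:

-- Read the exported rows back through the external table
SELECT *
FROM MyWasbTest;

-- Dropping the external table removes only the metadata on the appliance;
-- the files in the blob container are left in place
DROP EXTERNAL TABLE MyWasbTest;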

Conclusion

With little effort, you can move data to the Azure blob store for consumption within Azure DW or other services currently available in Azure.

The article did not cover any specific user security requirements within the appliance; for detailed information, please refer to the following MSDN article:

https://msdn.microsoft.com/en-us/library/mt631610.aspx
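For reference, CETAS requires the executing user to hold several permissions in the destination database; a hedged sketch of the grants involved is below, with [LoadUser] as a hypothetical database user (consult the article above for the authoritative list):

-- [LoadUser] is a hypothetical database user; the exact permission set
-- is described in the CETAS documentation linked above
GRANT CREATE TABLE TO [LoadUser];
GRANT ALTER ANY SCHEMA TO [LoadUser];
GRANT ALTER ANY EXTERNAL DATA SOURCE TO [LoadUser];
GRANT ALTER ANY EXTERNAL FILE FORMAT TO [LoadUser];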

The APS 2016 Online documentation can be found here

https://msdn.microsoft.com/sql/analytics-platform-system/index

The APS AU5 Help file can be found here

https://www.microsoft.com/en-us/download/details.aspx?id=51610