Big Data and Advanced Analytics Virtual Workshop


I had a fantastic day (well, night in my timezone!) at the Big Data and Advanced Analytics Virtual Workshop – Day 1. You can download the slides and video recording from here.

I did two demos in “Session 5: Using Azure SQL Data Warehouse”.

The first demo was to set up an Azure SQL Data Warehouse, explain the pause and resume feature, scale the Data Warehouse, and push some data into it using bcp. I used the following scripts and this file.
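The pause/resume and scaling steps were done interactively in the Azure portal; as a rough sketch, scaling can also be driven from T-SQL by changing the service objective (the database name `MyDW` and target `DW200` below are placeholders, and the statement is run against the logical server's master database):

```sql
-- Hypothetical example: scale the warehouse to the DW200 performance level.
-- 'MyDW' and 'DW200' are placeholders; run from the master database.
ALTER DATABASE MyDW MODIFY (SERVICE_OBJECTIVE = 'DW200');
```

Pausing and resuming, by contrast, is done from the portal or Azure PowerShell rather than T-SQL.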

Create table in SQL DW

CREATE TABLE DimDate
(
DateId INT NOT NULL,
CalendarQuarter TINYINT NOT NULL,
FiscalQuarter TINYINT NOT NULL
)
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = ROUND_ROBIN
);

Run the bcp utility

bcp DimDate in <your directory>\DimDate2.txt -S <your DW Server name> -d <your DW Name> -U <your DW Username> -q -c -t ','

Run SELECT query

SELECT * FROM DimDate

 

The second demo was more interesting. It involved joining tables pointing at two other data sources: the first being Azure Blob storage and the second being an ORC table in HDInsight (which uses Azure Blob storage underneath!).

For the Azure-Blob-as-a-data-source part, I created a container in my storage account and moved this file into it. Back in SQL Server Management Studio (SSMS), I ran the following queries.

Create a data source for Azure Blob

CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
TYPE = HADOOP,
LOCATION = 'wasbs://<yourcontainer>@<your-storage-account-path>',
CREDENTIAL = AzureStorageCredential
);

The credential 'AzureStorageCredential' was pre-created by running the following queries.

CREATE MASTER KEY;

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
IDENTITY = 'user',
SECRET = '<your storage key>';

Create external table

CREATE EXTERNAL TABLE [asb].DimProduct
(
[ProductID] [int] NOT NULL,
[ProductName] [nvarchar](255) NULL,
[ProductDescription] [nvarchar](500) NULL
)
WITH
(
LOCATION = '/Products.txt',
DATA_SOURCE = AzureStorage,
FILE_FORMAT = PipeDeliemitedTextFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
)

The schema [asb] and the FILE_FORMAT PipeDeliemitedTextFileFormat were pre-created using the following queries.

CREATE SCHEMA [asb]

CREATE EXTERNAL FILE FORMAT PipeDeliemitedTextFileFormat
WITH
( FORMAT_TYPE = DELIMITEDTEXT
, FORMAT_OPTIONS ( FIELD_TERMINATOR = '|'
, STRING_DELIMITER = ''
, DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff'
, USE_TYPE_DEFAULT = FALSE
)
);

Show data from external table

SELECT * FROM [asb].DimProduct

This showed how data residing in Azure Blob storage can be queried straight from SQL Data Warehouse.
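A common follow-on (not part of this demo) is to materialise the external table inside the warehouse with CREATE TABLE AS SELECT, so subsequent joins do not have to reach out to blob storage; the local table name below is a placeholder:

```sql
-- Hypothetical sketch: copy the external table into SQL Data Warehouse
-- using CTAS. [dbo].DimProduct_Local is a made-up name for illustration.
CREATE TABLE [dbo].DimProduct_Local
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = ROUND_ROBIN
)
AS SELECT * FROM [asb].DimProduct;
```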

 

For the HDInsight part, I needed to create an HDInsight cluster and set it up. I ran the following queries in the Hive editor of HDI.

Create an external table in HDI

CREATE EXTERNAL TABLE IF NOT EXISTS <your HDI database>.salestxt
(
salesdate int,
productid int,
salesamt float
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n' STORED AS TEXTFILE;

The database was pre-created using this script.

CREATE DATABASE IF NOT EXISTS <your HDI database>;

Load data into external table

LOAD DATA INPATH 'wasb:///Sales.txt' INTO TABLE <your HDI database>.salestxt;

I had created this file as Sales.txt. For the load to work, I had to keep it at the root of the HDI container in my storage account.

Create ORC table

CREATE TABLE IF NOT EXISTS <your HDI database>.salesorc
(
salesdate int,
productid int,
salesamt float
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS ORC;

Load data into ORC table

INSERT OVERWRITE TABLE <your HDI database>.salesorc SELECT * FROM <your HDI database>.salestxt;

Once the HDI setup was done, I configured SQL Data Warehouse to connect to it using the following queries.

Create data source for HDInsight

CREATE EXTERNAL DATA SOURCE ORCAzureStorage
WITH
(
TYPE = HADOOP,
LOCATION = 'wasbs://<your HDI storage container>@<your storage account path>',
CREDENTIAL = AzureStorageCredential
);

The credential 'AzureStorageCredential' was re-used from the Azure Blob data source.

Create external table

CREATE EXTERNAL TABLE [hdp].FactSales (
salesdate int,
productid int,
salesamt float
)
WITH (
LOCATION = '/hive/warehouse/<your HDI database>/salestxt',
DATA_SOURCE = ORCAzureStorage ,
FILE_FORMAT = PipeDeliemitedTextFileFormat
);

The FILE_FORMAT 'PipeDeliemitedTextFileFormat' was re-used from the Azure Blob file format. The schema [hdp] was created using the query CREATE SCHEMA [hdp].

Show data from external table

SELECT * FROM [hdp].FactSales

I then showed the query output obtained by joining tables pointing to the two data sources, i.e. Azure Blob and HDInsight, with a table in SQL Data Warehouse.

SELECT * FROM
[dbo].DimDate a,
[asb].DimProduct b,
[hdp].FactSales c
WHERE
a.dateid = c.salesdate AND
b.productid = c.productid
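The same three-way join can also be written with ANSI JOIN syntax plus an aggregate, for example to total sales per product (this is a variation I am sketching here, not a query from the demo; it uses only the tables and columns defined above):

```sql
-- Hypothetical variation: total sales per product across the three sources.
SELECT b.ProductName, SUM(c.salesamt) AS TotalSales
FROM [hdp].FactSales c
JOIN [asb].DimProduct b ON b.ProductID = c.productid
JOIN [dbo].DimDate a ON a.DateId = c.salesdate
GROUP BY b.ProductName;
```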

 

Day 2 of the event also promises to be exciting, with the following sessions –

  • Session 6: Machine Learning in Azure
  • Session 7: Doing Spark on Azure
  • Session 8: Running Hadoop Clusters on Azure
  • Session 9: Cognitive Services and Bots

If you have not registered already, register here. See you there!
