The SQL Server 2012 Parallel Data Warehouse is ready to roll out and it is a BigData big deal

The SQL Server 2012 Parallel Data Warehouse (PDW 2012) contains a number of impressive improvements one of which is integration with Hadoop. PDW V1 featured integration with Hadoop by implementing a SQOOP based import/export connector. This new feature is named Polybase and in short, Ploybase allows you to issue a TSQL query that accesses data stored in a Hadoop cluster and in the PDW. Polybase replaces the SQOOP connectors and implements parallel data transfers between the PDW nodes and a Hadoop cluster HDFS nodes making data movement massively parallel and getting even more utilization out of the compute nodes.  This is an impressive and technically fascinating improvement.

Ploybase allows for a high performance integrated query across Hadoop and relational data using two key components - External Tables and the HDFS Bridge.

 

Here is some example SQL that is accessing ClickStream data stored in Hadoop and additional URL data stored in a dimensional table in the PDW.

 

SELECT
url.description

FROM hdfsClickStream cs

JOIN UrlDescription url

ON cs.url = url.name

WHEREcs.url = 'www.cars.com'

 

How does it work?

  1. Polybase uses an internal representation of data structure stored in PDW metadata called an External Table to see tables in Hadoop. Once a Hadoop data source has been defined as an External Table the user can access the table in the same fashion as a relational table.
  2. To satisfy a query Polybase will import/export Hadoop data to and from the PDW to take the best advantage of the available computing power.  This data transfer will be performed by the individual PDW Compute Nodes communicating directly with the Hadoop HDFS nodes for massively parallel data transfers.
  3. Results are brought back (also in parallel) to the PDW and are temporarily (or permanently) stored on PDW. The PDW compute nodes complete processing and results are returned.  

External Tables

An EXTERNAL TABLE structure is what imposes structure upon the Hadoop data and identifies its location to PDW. The CREATE EXTERNAL TABLE statement defines the structure of the data being accessed and in the WITH clause locates the data in HDFS and specifies any formatting options. This metadata is stored on the PDF.  External Table metadata can be queried using the DMV sys.pdw_external_tables. External tables also appear in SSMS object browser.

CREATE EXTERNAL
TABLE hdfsClickStream (

    url varchar(50),

    event_date date,

    user_IP varchar(50)

)

WITH (LOCATION = 'hdfs://192.168.2.154:5000/tpch1GB/click.txt',FORMAT_OPTIONS ( FIELD_TERMINATOR = '|')

)

 

Once your external table is defined you may query it directly with TSQL and perform joins to PDW tables

SELECT url.description

FROM hdfsClickStream cs

JOIN UrlDescription url

ON cs.url = url.name

WHERE cs.url = 'www.cars.com'

HDFS Bridge

The HDFS Bridge provides an abstraction layer over HDFS. HDFS Bridge has been added to the Data Movement Service (DMS) that runs on each PDW node. This is what provides for the massively parallel data transfers.  This allows the HDFS data nodes to be queried concurrently by each compute node. Data moves in parallel directly between PDW Compute nodes and HDFS nodes. 

PolyBase was created in Jim Gray Systems Labs by  David DeWitt, former Professor Emeritus of Computer Sciences.  Dr. DeWitt is known for pioneering research in parallel databases, benchmarking, object-oriented and XML databases.  

I highly recommend checking out the lab's web site at https://gsl.azurewebsites.net/Projects/Polybase.aspx

And if you want to watch a good session tune in to David DeWitt's PASS 2012 Day 3 Spotlight Session  (Day 3 Spotlight Session with David DeWitt).