Learn SQL 2016 Polybase with HDP - Part 1 - Build the Environment

I have been working with Polybase in Parallel Data Warehouse for a few years now, so I was quite excited to hear about this great feature making it into SQL 2016. It is a great was to use tools and a language you already know (TSQL) to be able to access data stored in Hadoop. In this article, I am going to walk you through how I set up my own environment in Azure to test Polybase, and learn a little more about Hortonworks Hadoop.

You read correctly…..I am using Hortonworks Hadoop. Polybase is a great feature to use to bridge between SQL Server databases and Azure Blob Storage or HDInsight. Many customers already have huge investments in on prem Hadoop utilizing other Hadoop installations like Cloudera or Hortonworks. In order to provide the best guidance to my customers, I need to understand their whole environment. Therefore, I put together my own environment using my Azure benefits with my MSDN Subscription. Luckily, Azure has a VM template for a Hortonworks sandbox.

Here are some tips and resources on how to create the environment to use Polybase to connect to the Hortonworks sandbox. Once this is complete, you can use the test data in the sandbox to practice using Polybase. In my next post, I will show you how you can mimic some of the Hortonworks Sandbox labs with Polybase to get an idea of how it works.

    1. Create a Resource Group: I find that it is best to create everything I need for a particular lab setup within 1 resource group in Azure. That way when I am done with it, I can just delete the entire resource group and everything in it. I named mine RG_Polybase

    2. Create Windows Server 2012 R2 VM with SQL 2016 within RG_Polybase.

    3. Install the Polybase feature on the SQL VM. Note the SQL install media will be in a folder on the C drive. For this version it is c:\SQLServer_13.0_Full\ . Since this is a test environment, I just used the install GUI and default options for Polybase. There is a prerequisite of installing Java Runtime . You will need to make sure and install the 64bit Java runtime. The default download on the Java download site will be a 32 bit, so you will need to make sure you select the correct one.

    4. Configure Polybase for Hortonworks Hadoop: (hadoop connectivity reference)

    5. Restart SQL Server (and Polybase services) for the configuration change to take effect.

    6. Create HW Sandbox VM within RG_Polybase resource group. This will ensure that the 2 VMs will easily be able to connect to one another.

      This link from Hortonworks walks you right through the process: https://hortonworks.com/hadoop-tutorial/deploying-hortonworks-sandbox-on-microsoft-azure/

The sandbox has a maria_dev user for you to use along with the demos. Also, notice the address that I used is a 10.0.0.X address. To keep things simple and local to the Azure Resource Group, I am using the local IP addresses for the VMs.

NOTE: If you want to log into the Ambari portal as an admin, you will need to do the following:

  • ssh into the VM and run the following command as the root user (sudo): ambari-admin-password-reset
  • You will be prompted to enter the password. Ambari server will restart.

Your command window will look something like this when it is complete:

You can then log in to the Ambari portal as the admin user. To connect to Ambari you just browse using the IP of the sandbox and specify port 8080

TIP: I downloaded Putty on my SQL 2016 VM and use that and IE within my RDP session on that VM to do anything I need on the Hadoop sandbox. That way I can use the internal IPs, which will be faster.

NOTE: If you want to have everything on premises, you can build your own SQL 2016 VM and download a Hortonworks Sandbox VM, but I have personally switched to using Azure for all my lab environment needs as it is so easy to create/destroy/recreate lab environments at will.

In my next post… I will show you how you can use the sandbox and SQL 2016 to test Polybase queries against Hadoop data sources.