NOTE This post is part of a series on a deployment of Apache Drill on the Azure cloud.
With my Drill cluster deployed to the Azure cloud, another potential source of data is Azure HDInsight, Microsoft's managed Hadoop offering. HDInsight makes use of WASB storage so that if the structure of my data was pretty straightforward, I might by-pass HDInsight and just go directly to the files in WASB to obtain access. But I may prefer to surface the data through Hive if wished to make use of file processing/interpretation instructions embedded in Hive table definitions.
NOTE Drill officially supports Hadoop connectivity to either Hive or HBase. I address HBase connectivity here.
There are two official ways to access Hive data through Drill and only one of these - what the Drill documentation refers to as the Remote Metastore Configuration - is recommended for production systems. I'll focus on this configuration in this post.
That said, I would like to point out that Drill supports JDBC connections and Hive comes with a JDBC driver. I have been able to configure a JDBC connection from Drill to Hive but find that Drill is generating invalid Hive QL SELECT statements. I've opened a JIRA ticket on this issue and hope this becomes yet another option down the road.
Returning to the Remote Metastore Configuration, it helps to understand that what Drill is really doing is reading the table definitions from the Hive metastore and then engaging the data files associated with these tables directly. There are a few implications to this.
First, Drill must be able to speak to the Thrift service which provides access to the metastore content. Second, Drill must be able to access the files under the Hive tables. And finally, Drill must have access to the SerDe's used by Hive in order to process the data. With all this in mind, I can now establish connectivity to Hive in Azure HDInsight with the following steps:
- Deploy Azure HDInsight cluster to a Virtual Network accessible by Drill
- Configure WASB access within Drill
- Deploy required SerDes to Drill servers
- Configure Hive access within Drill
- Confirm Hive access within Drill
Deploy Azure HDInsight Cluster to an Accessible Virtual Network
My first action is to deploy my Azure HDInsight cluster to a virtual network that's accessible by Drill. This is important because the Thrift service used to access the Hive metastore is not accessible from outside the ring-fence created by the virtual network that surrounds Azure HDInsight.
If I'm deploying a new cluster, I might deploy it into the same virtual network used by Drill. (In this scenario, I might deploy the cluster to a separate subnet within that Virtual Network just for manageability purposes.) If I have an existing Azure HDInsight cluster in a virtual network and have deployed Drill to a separate virtual network, I might configure the two virtual networks to leverage a vnet-to-vnet VPN tunnel per these instructions.
Configure WASB Access within Drill & Deploy Required SerDes to Drill Servers
I'll combine these steps as these actions must be performed on each Drill server. Following the instructions provided here, I need to configure access to all WASB accounts used by Azure HDInsight. In addition, I would need to copy each SerDe explicitly leveraged by my Hive tables from the /usr/hdp/current/hive-client/ directory on one of my Azure HDInsight head nodes to the /drill/current/jars/3rdparty/ directory on my Drill server. Assuming I want to leverage the built-in RegEx SerDe, I could copy it to a Drill server with the following command:
sudo scp firstname.lastname@example.org:/usr/hdp/current/hive-client/regepx-1.3.jar .
In the scp command above, be sure to provide the name of the SSH user account created with the cluster and substitute the name of your cluster for myclustername. (Also, don't forget the space+period at the end of the command; I always seem to forget to include this with scp.)
NOTE I have the sense that copying the SerDes will be the trickiest part of this whole configuration. I suggest you come back to this once you have established basic connectivity. The test at the bottom of this post leverages a built-in sample Hive table for which no SerDes need to be transferred between the environments.
Configure Hive Access within Drill
With WASB access and required SerDe's loaded to the Drill servers, I can now setup the Hive Storage Plugin. Before doing this, I need to grab a few values from the config files of my HDInsight cluster. The easiest way to do this is to use Amabari:
- Open a browser to https://myclustername.azurehdinsight.net with appropriate substitution for myclustername
- At the prompt, login using the HTTP user name and password established when you configured the cluster
- Within the Ambari interface, go to Services | Hive | Configs | Advanced
- Under the General subheading, locate and record the values for hive.metastore.uris and hive.metastore.warehouse.dir
- Navigate to Services | HDFS | Configs | Advanced
- Under the Advanced core-site subheading, locate and record the value for fs.defaultFS
With these values recorded, you can now setup the storage plugin. In your Drill Web Console, navigate to Storage and Update the pre-existing hive plugin. Set the value to the following, pasting the values copied earlier into the appropriate locations (as highlighted in red in this sample):
Please note, the storage plugin uses fs.default.name as the key for the property you retrieved under the name of fs.defaultFS.
I then update and enable the storage plugin to save these settings.
Confirm Hive Access within Drill
With the plugin enabled, I should now be able to query Hive. If I simply flip over to the Query page in the Drill Web Console, I should be able to query the built-in sample table deployed with Azure HDInsight:
SELECT * FROM hive.`hivesampletable` LIMIT 10;