How to access Hive using JDBC on HDInsight

While following up on a customer question recently on this topic, I realized that we have seen the same question coming up from other users a few times and thought I would share a simple example here on how to connect to HiveServer2 on Azure HDInsight using JDBC. For background, please review the apache wiki and the Cloudera blog on the architecture and the benefits of HiveServer2 for applications connecting to Hive remotely via ODBC, JDBC etc. There are also some good articles like this one which shows a step-by-step example for an on-premise Hadoop cluster. For an Azure HDInsight cluster, it's worth noting the following points –

  1. HiveServer2 service has been started as part of cluster provisioning and is running on the active Headnode of the cluster, as shown below –

    You can verify this by RDP-ing to the active Headnode of the Azure HDInsight cluster. If you don't see the HiveServer2 service running or if you find it disabled, please review this blog.

  2. HiveServer2 is running in HTTP mode, on port 10001 and can be verified from hive-site.xml configuration file located under %HIVE_HOME%\conf folder, as shown below –

    But, as explained in this blog, HDInsight is a managed cloud service and is secured via a gateway which exposes HiveServer2 endpoint (and other endpoints) on port 443. So, from your workstation, you may not be able to connect to Hiveserver2 directly on port 10001, rather client applications make a secure connection to port 443 and the gateway redirects to HiveServer2 on port 10001. So, a JDBC connection string for HDInsight may look something like this-

    jdbc:hive2://myClusterName.azurehdinsight.net:443/default;ssl=true?hive.server2.transport.mode=http;hive.server2.thrift.http.path=/hive2 

With that said, I am going to show a simple example of accessing HiveServer2 via JDBC from a JAVA application, using Maven. If you have already used Hiveserver2 using JDBC for an on-premise Hadoop cluster, you can skip the TLDR part below :-), you can just review the code sample below and see the difference in Hive JDBC connection string for Azure HDInsight. For others who like details, please review the steps below -

Developing a Java Application to access Hive via JDBC:

  1. It is assumed that Maven is installed on your workstation. Ensure that maven is added to your PATH environment variable. Open a command prompt on your workstation and change folder to where you wish to create the project. For example, cd C:\Maven\MavenProjects

  2. Use the mvn command, to generate the project template, as shown below –

    mvn archetype:generate -DgroupId=com.microsoft.css -DartifactId=HiveJdbcTest -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false 

    This will create the src directory and POM.xml in the directory HiveJdbcTest (same as artifactId)

  3. Open the POM.xml with your favorite Java IDE, I have used Intellij. You can review this blog for more details with screenshots on how to use Maven with IntelliJ or Eclipse.

  4. Open IntelliJ -> Import Project -> browse to POM.xml for your project created in step 2

    In intelliJ, delete the default class created by IDE called app and delete the test folder if you don't plan to use it. Create a Java class called MyHiveJdbcTest

  5. Modify the POM.xml to something like this –

    NOTE on POM.xml:

    a. How do we know which dependency jars we need to add? This is kind of a trial and error. I typically go to https://mvnrepository.com and do full text search and start with the ones that seem relevant. In this case, I first started with hive-jdbc JAR, at this point, my code compiles, but I still got run time error like ClassNotFoundException - then I added hadoop-common JAR. I was still getting 'Connection RESET' error from JDBC - I then added hive-exec and other dependency JARs. Also I have tried with Hive 0.13 and 0.14 JARs and both versions worked.

    b. Even though it was not necessary, I added an entry for making a shade plugin under <build>, to make it an 'uber' (or fat) JAR - this is useful when the versions of dependency JARs in the runtime environment is not predictable or keeps changing. Uber JAR contains all dependencies in itself and hence is not dependent on dependency JARs' versions in the runtime environment.

  6. Add the following code in the class MyHiveJdbcTest –

  7. Build the project using maven command line:

    cd c:\Maven\MavenProjects\HiveJdbctest
    mvn clean package 

  8. Once it builds successfully, we can step through and debug the code in IDE.

    We can also make it an executable JAR as shown in this blog.

I hope you find the blog useful! Please feel free to share your feedback.