HDInsight Hive workloads under the covers


The earlier HDInsight under the covers post gave an overview of cluster creation and set-up. Apache Hive is one of the most popular components on Hadoop clusters. Hive defines a simple SQL-like query language that lets users familiar with SQL query data stored in HDFS. Under the covers, Hive translates the higher-level SQL query into a YARN application (MapReduce or Tez) for actual execution.
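To see which execution engine a query would be handed to, one option is to set Hive's `hive.execution.engine` property and ask for the plan with `EXPLAIN`. The sketch below only builds such a script as text, to be pasted into the Hive CLI or Beeline. The helper name `explain_with_engine` is hypothetical, and `hivesampletable` is assumed to be a table that exists on the cluster.

```python
VALID_ENGINES = ("mr", "tez")  # MapReduce or Tez, the two engines named in this post


def explain_with_engine(query: str, engine: str = "tez") -> str:
    """Return a HiveQL script that selects an engine and requests the query plan."""
    if engine not in VALID_ENGINES:
        raise ValueError(f"unsupported engine: {engine!r}")
    # Drop surrounding whitespace and a trailing semicolon so we can add our own.
    statement = query.strip().rstrip(";")
    # hive.execution.engine is the Hive property that picks the execution engine.
    return f"SET hive.execution.engine={engine};\nEXPLAIN {statement};"


if __name__ == "__main__":
    # hivesampletable is an assumed table name, used here only for illustration.
    print(explain_with_engine("SELECT COUNT(*) FROM hivesampletable;", "mr"))
```

Running the generated script against the same query with `mr` and then `tez` is a simple way to compare the plans the two engines produce.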

HDInsight ships in multiple versions; the Azure portal lists the ones currently available. Each cluster version ships with specific component versions. HDInsight uses the HDP (Hortonworks Data Platform) distribution. The table below lists the version mapping between HDInsight and HDP.

[Table image: HDInsight to HDP version mapping]

http://hortonworks.com/hdp/whats-new/ lists the specific component versions for each HDP distribution.

[Figure: HDInsight query workload stack]

The layered stack shows the popular HDInsight tools and how they interact with the components underneath.

 

As the layered stack shows, there are three high-level kinds of interaction:

  1. CLI: Hive and Beeline are the command-line tools used to experiment with and validate Hive queries.
  2. Batch: These interactions go through WebHCat. The HDInsight Jobs SDK is a .NET NuGet package built on top of WebHCat for programmatic query execution. It is the choice most customers make for production automation.
  3. Interactive: Workloads where latency is critical are served by HiveServer. The Microsoft Hive ODBC driver (or any HiveServer Thrift client) allows interaction with HiveServer.
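To make the batch path more concrete, here is a minimal sketch of the kind of REST call that sits underneath a WebHCat-based client. It builds, but does not send, a POST to WebHCat's `hive` resource with the `execute` and `statusdir` form fields. The cluster URL pattern, the use of HTTP basic authentication, and the function name `build_hive_job_request` are assumptions for illustration, not a description of how the HDInsight Jobs SDK is implemented.

```python
import base64
import urllib.parse
import urllib.request


def build_hive_job_request(cluster: str, user: str, password: str,
                           query: str, status_dir: str) -> urllib.request.Request:
    """Build (but do not send) a WebHCat POST that submits a Hive query."""
    # Assumed endpoint layout: WebHCat (Templeton) exposed through the cluster gateway.
    url = f"https://{cluster}.azurehdinsight.net/templeton/v1/hive"
    form = urllib.parse.urlencode({
        "user.name": user,        # user the job runs as
        "execute": query,         # HiveQL text to run
        "statusdir": status_dir,  # directory where job status and output are written
    }).encode("utf-8")
    # Assumed authentication scheme: HTTP basic auth with the cluster credentials.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, data=form, method="POST")
    request.add_header("Authorization", f"Basic {token}")
    return request


if __name__ == "__main__":
    # All values below are placeholders.
    req = build_hive_job_request("mycluster", "admin", "secret",
                                 "SELECT COUNT(*) FROM hivesampletable;",
                                 "/example/status")
    print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would submit the job. Because the call is asynchronous, a client would then typically poll WebHCat for the job's status rather than wait on the POST itself.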

 

In the next blog posts, I will cover the interactions of batch and interactive workloads in detail.

 

References

Azure HDInsight Job Management Library

Visual Studio Hadoop tools for HDInsight

Microsoft Hive ODBC Driver

Hive language manual

WebHCat REST Reference

HDP details

