HDInsight under the covers

Azure HDInsight provisions and manages Apache Hadoop clusters in the Azure cloud. HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution. The term Hadoop often refers to the entire Hadoop ecosystem of components, which includes Apache MapReduce, Apache Hive, Apache HBase, Apache Spark, and Apache Storm, as well as other technologies under the Hadoop umbrella.

Azure HDInsight delivers multiple cluster types targeted at different scenarios. Each cluster type is a composition of a set of Apache Hadoop components, such as Spark, Storm, or HBase. More details on the available cluster types can be found here.

In this post, let's walk through the experience of a Hadoop Linux cluster type. We will cover creating a Hadoop cluster and how that cluster manifests at the node and component level. The Hadoop cluster creation experience in the Azure Portal looks like the following (this is a current snapshot only and might change in the future):

PortalCreateClusterAtNodeLevel

As shown above, the resulting cluster is set up with four types of nodes:

  1. Gateway: The HTTPS gateway to the cluster. It presents the SSL certificate, validates cluster credentials, and acts as a reverse proxy for a few of the Hadoop services running on the cluster.
  2. Zookeeper: Hosts Apache ZooKeeper, which the cluster uses for distributed coordination.
  3. HeadNode: Nodes that host critical Hadoop services such as ResourceManager, JobHistory Server, Application Timeline Server, Hive metastore, Oozie, etc. These services vary by cluster type.
  4. WorkerNode: Can be treated as a DataNode. These nodes are the workers for YARN.

All HDInsight Linux clusters ship with Apache Ambari for cluster management, which can be accessed by going to https://<clusterdnsname>.azurehdinsight.net in a browser. It looks like this:

AmbariDashboard

The left sidebar shows the installed services/components. The middle section shows metrics about the cluster. Clicking "Hosts" in the top bar shows something like this:

AmbariHostsScreen

"Hosts" shows all the nodes that Ambari manages (Gateway nodes are not managed by Ambari, so they are not shown here). You can see node details such as IP address, cores, RAM, HDP components, etc.

Node names have prefixes that indicate the role they belong to. The role prefixes are:

hn{num}-… Head node

wn{num}-… Worker node

zk{num}-… Zookeeper node
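The prefix convention above can be turned into a small lookup. The following is a minimal sketch, assuming host names start with one of the three prefixes followed by a number and a hyphen; the function name `node_role` and the sample host names are hypothetical and used only for illustration.

```python
import re

# Prefix-to-role mapping taken from the list above.
ROLE_BY_PREFIX = {
    "hn": "head node",
    "wn": "worker node",
    "zk": "zookeeper node",
}


def node_role(hostname: str) -> str:
    """Return the role implied by a prefix such as 'hn0-' or 'wn12-'.

    Hosts that do not follow the convention (for example gateway nodes,
    which Ambari does not manage) are reported as 'unknown'.
    """
    match = re.match(r"^(hn|wn|zk)(\d+)-", hostname)
    if match is None:
        return "unknown"
    return ROLE_BY_PREFIX[match.group(1)]
```

For example, `node_role("wn3-sample")` would report a worker node, while a name without a recognized prefix falls through to `"unknown"`.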

 

 

Clicking through to an individual node shows the services/components installed on that node as well as metrics for that node. A sample screenshot is below:

AmbariHeadnodeScreen

Accessing Cluster

Out of the box, HDInsight clusters allow only HTTPS (port 443) and SSH (ports 22 and 23) communication to the cluster. There is no other way to communicate with Linux clusters (except for clusters in a Virtual Network).

HTTPS access: The endpoint is https://<clusterdnsname>.azurehdinsight.net with basic authentication. This is the endpoint used by the HDInsight jobs SDK. It is a load-balanced endpoint, and a request can be routed to any gateway instance.
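As an illustration of what a basic-authentication call against this endpoint might look like, here is a minimal sketch that only builds the request and does not send it. The path `/api/v1/clusters` is an assumption (it is the Ambari REST API root for listing clusters), and the cluster name and credentials in the usage comment are placeholders.

```python
import base64
import urllib.request


def build_cluster_request(cluster_dns_name: str, user: str, password: str,
                          path: str = "/api/v1/clusters") -> urllib.request.Request:
    """Build (but do not send) an HTTPS request carrying a basic-auth header.

    The default path is an assumption for illustration; any service exposed
    through the gateway would be addressed the same way.
    """
    url = f"https://{cluster_dns_name}.azurehdinsight.net{path}"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Sending it requires a real cluster and valid credentials, for example:
# with urllib.request.urlopen(build_cluster_request("mycluster", "admin", "secret")) as resp:
#     print(resp.status)
```

Because the endpoint is load balanced, successive requests may be served by different gateway instances, so a client should not rely on per-gateway state.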

 

SSH access: The endpoint is <clusterendpoint>-ssh.azurehdinsight.net with SSH credentials. This endpoint connects directly to the head nodes and is not load balanced. Port 22 is mapped to hn0-... and port 23 is mapped to hn1-.... (Note: the suffixes 0 and 1 are used for illustration only and might vary.)
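The port-to-head-node mapping can be sketched as a helper that assembles the `ssh` command line. This is only an illustration under the mapping described above; the function name, the notion of a "slot", and the sample user name are hypothetical.

```python
from typing import List


def ssh_command(cluster_endpoint: str, ssh_user: str, head_node_slot: int = 0) -> List[str]:
    """Return an ssh argument list targeting one of the two head nodes.

    Slot 0 uses port 22 and slot 1 uses port 23, following the mapping
    described above. The actual hn suffix behind each port might vary.
    """
    ports = {0: 22, 1: 23}
    if head_node_slot not in ports:
        raise ValueError("head_node_slot must be 0 or 1")
    host = f"{cluster_endpoint}-ssh.azurehdinsight.net"
    return ["ssh", "-p", str(ports[head_node_slot]), f"{ssh_user}@{host}"]
```

Passing the resulting list to `subprocess.run` would open the session; it is kept as a plain list here so the mapping can be inspected without connecting.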

 

How are HDInsight clusters different from others? Here are a few differences:

  1. Decoupled storage (HDFS) and compute: HDInsight uses an Azure Storage-backed HDFS implementation called WASB. This makes it possible to delete clusters to save costs while preserving the data.
  2. Secured with HTTPS basic authentication: By default, only the HTTPS endpoint is accessible, apart from SSH.
  3. All major Hadoop services are configured to run in HTTP mode.
  4. SSH is hosted on a different FQDN from the web services: <clusterendpoint>-ssh.azurehdinsight.net rather than <clusterdnsname>.azurehdinsight.net.
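To make the decoupled storage point more concrete, data in WASB is addressed by a URI that names the storage container and account rather than a cluster-local NameNode. The sketch below builds such a URI; the helper name and the sample container, account, and path values are hypothetical.

```python
def wasb_uri(container: str, storage_account: str, path: str = "/") -> str:
    """Build a wasb:// URI of the form wasb://<container>@<account>.blob.core.windows.net/<path>.

    The data lives in the storage account, so a URI like this stays valid
    even after the cluster that wrote the data has been deleted.
    """
    normalized = "/" + path.lstrip("/")
    return f"wasb://{container}@{storage_account}.blob.core.windows.net{normalized}"
```

A newly created cluster attached to the same storage account could read the same location again, which is what allows clusters to be deleted without losing data.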

 

FAQ:

 

Are all Hadoop services available through the Gateway?

HDInsight tries to cover the common services/scenarios. There might still be a few links in Ambari that do not work through the gateway; these can be accessed by SSH tunneling.
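One common way to set up such a tunnel is OpenSSH dynamic port forwarding, which exposes a local SOCKS proxy that a browser can be pointed at. The following is a minimal sketch that only assembles the command; the local port 9876, the function name, and the sample user name are assumptions for illustration.

```python
from typing import List


def socks_tunnel_command(cluster_endpoint: str, ssh_user: str,
                         local_port: int = 9876) -> List[str]:
    """Return an ssh argument list that opens a SOCKS proxy on localhost.

    -D enables dynamic (SOCKS) port forwarding on the given local port.
    -N tells ssh not to run a remote command, since only the tunnel is needed.
    """
    host = f"{cluster_endpoint}-ssh.azurehdinsight.net"
    return ["ssh", "-N", "-D", str(local_port), f"{ssh_user}@{host}"]
```

With the tunnel running, configuring the browser to use `localhost` and the chosen port as a SOCKS proxy should let cluster-internal links resolve from the head node's point of view.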

 

 

nslookup shows an IP address that is not shown in Ambari

nslookup resolves to the Azure load balancer IP addresses, which in turn route to the Gateway nodes. So these IP addresses don't show up in Ambari.
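The same check can be scripted. The sketch below resolves the cluster name and reports whether the result appears in a list of IP addresses copied from the Ambari "Hosts" page; the function name is hypothetical, and the resolver is passed in as a parameter so the logic can be exercised without network access.

```python
import socket
from typing import Callable, Iterable, Tuple


def resolves_to_ambari_host(hostname: str, ambari_ips: Iterable[str],
                            resolve: Callable[[str], str] = socket.gethostbyname
                            ) -> Tuple[str, bool]:
    """Resolve hostname and report whether the IP is one that Ambari lists.

    For the public cluster endpoint the expected answer is False, because
    the name resolves to a load balancer in front of the gateway nodes.
    """
    ip = resolve(hostname)
    return ip, ip in set(ambari_ips)
```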