Nodes in HDInsight

Knowing the types and functions of nodes in HDInsight is key to taking full advantage of the service. This article is aimed at users who are familiar with big data concepts but are newer to HDInsight, though feedback is welcome even if you are beyond that audience. Plenty of informative articles on various HDInsight components and architectures are already publicly available; here I’ll give some introductory information and point you to links where you can get more details.

What is HDInsight?

From the Azure website - "HDInsight is the only fully-managed cloud Hadoop offering that provides optimized open source analytic clusters for Spark, Hive, MapReduce, HBase, Storm, Kafka, and R Server backed by a 99.9% SLA. Each of these big data technologies and ISV applications are easily deployable as managed clusters with enterprise-level security and monitoring." To learn more about how Hadoop components from the Hortonworks Data Platform (HDP) distribution are integrated with Azure, and about the Hadoop ecosystem in HDInsight in general, please refer to this short introduction.

Different types of clusters in HDInsight

Currently, users can create the following types of HDInsight clusters: Hadoop (Hive), HBase, Storm, Spark, R, Kafka, and Hive LLAP.

Cluster types

Each type has advantages over the others for certain kinds of work. I believe the best way to learn something is to explore. You can get a free trial Azure subscription and try out the different clusters. Once you have a subscription, you can start exploring by creating a cluster via the Azure portal.

Different types of nodes in an HDInsight cluster

HDInsight clusters consist of several virtual machines (nodes) serving different purposes. The most common architecture of an HDInsight cluster is two head nodes, one or more worker nodes, and three Zookeeper nodes.

Head nodes: Hadoop services are installed and run on head nodes. There are two head nodes to ensure high availability: if the primary fails, master services and components continue to run on the secondary. Both head nodes are active and running within the cluster simultaneously. Some services, such as HDFS or YARN, are 'active' on only one head node at any given time (and 'standby' on the other head node). Other services, such as HiveServer2 or Hive Metastore, are active on both head nodes at the same time. Services like Application Timeline Server (ATS) and Job History Server (JHS) are installed on both head nodes but should run only on the head node where Ambari server is running. If these components sound unfamiliar, please revisit the article on the Hadoop ecosystem in HDInsight.

Worker nodes: Worker nodes (also known as data nodes) perform the actual data analysis when a job is submitted to the cluster. If a worker node fails, the task that it was performing will be transferred to another worker node. Users can choose any number of worker nodes to provision during cluster creation and later scale up or down depending on the need.

Zookeeper nodes: Zookeeper nodes (ZKs) are used for leader election of master services on head nodes, and to ensure that services, data (worker) nodes, and gateways know which head node a master service is active on. By default, HDInsight provides 3 Zookeeper nodes. Please read this article on availability and reliability of Hadoop clusters in HDInsight to learn more about high availability in HDInsight.
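To make the active/standby idea concrete, here is a toy in-memory sketch in Python. It is not how Zookeeper or HDInsight implements leader election; it only models the behavior described above, where one head node is active for a service and the other takes over on failure. The class name `ToyCoordinator`, the host names `hn0` and `hn1`, and the rule "first registered live node is active" are all assumptions made for illustration.

```python
class ToyCoordinator:
    """Toy stand-in for the coordination role the Zookeeper nodes play.

    Assumption for illustration: the first head node that registered
    for a service, and is still alive, is treated as the active one.
    """

    def __init__(self):
        # service name -> head nodes in registration order
        self._candidates = {}

    def register(self, service, head_node):
        """A head node announces that it can run the given master service."""
        self._candidates.setdefault(service, []).append(head_node)

    def report_failure(self, service, head_node):
        """Drop a failed head node so the next candidate becomes active."""
        nodes = self._candidates.get(service, [])
        if head_node in nodes:
            nodes.remove(head_node)

    def active(self, service):
        """Return the head node currently active for the service, if any."""
        nodes = self._candidates.get(service, [])
        return nodes[0] if nodes else None


coordinator = ToyCoordinator()
coordinator.register("yarn-resourcemanager", "hn0")  # hypothetical host names
coordinator.register("yarn-resourcemanager", "hn1")

print(coordinator.active("yarn-resourcemanager"))  # hn0
coordinator.report_failure("yarn-resourcemanager", "hn0")
print(coordinator.active("yarn-resourcemanager"))  # hn1
```

Worker nodes and gateways would ask the coordinator which head node is active rather than hard-coding one, which is why a failover does not require reconfiguring them.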

Gateway nodes: Besides these nodes, there are 2 Gateway nodes for management and security. Users do not have access to the Gateway nodes. An HDInsight cluster is implemented as several Azure Virtual Machines (the nodes within the cluster) running on an Azure Virtual Network (VNet). Nodes residing in the same virtual network can reach each other, but from outside the VNet, web access is allowed only through the Gateway nodes. By default, users have SSH access only to the head nodes. Users manage their cluster through the Ambari Web UI. Connecting to Ambari on HDInsight requires HTTPS, and all HTTPS requests are routed through the Gateways, which handle user authentication and request forwarding. Users do not need to manage the Gateway nodes through Ambari, so they are hidden from users. The view below, from Ambari -> Hosts, shows the other node types but not the Gateway nodes:

Nodes view from Ambari -> Hosts
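The same host list can also be read programmatically, since Ambari exposes a REST API that is reached over HTTPS through the Gateway. The sketch below is a minimal example under stated assumptions: the URL pattern `https://CLUSTERNAME.azurehdinsight.net/api/v1/clusters/CLUSTERNAME/hosts` and the host-name prefixes (`hn`, `wn`, `zk`, `ed`) are not taken from this article and should be checked against your own cluster, and the helper names are hypothetical. No network call is made here; the functions only build the URL and summarize a response that has already been fetched.

```python
from collections import Counter

# Assumed host-name prefixes; verify against the Hosts view of your cluster.
ROLE_BY_PREFIX = {
    "hn": "head",
    "wn": "worker",
    "zk": "zookeeper",
    "ed": "edge",
}


def hosts_url(cluster_name):
    """Build the (assumed) Ambari REST URL that lists the cluster's hosts."""
    base = f"https://{cluster_name}.azurehdinsight.net"
    return f"{base}/api/v1/clusters/{cluster_name}/hosts"


def node_role(host_name):
    """Map a host name to a node type using its two-letter prefix."""
    return ROLE_BY_PREFIX.get(host_name[:2].lower(), "unknown")


def count_roles(hosts_response):
    """Count node types in a parsed JSON response from the hosts endpoint.

    Expects the Ambari shape {"items": [{"Hosts": {"host_name": ...}}, ...]}.
    """
    items = hosts_response.get("items", [])
    return Counter(node_role(item["Hosts"]["host_name"]) for item in items)
```

With an HTTP client of your choice you would send an authenticated GET request to `hosts_url("mycluster")` using the cluster login credentials and pass the parsed JSON to `count_roles`. Gateway nodes should not appear in the result, for the same reason they are hidden in the Ambari UI.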

Edge nodes: Users can optionally create edge nodes for accessing the cluster and for testing and hosting client applications. An empty edge node is a Linux virtual machine with the same client tools installed and configured as on the head nodes, but with no Hadoop services running. An edge node does not actively participate in data analysis within the cluster, but it lives in the same Azure Virtual Network as the other nodes in the cluster and can directly access all of them. Since it is not involved in analyzing data for the cluster, it can be used without any concern about diverting resources away from critical Hadoop services or analysis jobs.

An HDInsight architecture showing these nodes can look like this:

Nodes in HDInsight Cluster

Depending on the type of cluster, the nodes can have different names. For example, in HBase clusters there is a concept of region servers and HBase masters, and in Storm clusters head nodes are known as Nimbus nodes and worker nodes are known as supervisor servers. Here’s an article describing the different clusters and their architecture in detail.

Now that you know the basics of HDInsight and are familiar with the usual components, what can you do? As I said earlier, the best way to learn more is to explore. There are a few good articles on getting started with Hive, Spark, HBase, Storm, R, Interactive Hive, and Kafka, which you should feel free to explore. For Spark, we also have our Spark debugging 101 series, which we plan to keep updating. You should also check out this article detailing which HDInsight components are officially supported and which are deprecated. Finally, I would recommend keeping an eye on the release notes for Hadoop components on Azure HDInsight, which are regularly updated.


Comments (6)
  1. mikelor says:

    Good stuff. This *is* helpful. I just wanted to give some feedback to encourage more of this. :)

    1. Thanks mikelor. Glad to know that this is helpful.

  2. Soma Ghosal says:

    Hi Abdullah

    Thank you for the nice article. I am new to HDInsight and found it helpful.

    I have a doubt, it will be great if you can help.

    As you mentioned, I thought Spark clusters on HDInsight need 3 Zookeeper nodes, however:

    1. I could not see any Zookeeper node on the Azure online pricing calculator.

    2. While creating a Spark cluster through Azure portal, there was no option to specify Zookeeper node size.

    3. Following the cluster creation, I could see 3 Zookeeper nodes running from the Ambari dashboard.

    4. I saw Azure documentation mentions A1 Zookeeper nodes for Spark are free. However the nodes in my case are A2 (2 core, 3.5GB RAM).

    So my questions are:

    1. Are Zookeeper nodes not chargeable for HDInsight Spark clusters?

    2. Is not there any option to change the default Zookeeper node size in this case?

    Thank you in advance.


    1. Hi Soma,
      You are right, users are not charged for Zookeeper nodes. For most cluster types, including Spark, the Zookeeper nodes do not usually need much RAM, as they are responsible for only a few activities, such as maintaining an in-memory image of state, along with transaction logs and snapshots in a persistent store, in order to coordinate distributed processes. In HBase clusters the Zookeeper nodes are also used as HBase Masters and therefore need more processing power. From the Azure portal you cannot specify the configuration of the Zookeeper nodes for a Spark cluster, but you can for an HBase cluster.

      1. Soma Ghosal says:

        Thank you very much for the clarification Abdullah.

  3. Haritrichy says:

    In an HDInsight standard cluster, where are the local user credentials (for gateway layer authentication) stored?

