Knowing the types and functions of nodes in HDInsight is key to taking full advantage of the service. This article is aimed at users who are familiar with big data concepts but are newer to HDInsight; even if you’re beyond that target audience, please feel free to read on and share your feedback. There are already plenty of informative articles publicly available on various HDInsight components and architectures; in this article, I’ll give some introductory information and point you to links where you can get even more detail.
What is HDInsight?
From the Azure website - "HDInsight is the only fully-managed cloud Hadoop offering that provides optimized open source analytic clusters for Spark, Hive, MapReduce, HBase, Storm, Kafka, and R Server backed by a 99.9% SLA. Each of these big data technologies and ISV applications are easily deployable as managed clusters with enterprise-level security and monitoring." To learn more about how Hadoop components from the Hortonworks Data Platform (HDP) distribution are integrated with Azure, and about the Hadoop ecosystem in HDInsight in general, please refer to this short introduction.
Different types of clusters in HDInsight
Currently, users have the option to create the following types of HDInsight clusters – Hadoop (Hive), HBase, Storm, Spark, R, Kafka, and Hive LLAP.
Each cluster type has advantages over the others for certain workloads. I believe the best way to learn something is to explore. You can get a free trial Azure subscription and try out the different cluster types. Once you have an Azure subscription, you can start exploring by creating a cluster via the Azure portal.
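If you prefer the command line to the portal, a cluster can also be created with the Azure CLI. Here is a minimal sketch, not a definitive recipe: the resource group, cluster, and storage account names below are placeholders, and exact parameter names can vary between CLI versions, so check `az hdinsight create --help` on your own installation.

```shell
# Sketch: create a Spark HDInsight cluster with the Azure CLI.
# All names and passwords below are placeholders -- replace them
# with your own values before running.
az group create --name MyResourceGroup --location westus2

az hdinsight create \
    --name mycluster \
    --resource-group MyResourceGroup \
    --type spark \
    --http-user admin \
    --http-password 'Replace-With-A-Strong-Password1!' \
    --ssh-user sshuser \
    --ssh-password 'Replace-With-A-Strong-Password1!' \
    --storage-account mystorageaccount \
    --workernode-count 4
```

The `--type` parameter corresponds to the cluster types listed above (e.g. `hadoop`, `hbase`, `storm`, `spark`, `kafka`).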
Different types of nodes in HDInsight cluster
HDInsight clusters consist of several virtual machines (nodes) serving different purposes. The most common architecture of an HDInsight cluster is – two head nodes, one or more worker nodes, and three zookeeper nodes.
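One handy detail when working with these nodes: the role of each node is reflected in its hostname prefix – head nodes are `hn0`/`hn1`, worker nodes are `wn0`, `wn1`, …, and Zookeeper nodes are `zk0`–`zk2`. As a small illustrative sketch (the host list below is made up, shaped like the typical two-head/three-Zookeeper layout just described), you can group a cluster's hosts by that prefix:

```python
# Group HDInsight hostnames by node role using the conventional
# prefixes: hn = head node, wn = worker node, zk = Zookeeper node.
from collections import defaultdict

ROLE_PREFIXES = {"hn": "head", "wn": "worker", "zk": "zookeeper"}

def group_by_role(hostnames):
    """Map each role name to the list of hosts carrying that prefix."""
    roles = defaultdict(list)
    for host in hostnames:
        role = ROLE_PREFIXES.get(host[:2], "other")
        roles[role].append(host)
    return dict(roles)

# Hypothetical host list for a cluster with two head nodes,
# four worker nodes, and three Zookeeper nodes.
hosts = ["hn0-myclus", "hn1-myclus",
         "wn0-myclus", "wn1-myclus", "wn2-myclus", "wn3-myclus",
         "zk0-myclus", "zk1-myclus", "zk2-myclus"]

print({role: len(names) for role, names in group_by_role(hosts).items()})
# → {'head': 2, 'worker': 4, 'zookeeper': 3}
```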
Head nodes: Hadoop services are installed and run on the head nodes. There are two head nodes to ensure high availability: master services and components can continue to run on the secondary head node if the primary fails. Both head nodes are active and running within the cluster simultaneously. Some services, such as HDFS or YARN, are 'active' on only one head node at any given time (and 'standby' on the other); other services, such as HiveServer2 or the Hive Metastore, are active on both head nodes at the same time. There are also services, like the Application Timeline Server (ATS) and the Job History Server (JHS), which are installed on both head nodes but should run only on the head node where the Ambari server is running. If these components sound unfamiliar, please revisit the article on the Hadoop ecosystem in HDInsight.
Worker nodes: Worker nodes (also known as data nodes) perform the actual data analysis when a job is submitted to the cluster. If a worker node fails, the task it was performing is transferred to another worker node. Users can choose any number of worker nodes to provision during cluster creation, and can later scale up or down as needed.
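Scaling the worker node count after creation can be done from the portal or, as a sketch, with the Azure CLI. The cluster and resource group names here are placeholders, and the parameter name may differ between CLI versions (older releases used `--target-instance-count`), so verify with `az hdinsight resize --help`.

```shell
# Sketch: scale an existing HDInsight cluster to 8 worker nodes.
# Cluster and resource-group names are placeholders.
az hdinsight resize \
    --name mycluster \
    --resource-group MyResourceGroup \
    --workernode-count 8
```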
Zookeeper nodes: Zookeeper nodes (ZKs) are used for leader election of master services on the head nodes, and to ensure that services, worker (data) nodes, and gateways know which head node a master service is active on. By default, HDInsight provides three Zookeeper nodes. To learn more about high availability in HDInsight, please read this article on the availability and reliability of Hadoop clusters in HDInsight.
Gateway nodes: Besides these nodes, there are two Gateway nodes for management and security. Users do not have access to the Gateway nodes. HDInsight is implemented as several Azure Virtual Machines (the nodes within the cluster) running on an Azure Virtual Network (VNet). Nodes within the same virtual network can access each other, but from outside the VNet, web access is allowed only through the Gateway nodes. By default, users have SSH access only to the head nodes, and they manage their cluster through the Ambari Web UI. Connecting to Ambari on HDInsight requires HTTPS, and all HTTPS requests route through the Gateways, which handle user authentication and request forwarding. Users do not need to manage the Gateway nodes through Ambari, so they are hidden from view. This is a view from Ambari -> Hosts, which shows all node types except the Gateway nodes –
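Because all HTTPS traffic routes through the gateways, Ambari's REST API is reached at the cluster's public endpoint rather than at an individual head node. Here is a small stdlib-only sketch, assuming the usual `https://<clustername>.azurehdinsight.net` endpoint; the cluster name and admin credentials are placeholders, and the network call itself is only defined, not executed, since it requires a live cluster:

```python
# Build the URL for Ambari's hosts endpoint on an HDInsight cluster.
# All HTTPS requests are routed through the Gateway nodes at
# https://<clustername>.azurehdinsight.net.
import base64
import json
import urllib.request

def ambari_hosts_url(cluster_name):
    """Ambari REST endpoint that lists the cluster's hosts."""
    return ("https://{0}.azurehdinsight.net"
            "/api/v1/clusters/{0}/hosts").format(cluster_name)

def list_hosts(cluster_name, admin_user, admin_password):
    """Fetch host names from Ambari (requires a live cluster)."""
    req = urllib.request.Request(ambari_hosts_url(cluster_name))
    token = base64.b64encode(
        "{}:{}".format(admin_user, admin_password).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return [item["Hosts"]["host_name"] for item in data["items"]]

print(ambari_hosts_url("mycluster"))
# → https://mycluster.azurehdinsight.net/api/v1/clusters/mycluster/hosts
```

On a running cluster you would call `list_hosts("mycluster", "admin", "<password>")` with the HTTP credentials chosen at cluster creation.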
Edge nodes: Users also have the option of creating edge nodes for accessing the cluster, and for testing and hosting client applications. An empty edge node is a Linux virtual machine with the same client tools installed and configured as on the head nodes, but with no Hadoop services running. An edge node does not actively participate in data analysis within the cluster, but it lives in the same Azure Virtual Network as the other nodes and can directly access all of them. Since it is not involved in analyzing data for the cluster, it can be used without any concern of diverting resources away from critical Hadoop services or analysis jobs.
An HDInsight architecture illustrating these node types can look like this –
Depending on the type of the cluster, the nodes can have different names. For example, HBase clusters have the concept of region servers and HBase masters, and in Storm clusters the head nodes are known as Nimbus nodes and the worker nodes as supervisor servers. Here’s an article describing the different cluster types and their architectures in detail.
Now that you know the basics of HDInsight and are familiar with the usual components, what can you do? As I said earlier, the best way to learn more is to explore. There are a few good articles on getting started with Hive, Spark, HBase, Storm, R, Interactive Hive, and Kafka, which you should feel free to explore. For Spark, we also have our Spark debugging 101 series, which we plan to keep updating. You should also check out this article containing details of which HDInsight components are officially supported and which are deprecated. Finally, I recommend keeping an eye on the release notes for Hadoop components on Azure HDInsight, which are regularly updated.