Just enough Azure for Hadoop - Part 1

Motivation for this blog...
On my last day at my former workplace, where I mostly worked on customer Hadoop projects on AWS, a colleague got a project that involved provisioning an IaaS Hadoop cluster on Azure. We were stumped and scrambling to figure it out: there was no guide with just enough information about Azure to get us going, and the Azure documentation seemed overwhelming.  One of my goals since starting at Microsoft has been to create such a guide...so, here goes...

This blog is part of a series and covers ..
Relevant Azure fundamentals - concepts/terminology you need to know, in the context of Hadoop.  Some of the content is a copy of Azure documentation (full credit to the Azure documentation team), compiled into a single post along with my commentary, to serve as a one-stop shop for those new to Azure and thinking Hadoop.

I enjoyed writing this blog series very much and hope you derive some value out of it - at the least, a nap :).

Here's what's covered in the post:
Section 01:  Azure subscription and portal
Section 02:  Azure regions and services
Section 03:  Identity and access management
Section 04:  Azure networking

Here are links to the rest of the blog series-
Just enough Azure for Hadoop - Part 2 | Focuses on storage
Just enough Azure for Hadoop - Part 3 | Focuses on compute
Just enough Azure for Hadoop - Part 4 | Focuses on select Azure Data Services (PaaS)

1. Azure Subscription & Portal

1.1. Azure Subscription
This is a pre-requisite to working on Azure - you need a subscription to use Azure.
A subscription has a unique identifier.
The following is a link to create an Azure account/subscription.
https://azure.microsoft.com/en-us/free/?cdn=disable

As part of provisioning clusters on Azure via Hortonworks Cloudbreak or Cloudera Director, you need to provide Azure credentials.  This is one component of the set of information needed by these cloud provisioning tools.

1.2. Azure portal
This is the web-based GUI for Azure.
portal.azure.com

2. Azure Regions and Services

2.1. Azure Regions
A region is an Azure availability location.
Listing of regions

This is an important consideration when you plan to host applications on Azure.

2.2. Azure Services
An Azure service refers to an offering on Azure - compute, networking, storage, databases etc.
https://azure.microsoft.com/en-us/services/

2.3. Services by Regions
Azure services are launched world-wide or by regions.  The link below details services availability by region.
https://azure.microsoft.com/en-us/regions/services/

2.4. Region-pairing for DR
For services that offer geo-redundancy, there is the concept of paired datacenters (regions): each region has a single default pairing.
E.g. East US is paired with West US, and East US 2 with Central US
Documentation

2.5. Ingress/Egress
Ingress: Refers to data coming into Azure datacenters.  There is no charge for ingress.
Egress: Refers to data going out of Azure datacenters.  There is a charge for this as detailed here.
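
For back-of-the-envelope planning, the ingress/egress rule can be sketched as below. The flat per-GB rate is a placeholder I made up for illustration; actual egress pricing is tiered and varies by region, so plug in the figure from the pricing page.

```python
# Hypothetical flat rate, for illustration only; real egress pricing is
# tiered and region-dependent, so replace this with the published figure.
ASSUMED_EGRESS_USD_PER_GB = 0.08

def transfer_cost_usd(ingress_gb: float, egress_gb: float,
                      egress_rate: float = ASSUMED_EGRESS_USD_PER_GB) -> float:
    """Ingress is free; only the egress volume is billed."""
    if ingress_gb < 0 or egress_gb < 0:
        raise ValueError("volumes must be non-negative")
    return round(egress_gb * egress_rate, 2)

# Loading 10 TB into the cluster and pulling 500 GB of results back out:
cost = transfer_cost_usd(ingress_gb=10_000, egress_gb=500)
print(cost)
```

The point of the sketch is only that the size of the data you load does not show up in the bill, while the size of what you pull back out does.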

3. Identity and Access Management

3.1. Azure Active Directory (AAD)
Azure Active Directory (Azure AD) is Microsoft's multi-tenant, cloud based directory and identity management service. Azure AD combines core directory services, advanced identity governance, and application access management. Azure AD also offers a rich, standards-based platform that enables developers to deliver access control to their applications, based on centralized policy and rules.  Documentation

From a Hadoop perspective, this service matters for identity and access management (IAM) if your organization decides to leverage it.  Even if it doesn't, you still need Azure AD to drive role-based access control (RBAC) for your Azure resources.

3.2. Azure Tenant
In Azure Active Directory (Azure AD), a tenant is representative of an organization. It is a dedicated instance of the Azure AD service that an organization receives and owns when it signs up for a Microsoft cloud service such as Azure, Microsoft Intune, or Office 365. Each Azure AD tenant is distinct and separate from other Azure AD tenants.  An Azure tenant has a unique identifier.

Let's say you want to use Cloudera Director/Hortonworks Cloudbreak: you will need your Azure tenant ID to set up credentials for cluster provisioning.
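
Subscription and tenant identifiers are GUIDs, so a quick sanity check before pasting them into a provisioning tool can save a failed run. The sketch below is a minimal example; the dictionary keys are names I picked for illustration, not field names from Cloudera Director or Cloudbreak, and the values are placeholders.

```python
import uuid

def is_guid(value: str) -> bool:
    """Return True if value is a GUID in canonical hyphenated form."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False

# Placeholder values, not real identifiers; key names are hypothetical.
credentials = {
    "subscription_id": "00000000-0000-0000-0000-000000000000",
    "tenant_id": "11111111-1111-1111-1111-111111111111",
}

# Collect the names of any entries that do not look like GUIDs.
invalid = [name for name, value in credentials.items() if not is_guid(value)]
print(invalid)
```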

3.3. Application & Service Principal
This may be a bit heavy but will make sense as we get into Cloudera and Hortonworks deployments on Azure - you will need to create these objects and use them in Cloudera Director/Hortonworks Cloudbreak.

The information below is straight off of the Azure documentation -
An application that has been integrated with Azure AD has implications that go beyond the software aspect. "Application" is frequently used as a conceptual term, referring to not only the application software, but also its Azure AD registration and role in authentication/authorization "conversations" at runtime. By definition, an application can function in a client role (consuming a resource), a resource server role (exposing APIs to clients), or even both. The conversation protocol is defined by an OAuth 2.0 Authorization Grant flow, allowing the client/resource to access/protect a resource's data respectively. Now let's go a level deeper, and see how the Azure AD application model represents an application at design-time and run-time.

When you register an Azure AD application in the Azure portal, two objects are created in your Azure AD tenant: an application object, and a service principal object.

Application object:
An Azure AD application is defined by its one and only application object, which resides in the Azure AD tenant where the application was registered, known as the application's "home" tenant.

Service principal object:
In order to access resources that are secured by an Azure AD tenant, the entity that requires access must be represented by a security principal. This is true for both users (user principal) and applications (service principal). The security principal defines the access policy and permissions for the user/application in that tenant. This enables core features such as authentication of the user/application during sign-in, and authorization during resource access.

When an application is given permission to access resources in a tenant upon registration or consent, a service principal object is created.

Application and service principal relationship:
Consider the application object as the global representation of your application for use across all tenants, and the service principal as the local representation for use in a specific tenant. The application object serves as the template from which common and default properties are derived for use in creating corresponding service principal objects. An application object therefore has a 1:1 relationship with the software application, and a 1:many relationship with its corresponding service principal object(s).

A service principal must be created in each tenant where the application is used, enabling it to establish an identity for sign-in and/or access to resources being secured by the tenant. A single-tenant application has only one service principal (in its home tenant), created and consented for use during application registration. A multi-tenant Web application/API also has a service principal created in each tenant where a user from that tenant has consented to its use.
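
If the application/service principal relationship is easier to read as code, here is a toy model of it. This is not the Azure AD API; the class names, the consent method, and the tenant names are all made up to illustrate the 1:many relationship described above.

```python
from dataclasses import dataclass, field

@dataclass
class ServicePrincipal:
    app_id: str
    tenant: str          # tenant in which this local representation lives

@dataclass
class Application:
    app_id: str
    home_tenant: str     # tenant where the application was registered
    service_principals: dict = field(default_factory=dict)

    def __post_init__(self):
        # Registration creates the service principal in the home tenant.
        self.consent(self.home_tenant)

    def consent(self, tenant: str) -> ServicePrincipal:
        """Create (or return the existing) service principal for a tenant."""
        return self.service_principals.setdefault(
            tenant, ServicePrincipal(self.app_id, tenant))

# Hypothetical application and tenant names.
app = Application(app_id="hadoop-provisioner", home_tenant="contoso")
app.consent("fabrikam")          # a second tenant consents to the app
app.consent("fabrikam")          # consenting again does not add a duplicate
print(sorted(app.service_principals))
```

One application object, one service principal per tenant that uses it: a single-tenant application ends up with exactly one entry, a multi-tenant one with several.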

The steps for creating an application and service principal are here.

3.4. Azure Active Directory Domain Services (AAD DS)
Many enterprise customers prefer PaaS identity solutions on Azure and sync their on-premises directory solutions with Azure Active Directory.  Azure Active Directory supports OAuth and other authentication protocols but not Kerberos, whereas Hadoop supports only Kerberos for strong network authentication.  To support applications/technologies that require Kerberos authentication, in conjunction with Azure Active Directory for identity and access management (IAM), Microsoft released a PaaS service called Azure Active Directory Domain Services (AAD DS).  Under the hood, a couple of the components are Active Directory domain controllers in HA, that is, primary and secondary domain controllers.  AAD DS acts as the conduit between Kerberos-requiring applications and Azure Active Directory for IAM: it maps your OAuth-based identity objects to Kerberos equivalents for you and lets you kerberize against it.

So, in the context of Hadoop, here is how you could use AAD DS.
Let's say your enterprise is already using Azure Active Directory for groups and users (cloud only, or synced from an on-prem directory service).  You can provision AAD DS against your Azure tenant; AAD DS will sync in your groups and users, and auto-sync incremental changes at specific intervals (20 minutes at the time of writing this blog).  So, it's effectively one-way: changes flow from Azure AD into AAD DS, not back.
You can then kerberize your cluster against AAD DS just like you would with an Active Directory domain controller.   All the Hadoop service principals and the machine principals get created in AAD DS.

 

Previously, AAD DS supported only classic mode; today it also supports provisioning into an ARM virtual network.  ARM is covered further in this post.

Here is a blog post, from Paige Liu of Microsoft, detailing how to Kerberize a Cloudera cluster against AAD DS.

4. Azure Networking

Azure has a number of networking services.  Those relevant to Hadoop are covered below.
https://azure.microsoft.com/en-us/services/?filter=networking

4.1. Azure ExpressRoute
Azure ExpressRoute lets you create private connections between Azure datacenters and infrastructure that is on your premises or in a colocation environment.
https://azure.microsoft.com/en-us/services/expressroute/

Azure enterprise customers typically leverage ExpressRoute or VPN Gateway to connect their on-premises data center to their Hadoop cluster virtual network.  This creates a seamless, secure, on-premises-like experience that uses the private IPs of the Hadoop cluster nodes.  Here is an example..

Here is an implementation of Cassandra (not Hadoop, yes) on Azure, for one of our customers, by my Azure Apps and Infra Cloud Solution Architect counterpart, Ed Mondek.  It demonstrates the ExpressRoute configuration.

4.2. Azure VPN Gateway
Azure VPN Gateway connects your on-premises networks to Azure through Site-to-Site VPNs in a similar way that you set up and connect to a remote branch office. The connectivity is secure and uses the industry-standard protocols Internet Protocol Security (IPsec) and Internet Key Exchange (IKE).
/en-us/azure/vpn-gateway/

As mentioned in the ExpressRoute section, we see customers use either ExpressRoute or Azure VPN Gateway.

4.3. Azure Load Balancer
Azure Load Balancer distributes Internet and private network traffic among healthy service instances in cloud services or virtual machines. It lets you achieve greater reliability and seamlessly add more capacity to your applications.
https://azure.microsoft.com/en-us/services/load-balancer/

So, in the context of Hadoop, any service that needs load balancing can be fronted with Azure Load Balancer.

4.4. Azure DNS
Azure DNS lets you host your DNS domains alongside your Azure apps and manage DNS records by using your existing Azure subscription. Microsoft's global network of name servers has the reach, scale, and redundancy to ensure ultra-fast DNS responses and ultra-high availability for your domains. With Azure DNS, you can be sure your DNS will always be fast and available.
https://azure.microsoft.com/en-us/services/dns/

In the context of Hadoop, we see some customers use Azure DNS, and some use their own DNS servers (IaaS).

4.5. Virtual Network (VNet)
Azure Virtual Network lets you create private networks in the cloud with full control over IP addresses, DNS servers, security rules, and traffic flows. Securely connect a virtual network to on-premises networks by using a VPN tunnel, or connect privately by using the ExpressRoute service.

In the context of Hadoop, your cluster would reside in a VNet.  As shown in the diagram under ExpressRoute, there would be multiple subnets that you would create to logically isolate and apply different security rules by subnet.

VNet capabilities: (straight off of Azure documentation)

4.5.1. Isolation
VNets are isolated from one another. You can create separate VNets for development, testing, and production that use the same CIDR address blocks. Conversely, you can create multiple VNets that use different CIDR address blocks and connect networks together. You can segment a VNet into multiple subnets. Azure provides internal name resolution for VMs and Cloud Services role instances connected to a VNet. You can optionally configure a VNet to use your own DNS servers, instead of using Azure internal name resolution.

You can implement multiple VNets within each Azure subscription and Azure region. Each VNet is isolated from other VNets. For each VNet you can:

  • Specify a custom private IP address space using public and private (RFC 1918) addresses. Azure assigns resources connected to the VNet a private IP address from the address space you assign.
  • Segment the VNet into one or more subnets and allocate a portion of the VNet address space to each subnet.
  • Use Azure-provided name resolution or specify your own DNS server for use by resources connected to a VNet.
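
To make the address-space and subnet bullets concrete, here is a minimal sketch using Python's standard ipaddress module. The 10.0.0.0/16 range and the master/worker/edge subnet names are assumptions I picked for a Hadoop-style layout, not anything Azure prescribes.

```python
import ipaddress

vnet = ipaddress.ip_network("10.0.0.0/16")   # assumed VNet address space

# Split the VNet into /24 blocks and hand the first few to named subnets.
blocks = vnet.subnets(new_prefix=24)
subnets = {name: next(blocks) for name in ("master", "worker", "edge")}

for name, net in subnets.items():
    # Azure reserves some addresses in every subnet, so the usable host
    # count is lower than num_addresses.
    print(name, net, net.num_addresses)
# master 10.0.0.0/24 256
# worker 10.0.1.0/24 256
# edge 10.0.2.0/24 256
```

Sizing the worker subnet is the part worth thinking about up front, since that is the one that grows with the cluster.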

4.5.2. Internet connectivity
All Azure Virtual Machines (VM) and Cloud Services role instances connected to a VNet have access to the Internet, by default. You can also enable inbound access to specific resources, as needed.  You can block connectivity to the internet as needed.

All resources connected to a VNet have outbound connectivity to the Internet by default. The private IP address of the resource is source network address translated (SNAT) to a public IP address by the Azure infrastructure.  You can change the default connectivity by implementing custom routing and traffic filtering.

To communicate inbound to Azure resources from the Internet, or to communicate outbound to the Internet without SNAT, a resource must be assigned a public IP address.

4.5.3. Azure resource connectivity
Azure resources such as Cloud Services and VMs can be connected to the same VNet. The resources can connect to each other using private IP addresses, even if they are in different subnets. Azure provides default routing between subnets, VNets, and on-premises networks, so you don't have to configure and manage routes.

4.5.4. VNet connectivity
VNets can be connected to each other, enabling resources connected to any VNet to communicate with any resource on any other VNet.

Peering:
Enables resources connected to different Azure VNets within the same Azure region to communicate with each other. The bandwidth and latency across the VNets is the same as if the resources were connected to the same VNet.

Global VNet peering:
Global VNet Peering enables peering virtual networks in different Azure regions. Traffic that flows through peered virtual networks never leaves the Microsoft backbone network. You can create a global, private peered virtual network through Global VNet Peering, enabling a variety of scenarios such as data replication, disaster recovery, and database failover through private IPs alone.  Documentation

Examples of where you could use this feature: distcp to DR where you want all machines across environments to have private IPs only, HBase replication...

VNet-to-VNet connection:
Enables resources connected to different Azure VNets, within the same or different Azure regions, to communicate with each other. Unlike peering, bandwidth is limited between VNets because traffic must flow through an Azure VPN Gateway.
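
One practical gotcha: isolated VNets can reuse the same CIDR block, but VNets you intend to peer or connect through a gateway need address spaces that do not overlap. A minimal pre-flight check might look like this; the VNet names and ranges are made up for the example.

```python
import ipaddress
from itertools import combinations

def overlapping_pairs(vnets: dict) -> list:
    """Return the pairs of VNet names whose address spaces overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in vnets.items()}
    return [(a, b) for a, b in combinations(nets, 2)
            if nets[a].overlaps(nets[b])]

# Hypothetical environments; "dev" deliberately reuses part of prod's range.
vnets = {
    "prod": "10.1.0.0/16",
    "dr": "10.2.0.0/16",
    "dev": "10.1.128.0/17",
}
print(overlapping_pairs(vnets))   # [('prod', 'dev')]
```

Here prod and dr could be connected for distcp or HBase replication, while dev would have to be re-addressed first.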

4.5.5. On-premises connectivity:
VNets can be connected to on-premises networks through private network connections between your network and Azure, or through a site-to-site VPN connection over the Internet.

 You can connect your on-premises network to a VNet using any combination of the following options:

  • Point-to-site virtual private network (VPN): Established between a single PC connected to your network and the VNet. This connection type is great if you're just getting started with Azure, or for developers, because it requires little or no changes to your existing network. The connection uses the SSTP protocol to provide encrypted communication over the Internet between the PC and the VNet. The latency for a point-to-site VPN is unpredictable, since the traffic traverses the Internet.
  • Site-to-site VPN: Established between your VPN device and an Azure VPN Gateway. This connection type enables any on-premises resource you authorize to access a VNet. The connection is an IPSec/IKE VPN that provides encrypted communication over the Internet between your on-premises device and the Azure VPN gateway. The latency for a site-to-site connection is unpredictable, since the traffic traverses the Internet.
  • Azure ExpressRoute: Established between your network and Azure, through an ExpressRoute partner. This connection is private. Traffic does not traverse the Internet. The latency for an ExpressRoute connection is predictable, since traffic doesn't traverse the Internet.

4.5.6. Traffic filtering:
VM and Cloud Services role instances network traffic can be filtered inbound and outbound by source IP address and port, destination IP address and port, and protocol.

You can filter network traffic between subnets using either or both of the following options:

  • 4.5.6.1. Network security groups (NSG): Each NSG can contain multiple inbound and outbound security rules that enable you to filter traffic by source and destination IP address, port, and protocol. You can apply an NSG to each NIC in a VM. You can also apply an NSG to the subnet a NIC, or other Azure resource, is connected to.  You can prioritize inbound and outbound rules.
    Documentation
    Pictorial overview of how NSGs work:

    Default rules:
    All NSGs contain a set of default rules. The default rules cannot be deleted, but because they are assigned the lowest priority, they can be overridden by the rules that you create.

    The default rules allow and disallow traffic as follows:

    • Virtual network: Traffic originating and ending in a virtual network is allowed both in inbound and outbound directions.
    • Internet: Outbound traffic is allowed, but inbound traffic is blocked.
    • Load balancer: Allow Azure’s load balancer to probe the health of your VMs and role instances. If you are not using a load balanced set you can override this rule.

    Associating NSGs:
    You can associate an NSG to VMs, NICs, and subnets, depending on the deployment model you are using, as follows:

      • NIC: Security rules are applied to all traffic to/from the NIC the NSG is associated to.
      • Subnet: Security rules are applied to any traffic to/from any resources connected to the subnet.

    You can associate different NSGs to a NIC and to the subnet that the NIC is connected to.

    Order of implementation of security rules:
    Security rules are applied to the traffic, by priority, in each NSG, in the following order:

    • Inbound traffic
      1. NSG applied to subnet: If a subnet NSG has a matching rule to deny traffic, the packet is dropped.
      2. NSG applied to NIC (Resource Manager): If NIC NSG has a matching rule that denies traffic, packets are dropped at the NIC, even if a subnet NSG has a matching rule that allows traffic.
    • Outbound traffic
      1. NSG applied to NIC : If a NIC NSG has a matching rule that denies traffic, packets are dropped.
      2. NSG applied to subnet: If a subnet NSG has a matching rule that denies traffic, packets are dropped, even if a NIC NSG has a matching rule that allows traffic.
  • 4.5.6.2. Network virtual appliances (NVA): An NVA is a VM running software that performs a network function, such as a firewall. NVAs are also available that provide WAN optimization and other network traffic functions. NVAs are typically used with user-defined or BGP routes. You can also use an NVA to filter traffic between VNets.  You can visit the Azure Marketplace to view available NVA offerings.
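
The NSG order of implementation from 4.5.6.1 can be sketched roughly as follows. This is a simplified model, not how Azure implements it: rules match only on a remote CIDR and a port, a lower priority number is evaluated first, and an NSG with no matching rule is treated as deny. The rule sets are invented for the example.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

@dataclass(frozen=True)
class Rule:
    priority: int   # lower number = evaluated first
    remote: str     # CIDR of the remote side (source inbound, destination outbound)
    port: int
    allow: bool

def first_match(rules, remote_ip, port):
    """The first matching rule, in priority order, decides the outcome."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.port == port and ip_address(remote_ip) in ip_network(rule.remote):
            return rule.allow
    return False    # simplification: no matching rule means deny

def evaluate(direction, subnet_nsg, nic_nsg, remote_ip, port):
    # Inbound: subnet NSG first, then NIC NSG. Outbound: the reverse.
    stages = [("subnet", subnet_nsg), ("NIC", nic_nsg)]
    if direction == "outbound":
        stages.reverse()
    for where, rules in stages:
        if not first_match(rules, remote_ip, port):
            return f"dropped at {where}"
    return "allowed"

# Hypothetical rules for SSH (port 22) to a cluster node.
subnet_nsg = [Rule(100, "10.0.0.0/8", 22, True), Rule(200, "0.0.0.0/0", 22, False)]
nic_nsg = [Rule(100, "10.0.5.0/24", 22, False), Rule(200, "10.0.0.0/8", 22, True)]

print(evaluate("inbound", subnet_nsg, nic_nsg, "10.0.9.9", 22))     # allowed
print(evaluate("inbound", subnet_nsg, nic_nsg, "10.0.5.7", 22))     # dropped at NIC
print(evaluate("inbound", subnet_nsg, nic_nsg, "192.168.1.1", 22))  # dropped at subnet
```

Note the second case: the subnet NSG allows the packet, but the NIC NSG still drops it, which is exactly the "even if a subnet NSG has a matching rule that allows traffic" clause above.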

4.5.7. Routing:
You can optionally override Azure's default routing by configuring your own routes, or using BGP routes through a network gateway.

Azure creates route tables that enable resources connected to any subnet in any VNet to communicate with each other, by default. You can implement either or both of the following options to override the default routes Azure creates:

  • User-defined routes: You can create custom route tables with routes that control where traffic is routed to for each subnet. Documentation
  • BGP routes: If you connect your VNet to your on-premises network using an Azure VPN Gateway or ExpressRoute connection, you can propagate BGP routes to your VNets.
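
How an override actually wins is worth a quick illustration. To my understanding, Azure picks the route with the longest matching prefix, and when prefixes tie it prefers user-defined routes over BGP routes over system routes. The sketch below models just that selection; the route entries and next-hop names are invented for the example.

```python
from ipaddress import ip_address, ip_network

# Tie-break order when two matching routes share the same prefix.
SOURCE_RANK = {"user": 0, "bgp": 1, "system": 2}

def select_route(routes, dest_ip):
    """routes: list of (prefix, source, next_hop) tuples."""
    dest = ip_address(dest_ip)
    matches = [r for r in routes if dest in ip_network(r[0])]
    if not matches:
        return None
    # Longest prefix first, then user-defined > BGP > system.
    return min(matches,
               key=lambda r: (-ip_network(r[0]).prefixlen, SOURCE_RANK[r[1]]))

routes = [
    ("10.0.0.0/16", "system", "VNet"),
    ("0.0.0.0/0", "system", "Internet"),
    ("0.0.0.0/0", "user", "firewall-nva"),    # hypothetical NVA next hop
    ("192.168.0.0/16", "bgp", "gateway"),     # on-premises range learned via BGP
]

print(select_route(routes, "10.0.1.4")[2])      # VNet
print(select_route(routes, "8.8.8.8")[2])       # firewall-nva
print(select_route(routes, "192.168.3.3")[2])   # gateway
```

This is the usual pattern for forcing Internet-bound traffic from cluster nodes through a firewall NVA while leaving intra-VNet traffic alone.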

4.5.8. IP addresses for Virtual Machines:
By default, VMs have private and public IP addresses.  You can dissociate the public IP address from the VM and delete it.
You can provision public IP addresses as needed and choose between static/dynamic, IPv4/IPv6 and other criteria.

We typically see customers not have public IPs on their cluster nodes.

4.5.9. VNet service endpoint:
Virtual Network (VNet) service endpoints extend your virtual network private address space and the identity of your VNet to the Azure PaaS services, over a direct connection. Endpoints allow you to secure your critical Azure service resources to only your virtual networks. Traffic from your VNet to the Azure service always remains on the Microsoft Azure backbone network.

In summary, you can ACL the supported PaaS services to accept traffic ONLY from your VNet, providing added security.  This feature is in preview and has been rolled out to specific regions only.  Documentation

In the context of Hadoop, VNet service endpoint would apply if you were to use Azure Blob Storage as secondary HDFS - you would want to allow access only to traffic from your Hadoop VNet. Here is an example with Microsoft's Hortonworks PaaS - HDInsight.

Blog series:

This blog post covered Azure networking; my next blog will delve into Azure storage.
Here is the complete listing of the series.

Just enough Azure for Hadoop - Part 1 | Focuses on networking, other basics
Just enough Azure for Hadoop - Part 2 | Focuses on storage
Just enough Azure for Hadoop - Part 3 | Focuses on compute
Just enough Azure for Hadoop - Part 4 | Focuses on select Azure Data Services (PaaS)

Thanks to fellow Azure Data Solution Architect, Ryan Murphy for his review and feedback.