Securing Azure HDInsight with Apache Ranger & Azure Active Directory Domain-joined Clustering

Enterprise security for Hadoop on Azure consists of the following pillars:

  • Perimeter Security
  • Authentication
  • Authorization
  • Auditing
  • Data Encryption

Azure HDInsight Premium clusters were recently announced. They include the following features:

  • Apache Ranger
  • Domain joining
  • Secure Shell (SSH) access
  • HDInsight Applications
  • Custom VNET
  • Custom Hive metastore
  • Custom Oozie metastore
  • Azure Data Lake Store access

Currently (as of writing this blog), Azure HDInsight Premium clusters are available only for the 'Hadoop' cluster type, with version 'Hadoop 2.7.3 (HDI 3.5)' on a Linux OS. If the 'Premium' HDI cluster provisioning option is not available in your subscription, please write to the HDInsight team to enable your subscription for preview and upcoming production-level features of HDInsight.

 

The domain-joined HDI cluster must be associated with a custom Azure VNET and subnet, and further configured with the Azure AD domain controller name, username, password, Organizational Unit (OU), LDAPS URL and access user group.
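
As a rough illustration, the settings listed above can be collected and sanity-checked before starting a deployment. This is only a sketch: the class name, field names and checks below are my own assumptions and are not part of any Azure SDK or template schema. The only external fact relied on is that an LDAPS URL uses the `ldaps` scheme.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import urlparse


@dataclass(frozen=True)
class DomainJoinSettings:
    """Values needed for a domain-joined cluster (field names are illustrative)."""
    vnet_name: str
    subnet_name: str
    domain_controller_name: str
    domain_username: str
    domain_password: str
    organizational_unit: str
    ldaps_url: str
    access_user_group: str

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means nothing obvious is wrong."""
        issues = []
        # Every setting is required, so flag any that is blank.
        for field_name, value in vars(self).items():
            if not value.strip():
                issues.append(f"{field_name} is empty")
        # The LDAPS URL should use the ldaps scheme and name a host.
        parsed = urlparse(self.ldaps_url)
        if parsed.scheme != "ldaps" or not parsed.hostname:
            issues.append("ldaps_url must look like ldaps://host:port")
        return issues
```

Calling `problems()` on a filled-in instance before provisioning gives a quick list of missing or malformed values. It does not check that the domain controller is actually reachable.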

For the steps to provision a domain-joined HDInsight cluster, refer to this article. The solution architecture for configuring a domain-joined HDI cluster is as follows:

[Figure: solution architecture of a domain-joined HDI cluster]

While provisioning the domain-joined cluster, the Azure AD cluster identity can be configured as well. With it, data in Azure Data Lake Store (ADLS) can be accessed from the HDI cluster and analysed with Hadoop ecosystem tools.

In this demo of configuring a domain-joined HDI cluster and securing it with Apache Ranger, the cluster is joined to a domain hosted on an Azure VM configured as the domain controller (DC), along with a reverse DNS lookup zone for secure Kerberos authentication. Please make sure to use domain SSL certificates when provisioning production-level domain-joined HDI clusters.

Once the domain-joined HDI cluster is provisioned successfully, you should see the following resources in your Azure template.

 

[Figure: resources of the provisioned domain-joined HDI cluster]

Under the properties tab of the newly provisioned domain-joined HDI cluster, select 'Domain Configuration' to start accessing the Apache Ranger portal for secure authentication and authorization in the HDI cluster.

Click the 'Ranger' tab above 'Directory type' and provide only your Azure AD admin credentials (not the credentials of your HDInsight cluster, nor the SSH credentials of the Linux HDI node). Alternatively, you can access the portal through this URL: https://<yourdomain-joinedhdiclustername>.azurehdinsight.net:6080
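
If you manage several clusters, the portal address can be derived from the cluster name. Below is a minimal sketch of that URL pattern. The function name is hypothetical, and the name check (letters, digits and hyphens only) is my own assumption about what fits in a host name, not a documented HDInsight rule.

```python
import re


def ranger_portal_url(cluster_name: str, port: int = 6080) -> str:
    """Build the Ranger portal URL for a domain-joined HDI cluster name."""
    # Assumed rule: the name must be usable as a single DNS label.
    if not re.fullmatch(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?", cluster_name):
        raise ValueError(f"not usable as a host name: {cluster_name!r}")
    return f"https://{cluster_name}.azurehdinsight.net:{port}"
```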

Apache Ranger provides comprehensive security to monitor and manage data access across Hadoop clusters. It offers a centralized platform to define, administer and manage security policies in Hadoop. On opening the Ranger portal, click 'Hive' to configure security policies for the Hive metastore and tables.

Here, for the domain-joined HDI cluster, I've configured two Azure AD groups. One of them, 'HiveUsers', has two users, 'HiveUser1' and 'HiveUser2' (analogous to 'Alice' and 'Bob' in Kerberos authentication examples), which are used to configure different levels of data access and security policies for Hive table data. For example, 'HiveUser1' has access to the full schemas of the default Hive sampletable metastore data, whereas 'HiveUser2' has only 'Read' access to two different data schemas of the default metastore.
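
To make the two access levels concrete, here is a small sketch that models them as Python dictionaries shaped like Ranger policy JSON as I understand it (`resources` with database/table/column values, and `policyItems` listing users and access types). Several things are assumptions for illustration: the service name `hdi_hive`, the two column names picked for 'HiveUser2', and the reading of "two data schemas" as two columns. The `selectable_columns` helper is a deliberately simplified check, not Ranger's real evaluation engine (it ignores groups, deny rules and wildcards in names).

```python
from typing import Iterable, List, Set


def hive_policy(name: str, users: Iterable[str], columns: Iterable[str],
                access_types: Iterable[str], service: str = "hdi_hive",
                database: str = "default", table: str = "hivesampletable") -> dict:
    """Build a dict shaped like a Ranger Hive policy (service name is hypothetical)."""
    return {
        "service": service,
        "name": name,
        "isEnabled": True,
        "resources": {
            "database": {"values": [database]},
            "table": {"values": [table]},
            "column": {"values": list(columns)},
        },
        "policyItems": [{
            "users": list(users),
            "accesses": [{"type": t, "isAllowed": True} for t in access_types],
        }],
    }


POLICIES: List[dict] = [
    # HiveUser1: every column, read and write style access.
    hive_policy("hivesampletable-full", ["HiveUser1"], ["*"],
                ["select", "update", "create", "drop", "alter"]),
    # HiveUser2: read-only on two columns (column names are illustrative).
    hive_policy("hivesampletable-read-two-columns", ["HiveUser2"],
                ["clientid", "devicemake"], ["select"]),
]


def selectable_columns(policies: List[dict], user: str) -> Set[str]:
    """Collect the columns a user may SELECT under the enabled policies."""
    allowed: Set[str] = set()
    for policy in policies:
        if not policy["isEnabled"]:
            continue
        for item in policy["policyItems"]:
            can_select = any(a["type"] == "select" and a["isAllowed"]
                             for a in item["accesses"])
            if user in item["users"] and can_select:
                allowed.update(policy["resources"]["column"]["values"])
    return allowed
```

With these policies, `selectable_columns(POLICIES, "HiveUser2")` yields only the two listed columns, while 'HiveUser1' gets the `*` wildcard.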

Hive ODBC DSN access for configuring BI dashboards in Excel & PowerBI: To use a system DSN, install the Hive ODBC driver that matches your OS (x64/x86) and configure the HDI cluster details. Under credentials, provide the Azure AD domain controller admin user credentials (not the domain-joined HDI cluster's user credentials, nor the cluster SSH credentials).
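
The same system DSN can also be used from a script rather than from Excel or PowerBI. The sketch below only assembles an ODBC connection string from the standard `DSN`, `UID` and `PWD` keywords. The function name and the DSN name in the usage note are hypothetical. Values containing special characters are wrapped in braces, with `}` doubled, which follows the usual ODBC quoting convention.

```python
def odbc_connection_string(dsn: str, user: str, password: str) -> str:
    """Assemble 'DSN=...;UID=...;PWD=...' for a configured Hive system DSN."""

    def quote(value: str) -> str:
        # Brace-quote values that would otherwise break the key=value;... syntax.
        if any(ch in value for ch in ";{}=") or value != value.strip():
            return "{" + value.replace("}", "}}") + "}"
        return value

    if not dsn or not user:
        raise ValueError("dsn and user are required")
    parts = [("DSN", dsn), ("UID", user), ("PWD", password)]
    return ";".join(f"{key}={quote(value)}" for key, value in parts)
```

The resulting string could then be passed to a third-party ODBC binding, for example `pyodbc.connect(odbc_connection_string("HDI-Hive", user, password))`, assuming `pyodbc` is installed and a DSN named `HDI-Hive` exists on the machine.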

Now, by building a BI dashboard in PowerBI Desktop on data from 'HiveSampleTable' of the domain-joined HDI cluster as 'HiveUser2', the difference in data access can be verified. If we try to access all the Hive table schemas as 'HiveUser2', an exception is thrown, whereas 'HiveUser1' has access to all the default metastore data schemas.

 

More details on configuring & securing Hive policies through Apache Ranger in Azure HDInsight can be found in this MSDN blog.