Introducing File and Folder ACLs for Azure Data Lake Store


Overview

We’re excited today to announce the availability of File and Folder ACLs for Azure Data Lake Store. Many of you have been eagerly awaiting this feature because it is critical to securing your big data.

When we launched the preview of Data Lake Store in October 2015, filesystem security was controlled by a single ACL at the root of the store that applied to all files and folders underneath it.

Starting today, ACLs can be set on any file or folder within the store, not just the root folder.

The Access Control Model used by Data Lake Store

We’ve emphasized that Azure Data Lake Store is compatible with WebHDFS. Now that ACLs are fully available, it’s important to understand the ACL model in WebHDFS/HDFS, because these are POSIX-style ACLs, not Windows-style ACLs. Before we dive deep into the details of the ACL model, here are the key points to remember.

  • POSIX-STYLE ACLs DO NOT ALLOW INHERITANCE. For those of you familiar with POSIX ACLs, this is not a surprise. For those coming from a Windows background, this is very important to keep in mind. For example, if Alice can read files in folder /foo, it does not mean that she can read files in /foo/bar. She must be granted explicit permission to /foo/bar (see the sketch after this list). The POSIX ACL model differs in some other interesting ways, but this lack of inheritance is the most important thing to keep in mind.
  • ADDING A NEW USER TO DATA LAKE ANALYTICS REQUIRES A FEW NEW STEPS. Fortunately, a portal wizard automates the most difficult steps for you.
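
To make the "no inheritance" point concrete, here is a minimal sketch using the Azure PowerShell cmdlets for Data Lake Store; the account name "myadls", the folder paths, and Alice's object ID are placeholder values, and the documentation linked below remains the authoritative reference.

    # Alice's Azure AD object ID (placeholder)
    $aliceId = "00000000-0000-0000-0000-000000000000"

    # Grant Alice read + execute on /foo ...
    Set-AzureRmDataLakeStoreItemAclEntry -Account "myadls" -Path "/foo" `
        -AceType User -Id $aliceId -Permissions ReadExecute

    # ... but POSIX-style ACLs do not inherit, so this grants her nothing on /foo/bar.
    # She needs an explicit entry on each subfolder (and file) she should be able to read.
    Set-AzureRmDataLakeStoreItemAclEntry -Account "myadls" -Path "/foo/bar" `
        -AceType User -Id $aliceId -Permissions ReadExecute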

The full description of the Access Control model is here: https://azure.microsoft.com/en-us/documentation/articles/data-lake-store-access-control/

Adding a New Data Lake Analytics User

If you want a new user to run U-SQL jobs in ADLA, the overall steps are shown below (a rough PowerShell sketch of steps 1, 4, and 5 follows the list):

  1. Assign the user to a role in the Azure Data Lake Analytics account (using Azure RBAC)
  2. *Optional* Assign the user to a role in the Azure Data Lake Store account (using Azure RBAC)
  3. Run the ADLA “Add User Wizard” for the user
  4. Give the user R-X access, applied recursively, on every folder (and its subfolders) from which U-SQL jobs must read data
  5. Give the user RWX access, applied recursively, on every folder (and its subfolders) to which U-SQL jobs must write data
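
Here is a rough PowerShell sketch of steps 1, 4, and 5, assuming the AzureRM cmdlets. Every name below (the user, the subscription and account identifiers, the role, and the folder paths) is a placeholder, and the portal wizard plus the detailed instructions linked below handle the recursive application for you.

    # Look up the new user's Azure AD object ID (the UPN is a placeholder)
    $user = Get-AzureRmADUser -UserPrincipalName "alice@contoso.com"

    # Step 1: assign an RBAC role on the Data Lake Analytics account
    New-AzureRmRoleAssignment -ObjectId $user.Id `
        -RoleDefinitionName "Data Lake Analytics Developer" `
        -Scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.DataLakeAnalytics/accounts/<adla-account>"

    # Step 4: R-X where U-SQL jobs read data (repeat for each folder; because the
    # ACLs do not inherit, the entry must also be applied to every subfolder and file)
    Set-AzureRmDataLakeStoreItemAclEntry -Account "<adls-account>" -Path "/input" `
        -AceType User -Id $user.Id -Permissions ReadExecute

    # Step 5: RWX (the "All" permission) where U-SQL jobs write data
    Set-AzureRmDataLakeStoreItemAclEntry -Account "<adls-account>" -Path "/output" `
        -AceType User -Id $user.Id -Permissions All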

Detailed Instructions can be found here: https://1drv.ms/w/s!AvdZLquGMt47gzohZ69Ob47k-P_y

Adding a New Data Lake Store User

Detailed Instructions can be found here: https://1drv.ms/w/s!AvdZLquGMt47gzyviEyNrAn8kAqS

Giving an HDInsight Cluster Access to Data Lake Store

Detailed Instructions can be found here: https://1drv.ms/w/s!AvdZLquGMt47gz3ks4YwQRMXGi3j
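
An HDInsight cluster reaches Data Lake Store through its Azure AD service principal, so that principal needs ACL entries just as a user does. The sketch below shows the general shape only; the object ID, account name, and folder paths are placeholders, and note that with POSIX semantics the principal also needs execute (--x) on every ancestor folder in order to traverse down to its data.

    # Object ID of the cluster's Azure AD service principal (placeholder)
    $clusterSpId = "00000000-0000-0000-0000-000000000000"

    # Execute on the root and intermediate folders lets the cluster traverse the path
    Set-AzureRmDataLakeStoreItemAclEntry -Account "<adls-account>" -Path "/" `
        -AceType User -Id $clusterSpId -Permissions Execute
    Set-AzureRmDataLakeStoreItemAclEntry -Account "<adls-account>" -Path "/clusters" `
        -AceType User -Id $clusterSpId -Permissions Execute

    # Full access on the folder tree the cluster actually uses (applied to each
    # subfolder and file as well, since ACLs do not inherit)
    Set-AzureRmDataLakeStoreItemAclEntry -Account "<adls-account>" -Path "/clusters/mycluster" `
        -AceType User -Id $clusterSpId -Permissions All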

ProTip: Leverage the power of Active Directory Security groups

Repeating manual steps is both irritating and prone to error. It’s easier if you use Active Directory security groups.

First, give the needed permissions to the security group. Afterwards, adding a new user is simple: just add them to the security group. This will dramatically simplify maintaining and securing your Data Lake.
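
A rough sketch of that pattern, assuming the AzureAD PowerShell module for the directory operations and the AzureRM cmdlets for the store (the group name, account, path, and user below are placeholders):

    # Create a security group (or look up an existing one) and note its object ID
    $group = New-AzureADGroup -DisplayName "BigDataReaders" `
        -MailEnabled $false -SecurityEnabled $true -MailNickName "BigDataReaders"

    # Grant the GROUP read + execute on the folder; the optional -Default entry means
    # items created under /data in the future start out with the same entry
    # (existing items are unaffected)
    Set-AzureRmDataLakeStoreItemAclEntry -Account "<adls-account>" -Path "/data" `
        -AceType Group -Id $group.ObjectId -Permissions ReadExecute
    Set-AzureRmDataLakeStoreItemAclEntry -Account "<adls-account>" -Path "/data" `
        -AceType Group -Id $group.ObjectId -Permissions ReadExecute -Default

    # From now on, onboarding a user is just a group membership change
    $user = Get-AzureADUser -ObjectId "alice@contoso.com"
    Add-AzureADGroupMember -ObjectId $group.ObjectId -RefObjectId $user.ObjectId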

Comments (3)

  1. Jorg Klein says:

    Great news, thanks!

    The word documents need some corrections:
    Doc: Understanding AC - on page 8, PowerShell scripts are mentioned; they can be found in the other doc, while the current doc is mentioned.
    Doc: Add new User - the link to Github is broken: Download the “Add-AdlaJobUser.ps1” PowerShell script from our Github.

    1. Saveen Reddy says:

      Thanks for catching that, Jorg!

      - We've fixed the Understanding AC doc to remove the reference to PowerShell. Instead, the other docs linked in the blog post contain the PowerShell information.
      - The link to the script is also now fixed.
      - The blog post now has two additional links: (1) a doc on adding users to ADL Store and (2) a doc on letting an HDInsight cluster use ADL Store.

  2. Roy.Kim says:

    When creating my HDInsight Hadoop cluster, I am finding that granting my HDInsight Azure AD service principal access to a subfolder within my Azure Data Lake Store takes a long time (around 5 minutes) since there are 2500 files. Any guidance to make this faster, or should I keep my file count (i.e. json, csv) lower (but the files larger)? I intend to work in a scenario where I reach terabytes of data through both large file sizes and a large quantity of files. Appreciate any guidance.
