Today we announced the general availability of Azure Data Lake services, including Azure Data Lake Store, the first cloud data lake built to the open Hadoop Distributed File System (HDFS) standard with enterprise-grade security and massive scale. You can now unlock value from all your unstructured, semi-structured and structured data by running massively parallel analytics on any amount of data. There are no artificial constraints on the amount of data, the number of files or the size of individual files that can be stored. As an example, individual files in Azure Data Lake Store can be petabytes in size, which is at least 200x larger than any other cloud storage service.
Thousands of customers in our public preview have transformed their business with data using Azure Data Lake Store. We have recently added more enterprise-grade security capabilities, provided uptime guarantees and made it even easier to manage. In this blog, we take a deep dive into some key capabilities of Azure Data Lake Store that make it the world’s best data store for big data analytics.
Enterprise Data Lake – For today and tomorrow
To optimize their business, modern data-driven enterprises need to rely on a wide variety of data sources. Data from these sources typically comes in a variety of formats and sizes. Large amounts of data in varying formats present unique problems for conventional data systems that have rigid schemas and limited storage. Azure Data Lake Store addresses these challenges by allowing enterprises to store raw data as-is and keep all data in a single place. With Azure Data Lake Store, schemas only need to be defined at the time the data is used, instead of at the time it is stored. This “Schema on Read” approach lets companies store all their data in a single place without designing data structures ahead of time. A central, enterprise-wide data lake eliminates data silos and encourages data sharing and experimentation.
Using conventional object stores to hold large data sets for analytics has always been challenging. Artificial limits placed by many object stores make them unsuitable as a data lake. For instance, such object stores are unable to store individual files that are hundreds of terabytes in size, such as medical data, gene sequences, seismic data sets, speech, high-resolution video and many more. Azure Data Lake Store is the only cloud service that can store such large files. Files in the data lake store have no size limit and can grow to petabytes. The figure below shows an example of a 1PB file that was created in Azure Data Lake Store.
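To make the schema-on-read idea concrete, here is a minimal sketch in Python. It is not Azure Data Lake Store code: the sample records, the field names and the `read_with_schema` helper are all hypothetical, chosen only to show raw data being stored untouched and a schema being applied when a consumer reads it.

```python
import json

# Raw records stored as-is: no schema was declared when they were written.
# (Hypothetical sample data; the second record carries an extra field.)
RAW_LINES = [
    '{"device": "pump-1", "temp_c": "71.5", "ts": "2016-11-16T10:00:00"}',
    '{"device": "pump-2", "temp_c": "68.0", "ts": "2016-11-16T10:00:05", "site": "north"}',
]

# Schema chosen by one consumer at read time: field name -> parser.
READ_SCHEMA = {"device": str, "temp_c": float}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project each record onto the caller's schema."""
    rows = []
    for line in lines:
        record = json.loads(line)
        # Only the fields this consumer asked for are kept and type-converted.
        rows.append({name: cast(record[name]) for name, cast in schema.items()})
    return rows


rows = read_with_schema(RAW_LINES, READ_SCHEMA)
for row in rows:
    print(row)
```

A second consumer could read the same raw lines with a different schema (for example one that includes `site`) without any change to what was stored.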
With this elastic scale, enterprises can be confident that Azure Data Lake Store will serve their requirements now and in the future as their needs grow.
Scalable performance for massively parallel analytics
Today’s enterprises face increasing demands on the amount of data that needs to be analyzed in a short amount of time. As businesses grow, their analytic applications must efficiently meet these performance needs. Prior to Azure Data Lake Store, application developers faced significant barriers to scale due to limitations of the storage infrastructure. Hitting storage scale limits required expensive redesign of the application, cumbersome repartitioning of data and often a complete rewrite of the application. Azure Data Lake Store provides modern enterprises the assurance that it will scale in-place to meet their needs as their business grows. This central, enterprise-wide repository scales to meet the demand of a variety of analytics engines running concurrently at a large scale.
Azure Data Lake Store provides the performance to match any size of analytics workload. It provides massive throughput to run analytic jobs with thousands of parallel processes that read and write hundreds of terabytes of data efficiently. The larger the data processed by a job, the more benefit you will get by using Azure Data Lake Store. For example, the performance of the big data job below improves linearly as the number of Azure Data Lake Analytics Units (AUs) assigned to it is increased.
HDFS for the cloud
Azure Data Lake Store is a hierarchical file system that is compatible with the HDFS open standard. As a file system, the data lake store provides support for first-class primitives such as files and folders. This helps developers organize the data freely and administrators manage it easily. You can rename a folder, delete a folder and change permissions on files in an atomic manner.
An important design goal for the data lake store is to work seamlessly with all applications within the Hadoop ecosystem. To achieve this goal, we designed the data lake store to be accessible through an API that is compatible with WebHDFS, the open REST interface for HDFS. You can easily run Hadoop and Spark based applications to access data in the data lake store, as well as migrate your existing Hadoop data to Azure Data Lake Store without recreating your HDFS directory structure.
We have also added OAuth 2.0 support for the WebHDFS API and contributed those changes to the Apache Software Foundation.
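As a rough illustration of what a WebHDFS-style request with OAuth 2.0 looks like, the sketch below builds request URLs and a bearer-token header in Python without sending anything over the network. The `/webhdfs/v1/<path>?op=...` layout and the `LISTSTATUS` and `RENAME` operations come from the WebHDFS REST interface. The account name `example`, the host name pattern, the folder paths and the helper names are assumptions made for illustration, and obtaining a real token from Azure Active Directory is out of scope here.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, **params):
    """Build a WebHDFS-style REST URL: https://<host>/webhdfs/v1/<path>?op=<OP>&..."""
    query = urlencode({"op": op, **params})
    return f"https://{host}/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"


def bearer_headers(access_token):
    """OAuth 2.0 bearer token header; the token itself must come from an identity provider."""
    return {"Authorization": f"Bearer {access_token}"}


# Assumed host name for a hypothetical account called "example".
HOST = "example.azuredatalakestore.net"

# List a folder (sent with HTTP GET in WebHDFS).
list_url = webhdfs_url(HOST, "/raw/events", "LISTSTATUS")

# Rename a folder (sent with HTTP PUT in WebHDFS).
rename_url = webhdfs_url(HOST, "/raw/events", "RENAME", destination="/archive/events")

print(list_url)
print(rename_url)
print(bearer_headers("<access-token>"))
```

Any HTTP client can then send these requests. Because the URL shape matches WebHDFS, existing Hadoop tooling that speaks that interface needs little more than a different host and an OAuth token.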
Always encrypted, role-based security, ACLs and Auditing
Solid security is an essential requirement for managing valuable data stored in the enterprise data lake. Compromise of these data assets puts the business at risk and increases regulatory compliance exposure. Azure Data Lake Store is engineered to help enterprises meet their security requirements. It seamlessly integrates with Azure Active Directory, provides fine-grained role-based access to data, audits all activity and ensures data integrity with encryption-at-rest and in-transit.
Authentication: Azure Data Lake Store is the only cloud store that is natively integrated with Azure Active Directory (AAD), a comprehensive identity and access management solution that simplifies the management of users and groups. Azure AD offers advanced capabilities such as lifecycle management for millions of identities, integration with on-premises Active Directory, single sign-on, multi-factor authentication and support for industry standard open authentication protocols.
Authorization: The data lake store provides role-based access control (RBAC) via POSIX-compliant Access Control Lists (ACLs) for managing access to the data in the store. This provides a powerful mechanism for managing fine-grained permissions on all your data at scale. To control read, write and execute permissions on files and folders for users and groups, you can combine POSIX minimal ACLs (the owner, owning group and other entries) with extended ACLs (entries for named users and named groups).
Encryption: Data in Azure Data Lake Store is encrypted at rest by default and is always encrypted in transit. For data at rest, administrators can specify whether to let Azure manage their Master Encryption Keys (MEKs) or to use their own MEKs. In either case, the MEKs are stored and managed securely in Azure Key Vault, which can utilize FIPS 140-2 Level 2 validated hardware security modules (HSMs). Data in transit (also known as data in motion) is always encrypted by using HTTPS (HTTP over Secure Sockets Layer).
Auditing: The data lake store provides rich auditing capability to help meet enterprise security and regulatory compliance requirements. All account management and data access activities are always audited. Administrators can securely access their audit logs in JSON format and analyze them further with the tools of their choice.
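The way minimal and extended ACL entries combine can be sketched as a small permission check in Python. This is a simplified model of the POSIX ACL evaluation order (owner, then named users, then groups, then other, with the mask limiting named-user and group entries). It is not the store's actual implementation, and the `Acl` class, the `is_allowed` function and the user and group names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Acl:
    owner: str
    owning_group: str
    owner_perms: str                                   # minimal ACL entries, e.g. "rwx"
    group_perms: str
    other_perms: str
    named_users: dict = field(default_factory=dict)    # extended ACL entries
    named_groups: dict = field(default_factory=dict)
    mask: str = "rwx"                                  # upper bound for named and group entries


def _bits(perms):
    """Turn a string such as 'r-x' into the set {'r', 'x'}."""
    return {p for p in perms if p != "-"}


def is_allowed(acl, user, groups, wanted):
    """Simplified POSIX ACL check: the first matching class of entries decides."""
    wanted = set(wanted)
    if user == acl.owner:
        return wanted <= _bits(acl.owner_perms)        # the owner is not limited by the mask
    if user in acl.named_users:
        return wanted <= _bits(acl.named_users[user]) & _bits(acl.mask)
    group_entries = []
    if acl.owning_group in groups:
        group_entries.append(acl.group_perms)
    group_entries += [p for g, p in acl.named_groups.items() if g in groups]
    if group_entries:
        # Any single matching group entry that grants everything requested is enough.
        return any(wanted <= _bits(p) & _bits(acl.mask) for p in group_entries)
    return wanted <= _bits(acl.other_perms)


acl = Acl(
    owner="alice", owning_group="analysts",
    owner_perms="rwx", group_perms="r-x", other_perms="---",
    named_users={"bob": "rw-"}, named_groups={"auditors": "r--"},
    mask="r-x",
)

print(is_allowed(acl, "alice", [], "rw"))           # owner entry applies
print(is_allowed(acl, "bob", [], "w"))              # named user entry, limited by the mask
print(is_allowed(acl, "carol", ["auditors"], "r"))  # named group entry
print(is_allowed(acl, "dave", [], "r"))             # falls through to "other"
```

In this model `bob` is listed with `rw-` but the mask `r-x` removes the write bit, which is why a mask is a convenient way to cap what all extended entries can grant.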
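Because the audit logs are JSON, they can be analyzed with a few lines of general-purpose code. The Python sketch below counts failed requests per identity. The log field names (`time`, `identity`, `operation`, `path`, `status`) and the sample entries are invented for illustration and do not reflect the actual audit log schema.

```python
import json
from collections import Counter

# Hypothetical audit entries, one JSON object per line.
SAMPLE_AUDIT_LINES = [
    '{"time": "2016-11-16T10:00:00Z", "identity": "alice@example.com", "operation": "open", "path": "/raw/a.csv", "status": 200}',
    '{"time": "2016-11-16T10:00:02Z", "identity": "bob@example.com", "operation": "delete", "path": "/raw/a.csv", "status": 403}',
    '{"time": "2016-11-16T10:00:09Z", "identity": "bob@example.com", "operation": "rename", "path": "/raw/b.csv", "status": 403}',
]


def failed_requests_by_identity(lines):
    """Count entries whose (assumed) HTTP-style status code signals a failure."""
    counts = Counter()
    for line in lines:
        entry = json.loads(line)
        if entry["status"] >= 400:
            counts[entry["identity"]] += 1
    return counts


failures = failed_requests_by_identity(SAMPLE_AUDIT_LINES)
for identity, count in failures.most_common():
    print(identity, count)
```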
Enterprise-grade Support and SLA
With Azure Data Lake Store, enterprises get industry-standard availability and reliability. Your data assets are stored durably: the service keeps redundant copies to guard against unexpected failures within a region. The data lake store provides a 99.9% monthly uptime SLA and 24/7 support for your big data solution.
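For a sense of scale, a 99.9% monthly uptime target leaves roughly 43 minutes of downtime in a 30-day month. The short Python calculation below shows the arithmetic. The 30-day month and the `allowed_downtime_minutes` helper are assumptions for illustration, and the formal SLA terms may measure availability differently.

```python
def allowed_downtime_minutes(sla_percent, days_in_month=30):
    """Minutes of downtime per month that still meet the given uptime percentage."""
    total_minutes = days_in_month * 24 * 60          # 43,200 minutes in a 30-day month
    return total_minutes * (100.0 - sla_percent) / 100.0


downtime = allowed_downtime_minutes(99.9)
print(f"{downtime:.1f} minutes per 30-day month")
```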
Azure Data Lake Store is the most scalable, secure, enterprise-ready data store in the cloud that is highly optimized for big data analytics workloads. To get started with Azure Data Lake Store, you can visit this page which includes our documentation and helpful videos. You can also suggest new features and capabilities for the data lake store here.
- Azure Data Lake Store Team