Arsen Vladimirskiy | Updated April 27, 2016
Update from November 21, 2018: Information below is dated. Please review a more recent article from Azure CAT team describing Parallel File Systems for HPC Storage on Azure and download PDF of the whitepaper.
Lustre is the most widely used parallel filesystem in high performance computing (HPC) environments. This is because Lustre provides POSIX compliance, offers extreme performance when used with hundreds of clients, and can scale up in both speed and storage volume as nodes are added to the cluster. With deployments ranging from just a few systems to thousands of storage servers, Lustre is used with many workloads, among them energy production & seismic modeling, video processing, financial risk analysis, and life-science research. Intel Cloud Edition for Lustre* on Azure is built for use with the virtualized compute instances available from Microsoft Azure scalable cloud infrastructure. Intel Cloud Edition for Lustre* software provides the fast massively scalable storage infrastructure needed to accelerate performance for complex compute jobs and includes CentOS, Lustre, Ganglia, and Lustre Monitoring Tool (LMT).
Many HPC and big data workloads are transitioning from on-premises to Microsoft Azure due to increased flexibility and faster time to results. Lustre can be combined with Microsoft Azure compute instances to address a wide array of use cases. Intel Cloud Edition for Lustre* on Azure is intended to be used as the working filesystem for HPC or other IO intensive workloads. It is not intended to be used as long term storage or as an alternative to cloud storage options. For long term data storage, use colder storage mechanisms such as Azure Blobs, and use Lustre whenever a high-performance shared filesystem is required.
Intel Cloud Edition for Lustre* Architecture on Azure
A Lustre filesystem consists of several types of components:
- Management Server (MGS) – stores the configuration for all Lustre filesystems in the cluster
- Metadata Server (MDS) – provides file names, directories, and ACL metadata to clients
- Metadata Target (MDT) – storage device that stores the metadata
- Object Storage Server (OSS) – provides file I/O via network to clients
- Object Storage Target (OST) – storage device that stores stripes of files
- Lustre Clients – Linux servers with a special Lustre kernel module
Intel Cloud Edition for Lustre* on Azure creates multiple virtual machines within a specific subnet (e.g. 10.0.0.0/24) in a new or an existing virtual network. The servers are able to communicate with each other via their private IP addresses. The Management Server includes Ganglia Monitoring System and Lustre Monitoring Tool and is assigned a public IP address, making it accessible over the Internet via SSH and HTTP/S. Once you establish an SSH connection to the MGS node, you can use it as a jumpbox to access other Lustre server and client nodes in the same virtual network. Lustre client nodes can be deployed into another subnet (e.g. 10.0.1.0/24) within the same virtual network, allowing them to communicate with the Lustre MGS, MDS, and OSS servers using their private IP addresses.
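The subnet layout described above can be sanity-checked before deployment. The following is a minimal sketch using Python's standard ipaddress module. The two /24 subnets are the examples from this article, while the 10.0.0.0/16 virtual network range and the function names are assumptions made only for illustration.

```python
import ipaddress

# Hypothetical address plan: the /16 virtual network range is an assumption;
# the two /24 subnets are the examples used in this article.
VNET = ipaddress.ip_network("10.0.0.0/16")
SERVERS_SUBNET = ipaddress.ip_network("10.0.0.0/24")
CLIENTS_SUBNET = ipaddress.ip_network("10.0.1.0/24")


def check_layout(vnet, servers, clients):
    """Return True if both subnets sit inside the vnet and do not overlap."""
    inside = servers.subnet_of(vnet) and clients.subnet_of(vnet)
    return inside and not servers.overlaps(clients)


def subnet_for(address, servers=SERVERS_SUBNET, clients=CLIENTS_SUBNET):
    """Name the subnet that a private IP address belongs to."""
    ip = ipaddress.ip_address(address)
    if ip in servers:
        return "servers"
    if ip in clients:
        return "clients"
    return "unknown"


print(check_layout(VNET, SERVERS_SUBNET, CLIENTS_SUBNET))
print(subnet_for("10.0.0.4"), subnet_for("10.0.1.5"))
```

Because both subnets belong to the same virtual network, a client in the second subnet can reach the servers in the first one by private IP address without any extra routing.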
Lustre clients load a special filesystem driver module into the kernel and the filesystem is mounted just like other local or network filesystems. Client applications are able to see a single unified filesystem even though it is composed of multiple server nodes and many disks.
Each Object Storage Server (OSS) has multiple attached data disks, each of which is referred to as an Object Storage Target (OST). These disks are backed by persistent VHDs that are stored as Azure Page Blobs in one or more Azure Storage Accounts. Since the goal of using Lustre is to provide shared high performance storage to multiple clients, it is recommended to use Premium Storage for the data disks and DS-series VMs for the compute nodes.
Lustre performance depends on the network throughput between the clients and the object storage servers. In Azure, the network bandwidth available to a virtual machine usually depends on the VM size (for example, its total number of cores). Therefore, for the highest volume and aggregate throughput workloads, OSS servers should be deployed on the higher-end DS-series VMs. Even though GS-series VMs can provide even more disk and network bandwidth, the parallel nature of Lustre often makes it possible to achieve similar performance with a few additional DS-series VMs.
Deploying Intel Cloud Edition for Lustre* from Azure Marketplace
You can find the current Intel Cloud Edition for Lustre* Software – Eval in the Azure Marketplace by searching for "Lustre".
Important Prerequisite: Core Quota
For the Lustre deployment to succeed, your subscription must have sufficient Azure Resource Manager core quota in the region you want to use (e.g. 25 cores in West US). Most new Azure subscriptions start with a relatively low per-region core quota, while the default Lustre deployment needs at least 12 cores (e.g. 2 DS2 with 2 cores each and 2 DS3 with 4 cores each). To examine your subscription's core quota, you can use the azure vm list-usage command in the Azure CLI or the Get-AzureRmVMUsage cmdlet in Azure PowerShell 1.x. If your core quota is too low, please go to the portal and create a support request to raise your Azure Resource Manager core quota for the region into which you want to deploy.
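The core arithmetic above can be sketched in a few lines of Python. The per-size core counts (DS2 = 2, DS3 = 4) are the ones stated in this article. The function names and the cores_in_use parameter are hypothetical helpers for illustration, not part of any Azure tool.

```python
# Core counts per VM size, as stated in this article (DS2 = 2, DS3 = 4).
CORES_PER_SIZE = {"DS2": 2, "DS3": 4}


def required_cores(vm_counts, cores_per_size=CORES_PER_SIZE):
    """Sum the cores needed for a mapping of VM size -> number of VMs."""
    return sum(cores_per_size[size] * count for size, count in vm_counts.items())


def fits_quota(vm_counts, quota, cores_in_use=0):
    """Return True if the deployment fits in the remaining regional quota."""
    return required_cores(vm_counts) <= quota - cores_in_use


# Default deployment from this article: 2 x DS2 and 2 x DS3.
default_deployment = {"DS2": 2, "DS3": 2}
print(required_cores(default_deployment))        # 12
print(fits_quota(default_deployment, quota=25))  # True
print(fits_quota(default_deployment, quota=10))  # False
```

Remember to include the cores that the Lustre clients will need as well, since they count against the same regional quota.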
Fill in the basic parameters for the deployment such as "VM name prefix", "Admin Username / SSH public key or Password", "Resource Group", and "Location" (e.g. West US).
Step 2: Configure Management Server (MGS)
Provide a globally unique domain name prefix that will be assigned to the Public IP address of the MGS node. The MGS node functions as the jumpbox for connecting to the other nodes of the Lustre cluster using their private IP addresses. The domain name prefix must be 3 to 50 characters long and should contain only lowercase letters, numbers, and dashes. The domain name suffix (e.g. [dnsPrefix].westus.cloudapp.azure.com) will be automatically updated based on the selected location. Please ensure that the domain name prefix you specify is globally unique (e.g. yourcompany-lustre001).
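As a rough illustration of the naming rule, the sketch below checks a prefix against the constraints stated in this article (3 to 50 characters; lowercase letters, numbers, and dashes) and builds the resulting domain name. The function names are hypothetical. Azure may enforce additional rules (for example on the first or last character) that this sketch does not check, and it cannot tell whether a prefix is already taken.

```python
import re

# Rule from this article: 3 to 50 characters, lowercase letters, numbers, dashes.
# Any additional Azure-side restrictions are NOT covered by this pattern.
_PREFIX_RE = re.compile(r"[a-z0-9-]{3,50}")


def is_valid_dns_prefix(prefix):
    """Return True if the prefix satisfies the rule described in the article."""
    return _PREFIX_RE.fullmatch(prefix) is not None


def mgs_fqdn(prefix, location):
    """Build the public domain name of the MGS node for a given location."""
    if not is_valid_dns_prefix(prefix):
        raise ValueError(f"invalid domain name prefix: {prefix!r}")
    return f"{prefix}.{location}.cloudapp.azure.com"


print(is_valid_dns_prefix("yourcompany-lustre001"))
print(is_valid_dns_prefix("MyLustre"))
print(mgs_fqdn("yourcompany-lustre001", "westus"))
```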
Select the size of the MGS Virtual Machine. The default size of DS2 should be sufficient for most workloads, but you can select another size.
If you will be using Premium Storage (recommended) to store the data disks, you must select a VM series that supports premium storage (i.e. DS or GS).
Step 3: Configure Metadata Server (MDS)
Select the size of the Metadata Server (MDS) node and provide the name for the Lustre filesystem that will be created. The name of the Lustre filesystem (e.g. scratch, lustre, shared_data) will be used when mounting it from Lustre client nodes as mgsip@tcp0:/FILESYSTEM_NAME.
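To make the mount specification concrete, here is a small sketch that assembles it from the MGS private IP address and the filesystem name. The helper names are hypothetical, and the /mnt/FILESYSTEM_NAME mount point follows the convention used later in this article. The sketch only builds the command as a list of arguments and does not run it.

```python
def lustre_mount_source(mgs_ip, filesystem_name, network="tcp0"):
    """Build the Lustre mount source in the form mgsip@tcp0:/FILESYSTEM_NAME."""
    return f"{mgs_ip}@{network}:/{filesystem_name}"


def lustre_mount_command(mgs_ip, filesystem_name, mount_root="/mnt"):
    """Build (but do not execute) a mount command for a Lustre client."""
    source = lustre_mount_source(mgs_ip, filesystem_name)
    mount_point = f"{mount_root}/{filesystem_name}"
    return ["mount", "-t", "lustre", source, mount_point]


# Values from the walkthrough in this article: MGS at 10.5.0.4, filesystem "scratch".
print(lustre_mount_source("10.5.0.4", "scratch"))
print(" ".join(lustre_mount_command("10.5.0.4", "scratch")))
```

On a real client the resulting command would need root privileges, an existing mount point directory, and the Lustre client kernel module.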
Step 4: Configure the Object Storage Servers (OSS)
Select the number of OSS servers to create in the Lustre cluster. Currently (as of October 2015), the Intel Cloud Edition for Lustre* Software - Eval offer on Azure supports cluster sizes of 2, 4, and 8 OSS servers. Configure the size of the OSS servers, making sure to select a VM series (DS or GS) that supports premium storage if you want optimal, stable, and consistent performance.
In addition to its OS disk, each OSS server will have 3 attached data disks. Select the size of each data disk based on the total space requirements and performance. For premium storage, the larger the disk the higher the maximum throughput.
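When choosing the disk size, it can help to work out the total number of OSTs and the raw capacity of the cluster. The sketch below uses the figures from this article (3 data disks per OSS; 2, 4, or 8 OSS servers). The function names are hypothetical, and the result is raw disk capacity: usable space will be somewhat lower because of filesystem overhead.

```python
DISKS_PER_OSS = 3               # each OSS has 3 attached data disks in this offer
SUPPORTED_OSS_COUNTS = (2, 4, 8)  # cluster sizes supported by the offer


def ost_count(oss_count):
    """Total number of Object Storage Targets in the cluster."""
    if oss_count not in SUPPORTED_OSS_COUNTS:
        raise ValueError(f"unsupported number of OSS servers: {oss_count}")
    return oss_count * DISKS_PER_OSS


def raw_capacity_gb(oss_count, disk_size_gb):
    """Raw capacity in GB, before any filesystem overhead."""
    return ost_count(oss_count) * disk_size_gb


# Example used later in this article: 4 OSS servers with 512GB data disks.
print(ost_count(4))             # 12
print(raw_capacity_gb(4, 512))  # 6144
```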
Step 5: Configure Storage and Network Settings
Enter a globally unique storage account prefix that will be automatically concatenated with a numeric suffix (e.g. 1, 2) to create one or more storage accounts that will be used to store the data disks for all VMs in the cluster. For most workloads, storage account type should be the default recommended "Premium_LRS" (please make sure that you selected DS or GS series for all nodes when using premium storage).
Lustre nodes can be deployed into a new or existing virtual network. For a new virtual network, the deployment will create two subnets, putting the servers into one (e.g. subnet-lustre-servers) and leaving the second one empty for deploying Lustre clients (e.g. subnet-lustre-clients).
Review the summary of the deployment options and optionally download the deployment template for programmatic re-deployment.
Step 7: Start the deployment
The current (October 2015) Intel Cloud Edition for Lustre* Software - Eval offer does not have software license fees (i.e. it is free from software licensing perspective). However, please remember that Lustre is a cluster solution that requires multiple instances (at least 1 MGS, 1 MDS, and 2 OSS for a total of 4 nodes) and Microsoft Azure infrastructure (e.g. compute, storage, and network) pricing will vary depending on the cluster configuration (i.e. number of nodes, premium or standard storage, etc.)
Accessing the Provisioned Lustre Cluster
Intel Cloud Edition for Lustre* cluster deployment process usually takes 15-20 minutes. Once the deployment is complete, you can view the output of the deployment for some helpful information on accessing the cluster's MGS node.
Ganglia Monitoring System is installed on the MGS node and is publicly accessible via port 80. You will want to either restrict port 80 access to your own IP addresses or configure Apache with basic authentication.
You can SSH into the MGS node and run the Lustre Monitoring Tool ltop command to see the details of the filesystem. To learn more about the ltop command, run "man ltop" on the MGS node. For example, you can use a variety of interactive commands such as "c" to toggle a condensed view with one line per OSS instead of each OST showing a separate line.
In our example, the filesystem is called "scratch" and consists of 4 OSS servers (lustreoss0, 1, 2, 3) each one with three 512GB data disks / OSTs.
As you can see above, Lustre filesystem "scratch" has not yet been accessed by any clients since there is no current utilization and no granted locks.
Deploying Lustre Clients from GitHub ARM Template
To use the shared parallel filesystem provided by the created Lustre cluster, you will need to install the Lustre client kernel module on one or more supported Linux clients (e.g. CentOS 6.6, CentOS 7.0, SLES11 SP3). The Lustre documentation provides a walkthrough describing how to create a Lustre client.
However, in this example, instead of manually creating each Lustre client, we will easily deploy a few Lustre clients using the following open source quickstart template:
The template will create 2 or more Lustre 2.7 client virtual machines and will mount an existing Lustre filesystem. The clients must be deployed into an existing Virtual Network that already contains an operational Lustre filesystem. When deploying this template, you will need to provide the private IP address of the MGS node and the name of the filesystem that was created when Lustre servers were deployed. The clients will mount the Lustre filesystem at a mount point like /mnt/FILESYSTEMNAME (e.g. /mnt/scratch).
On the GitHub page of the template, click the "Deploy to Azure" button to be redirected to the Azure Portal "Microsoft Template" deployment screen and fill in the required parameters.
We will need to know the private IP address of the MGS node, which we can obtain from the MGS SSH session via "ifconfig" or "sudo lctl list_nids", or by examining the output of the Lustre server deployment in the Azure portal. We will also need the name of the Lustre servers resource group and the virtual network.
If your Lustre clients need to communicate with each other through the InfiniBand RDMA network (e.g. A8 and A9), they must be deployed into the same Availability Set. In addition, to use InfiniBand and RDMA, you must use the HPC-specific CentOS 7.1 and 6.5 images that include Intel MPI drivers pre-installed. This template deploys the client nodes into an availability set and lets you use either the HPC-specific (CentOS-HPC 7.1 and 6.5) or regular (CentOS 7.0 and 6.6) images. You can see the HPC-specific images highlighted with a red border in the screenshot below.
In this walkthrough, we have the following:
- MGS private IP: 10.5.0.4
- Lustre Servers Resource Group Name: avlustre001
- Lustre Servers Virtual Network Name: vnet-lustre
- Lustre Servers Virtual Network Clients Subnet Name: subnet-lustre-clients
Once the client deployment is complete, you can view the Outputs section to see the domain name of the Public IP of CLIENT0 that we will use to SSH into it:
On CLIENT0, use the "df -h" command to view the currently mounted file systems, and confirm that Lustre is mounted at /mnt/scratch
In addition, we can list the contents of the shared Lustre directory to confirm that client0 and client1 were both able to write a 200MB test file there:
When Lustre clients restart, they will automatically re-mount the Lustre directory because of the following record in the /etc/fstab file:
Lustre File Striping
In many Lustre use-cases, a technique called file striping will increase IO performance. For example, file striping will improve performance for applications that do serial IO from a single node or parallel IO from multiple nodes writing to a single shared file. A Lustre file is striped when read and write operations access multiple Object Storage Targets (OSTs) concurrently. File striping increases IO performance since writing to or reading from multiple OSTs simultaneously increases the available IO bandwidth. For instance, if there are 4 OSS servers each with 3 attached data disks, a file can be striped across 12 devices. This will allow multiple clients to read portions of the file from distinct OSS servers, providing higher aggregate network bandwidth and higher overall disk throughput. In addition, by placing chunks of a file on multiple OSTs, the space required by the file can also be spread over the OSTs. Therefore, a file's size is not limited to the space available on a single OST. However, it is important to remember that by placing chunks of a file across multiple OSTs, the odds that an event will take one of the file's OSTs down or impact data transfer increase.
Selecting the best striping configuration depends on the specific workload and requires performance testing, since striping a file over too few OSTs will not take advantage of the system's available bandwidth, while striping over too many will cause unnecessary overhead.
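To illustrate how a striped file is laid out, the sketch below models the round-robin placement of fixed-size chunks across a file's OSTs. It is a simplified model for intuition only, not Lustre code: the function names are hypothetical, the 1 MiB stripe size is assumed here as a typical default, and the stripe index is the position within the file's own layout rather than an actual OST number.

```python
MIB = 1 << 20


def locate(offset, stripe_count, stripe_size=MIB):
    """Map a byte offset in a file to (stripe index, offset inside that stripe's object)."""
    chunk = offset // stripe_size
    stripe_index = chunk % stripe_count
    object_offset = (chunk // stripe_count) * stripe_size + offset % stripe_size
    return stripe_index, object_offset


def bytes_per_stripe(file_size, stripe_count, stripe_size=MIB):
    """Return how many bytes of a file land on each of its stripes."""
    full_chunks, remainder = divmod(file_size, stripe_size)
    # Stripe i holds chunks i, i + stripe_count, i + 2 * stripe_count, ...
    totals = [len(range(i, full_chunks, stripe_count)) * stripe_size
              for i in range(stripe_count)]
    if remainder:
        totals[full_chunks % stripe_count] += remainder
    return totals


# A 200MB file (taken here as 200 MiB) striped over all 12 OSTs
# of a cluster with 4 OSS servers and 3 disks each.
layout = bytes_per_stripe(200 * MIB, stripe_count=12)
print([n // MIB for n in layout])
print(locate(13 * MIB + 5, stripe_count=12))
```

In this model each OST receives roughly one twelfth of the file, which is why reads and writes can proceed against all twelve disks in parallel, and also why losing any one of those OSTs affects the file.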
View striping information for a directory: lfs getstripe /mnt/scratch
Set striping for newly created files (Lustre cannot change an existing file's stripe size): lfs setstripe --count -1 /mnt/scratch/dir1
(setting count to -1 means stripe over all available OSTs)
Please see Intel Lustre Manual to learn more about Lustre File Striping.
By now you should have a good understanding of how to deploy the Intel Cloud Edition for Lustre* from Microsoft Azure Marketplace (i.e. the Lustre servers) and how to deploy Lustre clients to actually mount and use the filesystem. With some work, you can modify both the Azure Marketplace "server" template and the GitHub quickstart "client" template to even better suit your specific needs (e.g. creating more servers, using a custom image for clients, etc.)
I'm looking forward to your feedback and questions via Twitter https://twitter.com/ArsenVlad