Set up Azure Ubuntu VM with XFS on RAID

 

Summary:

If you are running applications like MySQL, MongoDB, Cassandra, Kafka or any other custom application that performs frequent disk I/O against the data disk(s) of an Azure Linux Virtual Machine, consider using a RAID array of data disks to improve performance by increasing the concurrency of reads and writes. RAID 0 should be sufficient because Azure Storage already provides the redundancy guarantees for your data disks that nested or more complex RAID levels are designed to provide.

This blog shows you how to set up an Azure Ubuntu VM with 4 data disks striped into a single logical RAID 0 volume. On top of that, it also shows you how to mount this RAID device with the XFS file system. The bigger your RAID device gets, the more data you are managing on it, and XFS, a high-performance file system with built-in scalability and robustness, can get you further performance mileage on top of RAID. RAID + XFS is a good combination of hardware and software coming together to boost performance in the cloud!

  

Target Audience

  • Cloud Architects
  • Technology Decision Makers
  • Senior Developers
  • Developers

Prerequisites

  • Azure subscription

 

Steps:

  1. Use the Resource Manager deployment method to create an Ubuntu Linux VM from the Azure Preview Portal. Depending on how many data disks you want in your RAID array, you have to choose the right VM size, because the maximum number of data disks you can attach to a VM varies with the size SKU. In this example, I show you how to build a 4-disk RAID 0, so if you are using the A series you should use at least an A2.

    [Note: Though it is technically possible to create nested or more complicated RAID configurations (like RAID 10), it is not necessary, because Azure data disks are backed by the redundancy policies (LRS/GRS) of the underlying storage account. The RAID configuration discussed here is strictly for a performance increase through greater concurrency of reads and writes, not for redundancy. Hence, we will discuss RAID 0 only - in which case striping lets us write to and read from several disks in parallel]

  2. Log into the newly created Ubuntu VM over SSH using PuTTY. If you are not sure how to do so, here are the steps to connect using PuTTY when you have set up simple username/password authentication during VM creation: Steps to use Putty to log into your Azure Linux VM
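
    [Note: If you are connecting from a Linux or macOS machine rather than Windows, the standard OpenSSH client works just as well. The user name and host below are placeholders for your own values:]

    ssh <your-username>@<vm-public-ip-or-dns-name>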

  3. Azure, by default, gives you 2 disks on the newly created VM - the OS disk and the temporary storage disk. The OS disk has the OS (as the name implies). The temporary storage disk is just that - do not use it for persistent data. It is not backed up or replicated as part of the underlying storage account. If the VM is rebooted, the contents of this temporary storage disk may be erased. Even if it survives reboots, data on the temporary disk will not survive a host OS failure or a VM resizing!

    [Note: The temporary disk is a great place for a Linux swap file. If you are trying to get the most out of your Azure Virtual Machine while running memory-intensive applications, feel free to add a swap file and point it at the temporary disk (mounted as "/mnt") as its location. This blog shows how to add a swap file after a Linux Azure VM is provisioned. Part 2 of the same MSDN blog shows how to configure swap space automatically at VM boot time by configuring the Linux VM Agent; a minimal sketch of the relevant agent settings appears at the end of this step. For reference, the Linux VM Agent is an open-source Python script that handles all system-level communication between an Azure Linux VM and the hypervisor.]

    1. Check that you have these 2 disks mounted. You can do this in a few ways, each providing some unique information:

      (1) Run "sudo fdisk -l" - You should see the partitions /dev/sda1 (on the OS disk /dev/sda) and /dev/sdb1 (on the temporary storage disk /dev/sdb)
      (2) Run "df" - You should see that the temporary disk partition (/dev/sdb1) is mounted as /mnt (this is the default behavior in the Linux VM Agent). Remember this while developing/ configuring your applications
      (3) Run "sudo grep SCSI /var/log/syslog" - you should see 1 line of system log for each of the 2 disks being attached to the VM
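
    [Note: A minimal sketch of the boot-time swap configuration mentioned above, using settings the Linux VM Agent reads from /etc/waagent.conf. The option names can vary slightly between agent versions, and the swap size shown is purely illustrative:]

    # In /etc/waagent.conf (illustrative values)
    ResourceDisk.Format=y
    ResourceDisk.Filesystem=ext4
    ResourceDisk.EnableSwap=y
    ResourceDisk.SwapSizeMB=2048

    # Restart the agent so the change takes effect (service name on Ubuntu)
    sudo service walinuxagent restart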

  4. Use the Azure Preview Portal to add a data disk (Properties --> Disks --> Add/Attach New):

    1. Set "Host Cache Preference" to None. Enabling this caching at the Azure host level is recommended only for light read workloads. For the kinds of applications we are dealing with here (heavy writes), and for big files (more than 20 GB), it should be left disabled regardless of the read workload. This is true for both Standard Storage and Premium Storage disks

    2. In this example, I used a size of 250 GB for the new data disk (my eventual goal being to add 4 data disks, totalling 1 TB of persistent, replicated storage)

    3. Choose Type "Standard" (unless you want Premium SSD Drives for even greater IOPS and performance gain)

      [Note: You can automate this step using the Azure CLI or PowerShell - see the CLI sketch after these notes]

      [Note: After you add a new disk from the portal (or the CLI or PowerShell), the PuTTY session may become unresponsive for a while, because the Azure fabric restarts a few system services in order to add the disk. You may also lose the PuTTY connection. Even after connectivity is restored, the verification steps listed below in step 5 may not show the new disk for up to a minute or so]

      [Note: More on Azure Data Disks: Depending on VM size, you can add up to 16 additional disks on the A Series, 32 on the D and DS Series and 64 on the G Series, each up to 1 TB. Hence, the maximum amount of additional disk space is 16 TB for the A Series, 32 TB for the D Series, 32 TB for the DS Series and 64 TB for the G Series. Each disk has a performance target of 300 IOPS (for the Basic VM tier), 500 IOPS (for the Standard VM tier with Standard storage), and up to 5000 IOPS with Azure Premium Storage. It is recommended to add data disks as needed per your space and IOPS requirements and not to put application data on the OS disk (OS disks are usually small and optimized for fast boot times). For optimal performance, use storage accounts in the same Azure data center as your VM.]

      [Note: More on Replication Options for Data Disks: The storage account behind your data disks has several replication options: LRS (Locally Redundant Storage, 3 copies), GRS (Geo-Redundant Storage, 6 copies), Read-Only GRS and ZRS. Premium Storage only offers LRS. Even for Standard storage, I recommend saving some cost and choosing LRS for multi-data-disk scenarios, because replication across geographical regions does not come with any consistency guarantee between the disks - the replica of disk 1 may be inconsistent with the replica of disk 2 in the remote region, as inter-region replication is asynchronous and only focuses on maintaining the integrity of the particular blob representing the VHD of a single disk]
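
      [Note: As an illustration of the automation mentioned above, here is a rough sketch using the cross-platform Azure CLI ("az"). The resource group, VM and disk names are placeholders, and the available parameters vary between CLI versions, so check "az vm disk attach --help" before relying on this:]

      # Attach a new, empty 250 GB data disk to an existing VM (placeholder names)
      az vm disk attach --resource-group myResourceGroup --vm-name myUbuntuVM --name myDataDisk1 --new --size-gb 250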

  5. After attaching the data disk, you can verify that it has been added to the VM but does not yet have any partitions, using the following mechanisms:

    1. "sudo grep SCSI /var/log/syslog" will show you a new line for the new disk (/dev/sdc if you are adding your first data disk)

    2. "sudo fdisk -l" will now show you a new disk (/dev/sdc if you are adding your first data disk), but it will also say "Disk /dev/sdc doesn't contain a valid partition table"

  6. Run "sudo apt-get update" followed by "sudo apt-get install mdadm". If prompted for a configuration change, select "No configuration change"

  7. Create a partition (/dev/sdc1) on your new disk /dev/sdc:

    1. Run "sudo fdisk /dev/sdc"
       

    2. Commands to enter, in this order, once fdisk enters its interactive mode:

      1. n (add a new partition)
      2. p (the partition will be the primary partition on the new disk)
      3. 1 (it will be partition number 1)
      4. Nothing (accept default) for 1st sector/ cylinder (your partition will start at the default offset into the disk)
      5. Nothing (accept default) for the last sector/ cylinder (your partition will end at the default offset into the disk)
      6. p (Optional, review before committing)
      7. t (To change the partition's system id so that it can participate in a RAID array later on)
      8. L (Optional, to review the list of available system id hex codes, we will use 'fd')
      9. fd (change the system id for the new partition to 'Linux raid autodetect')
      10. p (optional, review before committing)
      11. w (Commit)
    3. Run "sudo fdisk -l" and verify that now, fdisk shows the new partition (/dev/sdc1, which is still unmounted, so "df" will not show it yet)

      [Note: It is possible to create multiple partitions on a data disk (like /dev/sdc1, /dev/sdc2 and so on). But for the sake of this discussion, we are not going to explore that, as we are going to build a RAID array out of all our partitions and it will appear as a single partition anyway]
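
      [Note: If you would rather script this partitioning step than walk through fdisk interactively, parted can do the same thing non-interactively. This is just a sketch, assuming the disk is /dev/sdc and holds no data yet - "mklabel" wipes any existing partition table:]

      sudo parted --script /dev/sdc mklabel msdos            # create a new, empty MBR partition table
      sudo parted --script /dev/sdc mkpart primary 0% 100%   # one primary partition spanning the disk
      sudo parted --script /dev/sdc set 1 raid on            # mark it as Linux raid autodetect (type fd)
      sudo fdisk -l /dev/sdc                                  # verify the new /dev/sdc1 partition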

  8. Repeat steps 4, 5 and 7 above (skipping step 6) thrice to add 3 more disks, and create the partitions /dev/sdd1 (on disk /dev/sdd), /dev/sde1 (on disk /dev/sde) and /dev/sdf1 (on disk /dev/sdf)

  9. Once all 4 disks have been added to your Ubuntu VM, and one partition spanning the whole disk has been created on each, we will stripe them together in a RAID 0 configuration using the mdadm utility:

    sudo mdadm --create /dev/md127 --level 0 --raid-devices 4 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
    

    [Note: This command will create a new RAID device, /dev/md127 - a logical disk that stripes all the 4 data disks you added to the VM. Run "sudo fdisk -l" to see that /dev/md127 is now being listed as a logical disk]

    [Note: You may need to use --force parameter with the above mdadm command if you had previously created a different RAID configuration using your disks]

    [Performance tips: It is possible to modify the chunk size of the RAID array by passing a non-default value to the -c (or --chunk) parameter, the default being 512 KB. Read/write performance varies with chunk size, though the difference is not dramatic. I have seen conflicting advice - some experiments have concluded that a 64 KB chunk size provides the best performance, whereas experiments on disk performance (IOPS) conducted by the Azure team have shown that the default chunk size of 512 KB gets the maximum IOPS. See here: https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-optimize-mysql-perf/#AppendixC I intend to publish a second blog after I run my own tests regarding chunk size. A sketch of recreating the array with a different chunk size follows these notes.]

    [Note: Use the command "cat /proc/mdstat" to view the stats/details of the RAID array. It also reports the chunk size (stripe size)]
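
    [Note: As a sketch of the chunk-size experiment mentioned in the performance tip above - to try a 64 KB chunk instead of the 512 KB default, you would stop and recreate the (still empty) array. This destroys whatever is on the member partitions, so only do it before putting a file system on the device:]

    sudo mdadm --stop /dev/md127
    sudo mdadm --create /dev/md127 --level 0 --chunk 64 --raid-devices 4 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
    cat /proc/mdstat          # confirms the array is running and reports the chunk size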

  10. Put a file system on the new RAID device.

    Here, if you are putting ext4, run "sudo mkfs -t ext4 /dev/md127"

    However, we will use the XFS file system, a high-performance file system that can handle massive amounts of data and is architecturally scalable and robust. Oracle Linux and RHEL have now adopted XFS as their default file system. Hence:

    1. sudo apt-get install xfsprogs (installs XFS binaries)
    2. sudo mkfs.xfs -f /dev/md127 (formats the new RAID 0 logical drive with XFS with default file system configuration)
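
    [Note: mkfs.xfs normally detects the stripe geometry of an md device automatically, so the default command above is usually all you need. If you ever want to set the geometry explicitly, the values for this 4-disk array with the default 512 KB chunk would look roughly like this - a sketch, not a required step:]

    sudo mkfs.xfs -f -d su=512k,sw=4 /dev/md127    # su = RAID chunk size, sw = number of data disks
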
  11. Mount your new file system as /RAID0 (or any other descriptive name)

    First create the directory:
    sudo mkdir /RAID0

    If you have used EXT4 file system, use the command:
    sudo mount /dev/md127 /RAID0

    If you have used XFS, use the command:
    sudo mount -t xfs /dev/md127 /RAID0

    Check whether mounting was successful using "df" (or "df -Th /RAID0" to see information about only the RAID 0 device). Your RAID 0 device is now ready to be used with the XFS file system at /RAID0
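
    Optionally, for XFS you can also inspect the geometry the file system was created with (block size, sunit/swidth and so on) as a quick sanity check:

    sudo xfs_info /RAID0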

  12. Make the mounting of this device persistent across reboots by adding a line to /etc/fstab:

    Find out the UUID of the RAID 0 partition using the blkid utility. On Azure, it is recommended that you use this UUID instead of the name ("/dev/md127") so that the file system is correctly mounted after reboots:

    sudo -i blkid

    Copy the UUID reported for /dev/md127. Say it is "f8adfafd-c008-4c5f-971e-57407af9bfbb".

    For ext4, add the following line (or its variants) to /etc/fstab:

     #RAID 0 device 
    UUID=f8adfafd-c008-4c5f-971e-57407af9bfbb /RAID0 ext4 defaults 0 0 #IMPORTANT TO GET THIS RIGHT
    

    For xfs, add the following line (or its variants) to /etc/fstab:

     #RAID 0 device 
    UUID=f8adfafd-c008-4c5f-971e-57407af9bfbb /RAID0 xfs rw,noatime,attr2,inode64,nobarrier,sunit=1024,swidth=4096 0 0 #IMPORTANT TO GET THIS RIGHT
    

    Test the entry in /etc/fstab by unmounting and re-mounting /RAID0:

    sudo umount /RAID0
    sudo mount /RAID0

    If there was a mistake in the line(s) added to /etc/fstab, mounting will fail. /etc/fstab is very sensitive to errors, and it is very important to get it right - otherwise your VM may not come back up after a reboot. If you want to make the VM resilient to mounting errors at boot time, consider using the "nobootwait" or "nofail" options on the /etc/fstab line, after carefully understanding what they do and how they will impact your application functionality.
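
    For example, a variant of the earlier XFS entry with "nofail" appended to the option list (a sketch - confirm the option behaves the way you expect on your distribution before relying on it):

    UUID=f8adfafd-c008-4c5f-971e-57407af9bfbb /RAID0 xfs rw,noatime,attr2,inode64,nobarrier,sunit=1024,swidth=4096,nofail 0 0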

    [Note: Familiarize yourself with the various columns in the /etc/fstab entry. The most important column is the one that specifies all the mounting options. These options control performance of your file system and disk. See XFS performance tips below]

    [Performance tips for XFS: The number of options you can specify in the 4th column of the /etc/fstab entry can be mind-boggling. XFS is generally very good at choosing the right defaults for the best performance. However, you may sometimes need to tweak these options to improve performance.

    1. Usually, "noatime" (which implies "nodiratime") gives the best performance compared to "atime" or "relatime", but it may be impossible to use if your application requires access times to be logged for security or audit purposes. In that case, "relatime" is a good compromise between "atime" and "noatime"
    2. Option "attr2" is an opportunistic performance improvement option (in case your application is using extended attributes, it has no effect if extended attributes are not used)
    3. For large disks (> 1 TB), "inode64" can significantly improve read/write performance, but it can be used only if your applications do not expect 32-bit inodes. Older applications, especially over NFS, may have problems with inode64. In our example, we are creating a 1 TB RAID 0 drive, so inode64 was not strictly needed here
    4. Option "nobarrier" improves write performance drastically and is recommended for Premium Storage. Premium Storage is durable in nature and will preserve writes across power failures, system crashes etc. (in fact, Premium Storage comes with the same degree of durability as LRS storage and keeps 3 copies of the data). Therefore, barrier support from the file system is not necessary, and removing it improves write performance dramatically. Even for Standard storage, removing barrier support yields a lot of performance benefit. If you follow Azure's recommendation of using availability sets (where you have redundant VMs on different fault and update domains to get the 99.95% availability SLA), you should be able to remove barrier support with peace of mind
    5. Calculation formulae for "sunit" and "swidth" can easily be found on the internet, and setting them manually is only required when using XFS with RAID. Newer XFS tools calculate the optimum values automatically, but when using RAID it is a good idea to set them explicitly. Quick formula without explanation: sunit = stripe size (RAID chunk size) in bytes divided by 512, and swidth = sunit * n, where n for RAID 0 is the number of disks (4 in this example) - worked out below for our configuration]
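
    For this example (512 KB chunk size, 4 disks in RAID 0), the values used in the /etc/fstab line above work out as:

    sunit  = chunk size in bytes / 512 = (512 * 1024) / 512 = 1024
    swidth = sunit * number of disks   = 1024 * 4           = 4096
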
  13. Consider making your VM resilient to RAID failures by using the kernel parameter "bootdegraded=true". Different Linux distributions have different ways of passing this parameter to the kernel. Research and understand the implications of this parameter before you use it