MySQL Cluster IaaS best practices for Azure – High Availability, Fault Tolerance, Scalability


Introduction

MySQL Cluster, used correctly, has proven to be one of the most fault-tolerant, scalable, real-time, distributed, Active/Active-capable enterprise data storage solutions. Web properties with heavy daily traffic like PayPal and Zillow, as well as leading telecommunications and network equipment providers, including Alcatel-Lucent, AT&T, BT, and Ericsson, have successfully deployed MySQL Cluster. You can listen to PayPal's Chief Architect talk about using MySQL Cluster for PayPal fraud detection in this video, starting at the 24 minutes 20 seconds mark. Check out the Zillow case study here. They use the MySQL Cluster Carrier Grade Edition ("CGE"), which includes support and useful enterprise tools - but the MySQL Cluster Community Edition is free to try out and play around with.

Why another best practice guide?

So why another best practices guideline for MySQL Cluster? Because when it comes to deploying and getting the best out of MySQL on Azure as an IaaS solution, we have to keep certain things in mind that are not necessarily covered very well out there today. Oracle has a whitepaper intended to cover MySQL for the cloud, called "A Guide to MySQL for SaaS". However, I found it to be more marketing material than technical guide. Also, it covers all MySQL offerings broadly, including MySQL Cluster, and none very deeply.

Distinguishing MySQL Cluster from MySQL

Which brings me to a distinction - an important one for this blog post: MySQL Cluster is not the same as MySQL Database (which, in short, is just "MySQL"). They do not follow the same release cycle and version number patterns. As of this blog post, the GA version of MySQL Cluster is 7.4, while the GA version of (simply) MySQL is 5.7. MySQL Cluster does not use the same storage engine as MySQL - it uses a different engine called "NDB Cluster". MySQL Cluster does not store the data co-located with the mysqld daemon; it uses separate data nodes. It is a very different beast in that sense - though it still presents the same SQL interface to applications. The difference is important to note for this blog post, because we are going to explore MySQL Cluster on Azure. By contrast, this article and this blog post of mine deal with best practices for using (plain) MySQL on Azure.

PaaS versus IaaS MySQL Cluster for Azure

ClearDB now provides PaaS MySQL Clusters on Azure as well. In fact, their offering seems very robust and technically a lot more attractive than managing your own instances of MySQL Cluster, especially with the recently announced option of Premium Dedicated Clusters. I say that because, personally, PaaS is always more attractive to me than IaaS: you do not have to worry about maintaining/ monitoring/ baby-sitting/ upgrading your infrastructure, you do not have to worry about the up-time SLA, and you do not have to handle scaling your VMs yourself. However, I am not sure how ClearDB stacks up against Azure IaaS cost-wise. Though I have not done an accurate cost comparison between these choices, it is clear to me that MySQL Cluster demands Azure VMs of bigger sizes because of its memory needs - see point 3 in the next section, "What about Performance?". Therefore, neither option promises to be cheap, but I am sure both are cheaper than on-premises servers, which tend to be severely under-utilized and almost always end up having a higher TCO because of man-hours spent maintaining them, money spent on refreshing hardware, and so on.

This blog post is about IaaS - where you set up the MySQL Cluster yourself on Azure. I wrote another blog post about the actual process of creating a MySQL Cluster on Azure, and will not cover that again here. This one is about architectural best practices regarding high availability, fault tolerance and geo-replication.

What about Performance? 

Today, we will cover High Availability, Fault Tolerance, Scalability and some thoughts on Geo-Replication related to MySQL Cluster on Azure. We will leave performance for another day. Why? A few thoughts:

  1. There are a lot of articles on the internet that focus on performance optimization, fine-tuning and cluster configuration for optimal MySQL Cluster performance. Most of these techniques are specific to MySQL Cluster and will work regardless of where it is running – on-premises or cloud, AWS or Azure. However, in my opinion, the HA and fault tolerance aspects are different on the cloud, and need special consideration.
     
  2. You will hear a lot of great things about MySQL Cluster performance. Sometimes it gets difficult to sift through results published by people trying to sell something versus people stating facts bluntly. Not every product is suitable for every use case. You get a feature at the cost of another; trade-offs are made - but these trade-offs are sometimes not discussed openly. What is the trade-off involved with MySQL Cluster? This is a quote from an official MySQL Cluster white paper: "MySQL Cluster is a distributed, shared-nothing data store that provides very high, scalable performance for applications that largely use primary key access and have high concurrency, but it incurs a cost in network access when accessing data between a MySQL Server and tables distributed across Data Nodes". Well of course it does! If you are writing your data spread across multiple data nodes, which can themselves be spread across different regions, you will incur a latency cost. It will slow things down. Software is not magic. So why do we do it? Well, if the benefits outweigh the costs, then sure, why not? But the important point to note is this: every software tool out there wants to be "enterprise grade" - fault-tolerant, BCP-enabled, DR-enabled, able to scale up and down as well as in and out. For an ACID-compliant database, this does not come free - it comes at the cost of latency and performance. Therefore, as MySQL evolved into MySQL Cluster, some performance had to be sacrificed. But good engineers do not just accept that - they provide knobs and levers to adjust this and that, so that users can find the sweet spot between performance and features. The same is true for MySQL Cluster. When JOIN performance started suffering because of data being distributed across several nodes, the MySQL Cluster team came up with Adaptive Query Localization (AQL) in version 7.2. Various configurations are provided to fine-tune performance, sometimes at the cost of durability, sometimes at the cost of RTO/RPO. Most of that is extremely well documented. The focus of this blog post happens to be HA and fault tolerance because these are the reasons we have so many performance levers in the first place!

  3. In the cloud, performance of MySQL Cluster will depend on, among several other factors, the size, CPU and memory of the Virtual Machines you run the data nodes on. MySQL Cluster uses a lot of memory, as it keeps a large part of the database in memory for performance. Everything has knobs and levers, of course, but the recommended RAM size on MySQL Cluster data nodes for a 50 GB database is 64 GB (see page 20 of this whitepaper). In terms of Azure VMs, a D14 or G3 would be the right choice. It is this realization that makes me say that ClearDB is worth a shot - because when you spin up large VMs, you want density and utilization. Neither option (MySQL Cluster PaaS versus IaaS) will be very cheap, but in most cases either will be cheaper than on-premises physical servers or third-party hosting.

 

Evolution of HA in the MySQL product line


Before we jump into the meat of this discussion, there is one other thing to keep in mind: MySQL Cluster represents the peak of MySQL's HA journey - a 99.999% up-time guarantee if the architecture is done correctly. There are several other offerings that are less complex and provide lower levels of HA. Before studying MySQL Cluster, you should study these other offerings, because they will be less complex to set up and definitely cheaper. If one of them happens to suit your needs, there is no need to over-provision.

The best way to understand all these other offerings leading up to MySQL Cluster, in my opinion, is to follow the evolution path. Imagine the single-server MySQL Database. It is a good ACID database, but enterprises do not just need a database to be ACID compliant - they are in constant fear of vandalism, earthquakes and dinosaur-level extinction events. They want to replicate. They want Disaster Recovery with low RTO and RPO. So backing the database up once a day is no longer sufficient; it needs to be continuous.

Stage 1 - single server MySQL with replication

Hence the single-server MySQL Server evolved to provide a simple replication mechanism - an inbuilt feature that does not need any plugins/ add-ons, in a simple master-slave topology. Conceptually, replication can be synchronous - where the write on the master is committed only once it is also committed on the slave, ensuring 100% consistency of the replica at the cost of write performance - or asynchronous, where the master propagates the changes to the slave after the fact, improving write latency but increasing the chance of data loss in case of disaster. Stock MySQL replication is asynchronous by default (we will meet the middle ground, semi-synchronous replication, in the next stage). The master-slave topology also provides the additional benefit of directing readers to the slave if they can tolerate slightly stale data, reducing the load on the master, which can then funnel those resources to write operations - isolating the two workloads from each other.
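To make Stage 1 concrete, here is a minimal sketch of setting up classic (asynchronous) master-slave replication. Host names, credentials and binlog coordinates are placeholders, and both servers are assumed to already have distinct server-id values (and binary logging on the master) in my.cnf:

    -- On the master (my.cnf already contains server-id=1 and log-bin=mysql-bin):
    CREATE USER 'repl'@'%' IDENTIFIED BY 'repl_password';   -- placeholder credentials
    GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
    SHOW MASTER STATUS;             -- note the current binlog file name and position

    -- On the slave (my.cnf already contains server-id=2):
    CHANGE MASTER TO
      MASTER_HOST='master.example.com',     -- placeholder host
      MASTER_USER='repl',
      MASTER_PASSWORD='repl_password',
      MASTER_LOG_FILE='mysql-bin.000001',   -- values from SHOW MASTER STATUS
      MASTER_LOG_POS=154;
    START SLAVE;
    SHOW SLAVE STATUS\G             -- Slave_IO_Running and Slave_SQL_Running should both be Yes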

While that sounds wonderful, there came a day when it was inadequate. As financial transactions moved to the web, even more rigor was required of the replication process. Compliance with stringent standards required multiple replicas, and keeping all of them in sync was a challenge. On top of that, who would oversee the replication itself? What if replication was failing silently for days? It would leave a bad taste in the mouth to find that out only after the replica was actually needed. And the million-dollar question of availability then started staring people in the face: how quickly can we fail over? What use is a replica if we cannot quickly fail over to it? As every second of availability became critical to businesses, the need for sub-second fail-over arose, and with it came the need for near-real-time replication without sacrificing performance (what use is a quick fail-over if the replica is stale?).

Stage 2 - clustered and virtualized systems

At this stage, simple master-slave replication evolved into LAN/WAN-based "failover clustering" approaches. A balance was struck between completely asynchronous replication (which improves write performance but increases the possibility of data loss and stale reads from the slave) and completely synchronous replication (which degrades write performance but keeps the replica consistent). This balance is called "semi-synchronous replication": the master considers a write committed once the change has reached (though not necessarily been applied on) the slave - thereby re-defining the meaning of error and success. Multiple copies of the replica are handled in a "failover cluster", with a "cluster resource manager" constantly monitoring each copy using a heartbeat mechanism, ensuring the health of the copies themselves, so that failover, when and if needed, is quick and seamless. All such mechanisms constantly monitor hardware, OS, network and database processes so that the system can automatically fail over to standby servers when a failure is detected, redirecting applications to the new master as part of the process. Automatic failover was one of the biggest wins in this step of the evolution - lower downtime, smaller losses. Such solutions produced a new breed of HA-aware MySQL products: Oracle Clusterware (on Linux/ Solaris), Oracle VM Template (on Linux), Solaris Cluster (on Solaris), Windows Cluster (on Windows) and DRBD (on Linux). You can explore some aspects of these solutions in this white paper, which calls such systems "Clustered and Virtualized systems".
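For reference, semi-synchronous replication is available in stock MySQL (5.5 and later) as a pair of plugins layered on top of the Stage 1 setup. A hedged sketch (the timeout value is illustrative):

    -- On the master:
    INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
    SET GLOBAL rpl_semi_sync_master_enabled = 1;
    SET GLOBAL rpl_semi_sync_master_timeout = 1000;  -- ms to wait for a slave ACK before degrading to async

    -- On the slave:
    INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
    SET GLOBAL rpl_semi_sync_slave_enabled = 1;
    STOP SLAVE IO_THREAD;
    START SLAVE IO_THREAD;   -- restart the IO thread so the semi-sync setting takes effect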

While that sounds even more wonderful, limitations still existed in the above systems. Availability guarantees were still lower than what mission-critical and massive for-profit organizations needed. As businesses started operating globally, their user bases started interacting with the database from multiple regions. The concept of master-slave suddenly felt inadequate, because the traditional master-slave arrangement only allows writes against the master. All of a sudden, the need for master-master arose, because we now needed the ability to write from different parts of the world. Multi-master topology brought its own challenges, like data consistency issues due to update conflicts.

Stage 3 - MySQL Cluster

MySQL Cluster, therefore, represents the peak of this evolution as of today. It attempts to resolve the above challenges, and provides a 99.999% availability guarantee if the architecture is right - that comes to less than 5.5 minutes of downtime per year, including planned maintenance. I will not go into the details of how MySQL Cluster actually works; my earlier blog post describes the topology of a MySQL Cluster, and this whitepaper explains how MySQL Cluster attempts to solve these problems. MySQL Cluster is shared-nothing and can be geo-replicated. It has three layers - management nodes, application nodes (aka SQL nodes), and data nodes. The most interesting layer is the data nodes; we will soon see how data is spread across them for fault tolerance.

This picture shows the evolution I just described: [diagram: single-server MySQL with replication → clustered and virtualized systems → MySQL Cluster]

What do High Availability and Fault Tolerance mean w.r.t. MySQL Cluster?

From a MySQL Cluster perspective, it means that no single failure can result in the loss of service. That guarantee, however, depends on the architecture of the cluster - not all architectures can provide it. That is one of the things we are going to explore: how to architect the MySQL Cluster on Azure so that its intrinsic guarantee of HA/ BCP is realized.


An important point to note is MySQL Cluster's CAP Theorem stance. In case of a catastrophic network partition, does MySQL Cluster choose availability or consistency? The answer is consistency (as is truly becoming of an ACID-compliant database): it shuts the cluster down to prevent inconsistency in the data. Therefore, it is even more important to architect the cluster properly - a MySQL Cluster with improper architecture will shut itself down in case of node or network failure, thereby failing to provide HA, and the 99.999% guarantee will not be reached. That is why architecture plays a big role in determining your HA mileage out of MySQL Cluster, 99.999% being the best-case scenario.


The Data Node Algorithm

MySQL Cluster, as I described earlier, has at least three layers, not counting the clients that connect to it. In the diagram below, borrowed from http://www.handybackup.net/, the "storage nodes" form the most important layer - we will call them "data nodes". The "query processing nodes" are where the MySQL daemon (mysqld) runs - which means clients connect to these nodes to run queries - we will call them "SQL nodes". The management nodes run the cluster-orchestrating processes - they distribute the cluster configuration, and arbitrate during node failures when leader election is needed:

Two very important points to note about MySQL Cluster are that data written to the data nodes is (a) auto-sharded (partitioned) and (b) replicated synchronously.

In the config.ini file on the management node (typically /var/lib/mysql-cluster/config.ini), there is a key "NoOfReplicas". The same file defines the data nodes, so the cluster knows the number of data nodes. From these two pieces of information (#replicas, #data nodes), MySQL Cluster calculates the number of node groups:

#Node Groups = #Data Nodes / #Replicas

It then arranges the data nodes into that many node groups. A node group is the group of data nodes that contain the same partitions of data, replicated {#Replicas} times. Within a given node group, MySQL Cluster designates a primary node for each partition, and one or more secondary nodes for the same partition - and it does this in such a way that each partition has a distinct primary and distinct secondary (or secondaries).
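For illustration, here is a minimal, hedged config.ini sketch (host names and paths are placeholders) for a cluster with NoOfReplicas=2 and four data nodes - which the formula above arranges into 4/2 = 2 node groups:

    [ndbd default]
    NoOfReplicas=2                   # number of synchronous copies of each partition

    [ndb_mgmd]
    HostName=mgmt1.example.com       # management node (placeholder host name)
    DataDir=/var/lib/mysql-cluster

    [ndbd]
    HostName=data1.example.com       # four [ndbd] sections = four data nodes;
    [ndbd]
    HostName=data2.example.com       # with NoOfReplicas=2, the cluster forms
    [ndbd]
    HostName=data3.example.com       # two node groups out of them
    [ndbd]
    HostName=data4.example.com

    [mysqld]
    HostName=sql1.example.com        # SQL node (mysqld) that applications connect to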

Confusing? Let us try this diagram:

In this example, #Data Nodes = 12, #Replicas = 3 (the two inputs).

Therefore, here are the computed counts:
# Node Groups = 12/3 = 4 (marked as Node Group 1, Node Group 2, Node Group 3 and Node Group 4 in the diagram, with a-l being the names of the actual data nodes)
# Partitions per table = #Data Nodes = 12
# Partitions per node group = #Replicas = 3

What that means is that a given table will be partitioned into 12 parts (based on the hash of the primary key, unless a different algorithm is provided). In the above example, the table shown is, therefore, split into 12 partitions (12 sets of rows): Partition 1, Partition 2, ... Partition 12.

Out of these 12 partitions, 3 are assigned to Node Group 1, 3 to Node Group 2, and so on. In the above example, say Partitions 1, 2 and 3 are assigned to Node Group 1. These 3 partitions are stored in Node Group 1 - but remember that Node Group 1 has 3 data nodes. Therefore, MySQL Cluster chooses one primary and two secondaries for each of Partitions 1, 2 and 3 in such a way that no single node in that group is the primary for more than one. See the table embedded in the diagram. One way to read the table is: data node a is the primary for Partition 1, hence all reads and writes for rows in this partition go first to data node a, and writes are then synchronously replicated to data nodes b and c.

This is a simplified algorithm - but one that is sufficient to explain HA in MySQL Cluster.
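None of this partitioning requires anything special from the application. As a hedged illustration (the table and column names are invented), creating a table on the NDB engine is enough to have it auto-sharded, and information_schema shows the partitions the cluster created:

    -- Any table created with the NDB Cluster engine is automatically partitioned
    -- across the data nodes by a hash of its primary key:
    CREATE TABLE orders (
      order_id    INT NOT NULL,
      customer_id INT NOT NULL,
      amount      DECIMAL(10,2),
      PRIMARY KEY (order_id)
    ) ENGINE=NDBCLUSTER;

    -- Inspect the partitions the cluster created for this table:
    SELECT PARTITION_NAME, TABLE_ROWS
    FROM information_schema.PARTITIONS
    WHERE TABLE_NAME = 'orders';

The ndb_desc utility (with its extra-partition-info option) can show the partition-to-node mapping in more detail.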

Fault Scenarios - the goal of MySQL Cluster Architecture

So what happens when a network partition occurs and the data nodes can no longer talk to each other? Or when one or more data nodes fail, severing communication with the other data nodes?

The data nodes are split into separate communities - each community containing nodes that can talk to each other. Each community follows this flow chart:

The goal of MySQL Cluster architecture is to reach one of the two green circles for all possible single-point failures. If you reach one of the red ones, you lose service, and the promise of 99.999% availability is vaporware.

At this point, I urge you, if you have not yet done so, to go through this excellent blog to understand how the architecture would impact availability if you were to do this on physical servers.


Azure-specific thoughts

 

  1. This is the recommended architecture on Azure (assuming the cluster is contained in a single data center):

    Here is why: because each node group is in its own Availability Set, at least one node from each group will always be up and running (unless the whole Azure data center goes down - we will talk about that later). Therefore, any residual community of data nodes after one or more failures will always be a viable community of data nodes (as defined in the flow chart above, a "viable community" is a group of data nodes containing at least one node from each Node Group). Therefore, as we traverse the flow chart, we come down to the second diamond. At this point, the community either continues operating or, even if it needs arbitration, it will always be able to reach a management node, as we have two of them sitting in their own availability set - so one of them will always be available. Therefore, loss of service is prevented under all conditions except loss of the entire cluster (i.e., the entire Azure data center). A sketch of this layout using the Azure CLI appears after this list.

  2. In order to protect your data from catastrophic disasters that can take an entire Azure data center down, you can employ MySQL Cluster's geo-replication feature on Azure. Geo-replication provides geographic redundancy - while the internal (synchronous) replication provides High Availability between data nodes co-located within the cluster, the asynchronous inter-region replication to a remote site guards against site failures.

  3. A very interesting option for MySQL Cluster on Azure is "multi-site clustering", where data nodes belonging to the same cluster reside in two regions. This is normally not going to fly, as replication between data nodes is synchronous, and synchronous inter-data-center writes are normally prohibitive in terms of latency. However, Azure provides ExpressRoute - which can provide up to 10 Gbps of throughput between Azure data centers (or even between your premises and the Azure cloud). A typical use case is when we need writes happening at both ends - an Active/Active configuration with no need for application-level conflict resolution, as MySQL Cluster handles conflicts as an inbuilt feature (using timestamps and other techniques). When the user base is geographically distributed and writes must be supported from both ends, this is a viable architecture. Azure ExpressRoute can make this possible, though I do not know of anyone who has actually used MySQL Cluster this way. But MySQL Cluster provides the feature, and you can give it a try on Azure!

    This option is also a viable one if the two sites are in the same geographical region (like Azure East US and East US 2) - and for some reason we want to set up replication, say, between corporate locations.

  4.  The other option for writes happening at both ends is Active/Active multi-master geo-replication across two Azure data centers. This would typically increase your compute costs compared to the previous option ("multi-site clustering"), but the previous option needs ExpressRoute, while this option may be feasible using IPsec tunnels. Again, MySQL Cluster supports full multi-master Active/Active geo-replication where both clusters are active - and there is no need for application-level conflict resolution. Replication can happen asynchronously or semi-synchronously in this case. An important aspect to remember about Azure is that traffic between Azure data centers traverses Microsoft-owned network backbones. This is key to high geo-replication performance - one of the several ways Azure makes it easier to use MySQL Cluster.

  5.  Another use case where geo-replication makes sense on Azure is reporting. Several reporting tools - including Power BI - can work against a MySQL data store and run complex queries. However, the recommended MySQL engine optimized for such a use case is InnoDB, not NDB Cluster. We can use Active/Passive asynchronous geo-replication to offload the data to an InnoDB-backed store, and point our report and dashboard generation applications at that.

  6.  For applications that read/write/update a set of records per transaction, it is recommended to build "distribution awareness" into the application - where partitioning happens on a sub-key common to all the rows affected by the transaction (e.g., a transaction or session id) - so that all those records land on the same data node. This is especially important if we are using multi-site clustering (see the partitioning sketch after this list).

  7.  Azure can provide the perfect platform for real-time applications that need fast propagation of updates across nodes, clusters or regions. Applications that need low-latency real-time updates (e.g., reflecting status updates on financial transactions), either across the nodes of a cluster or across multiple clusters, can use a MySQL Cluster feature where most or all data is stored purely in memory (RAM). This provides millisecond-level cross-node/ cross-cluster updates, but the cost incurred is durability/ RTO/ RPO. On a typical on-premises installation, a power outage taking down the whole cluster would mean losing the data still held only in RAM. MySQL Cluster provides a way to mitigate this problem – continuous/ frequent disk check-pointing and redo-logging – but that can reduce performance, as it uses CPU and disk. A commercial cloud like Azure can take some of these trade-offs away: a MySQL Cluster can easily be deployed with individual nodes distributed across availability sets, protecting the cluster from power/ rack failures. Therefore, real-time applications can use the in-memory feature with less fear of losing data – and with less need for the aggressive disk check-pointing and redo-logging that compromise performance (see the config sketch after this list).

  8. MySQL Cluster's operational agility features (online scaling by adding more slaves or nodes) are very nicely complemented by Azure's auto-scaling features.
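To make points 1, 6 and 7 above concrete, here are a few hedged sketches.

Availability sets (point 1) - a sketch using today's Azure CLI; the resource group, VM names, image and sizes are placeholders:

    # One availability set per node group; that node group's data node VMs go inside it.
    az vm availability-set create --resource-group mysql-cluster-rg --name nodegroup1-avset
    az vm create --resource-group mysql-cluster-rg --name datanode-a \
      --availability-set nodegroup1-avset --image UbuntuLTS --size Standard_D14
    az vm create --resource-group mysql-cluster-rg --name datanode-b \
      --availability-set nodegroup1-avset --image UbuntuLTS --size Standard_D14
    # Repeat with a second availability set for node group 2, and put the two
    # management nodes in their own availability set as well.

Distribution awareness (point 6) - MySQL Cluster lets you override the default partitioning key with PARTITION BY KEY, so that all rows sharing a sub-key land in the same partition (and hence the same node group). Table and column names are invented; note that the partitioning column must be part of the primary key:

    -- All rows of a session stay together, so a transaction touching one
    -- session's rows is served by a single node group:
    CREATE TABLE session_events (
      session_id BIGINT NOT NULL,
      event_id   BIGINT NOT NULL,
      payload    VARCHAR(255),
      PRIMARY KEY (session_id, event_id)
    ) ENGINE=NDBCLUSTER
    PARTITION BY KEY (session_id);

In-memory trade-offs (point 7) - the relevant knobs live in the [ndbd default] section of config.ini; the values below are purely illustrative and must be sized for your database and VMs:

    [ndbd default]
    NoOfReplicas=2
    DataMemory=48G                       # in-memory storage for table rows (illustrative)
    IndexMemory=8G                       # in-memory storage for hash indexes (illustrative)
    TimeBetweenGlobalCheckpoints=2000    # ms between redo-log flushes to disk
    TimeBetweenLocalCheckpoints=20       # controls how often local checkpoints start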

Conclusion

In my opinion, ClearDB is a good example of what can be done with MySQL on Azure. They are running a successful commercial PaaS MySQL Cluster service on Azure, which is simply fantastic. As we work with more and more partners to deploy complex clusters on Azure, I will come back and update this blog. Thanks!
