- Part 1: The Need for Orchestration and the Role of the Cluster Resource Manager
- Part 2: How the Cluster Resource Manager Works
Last time we talked about why Orchestration is an important problem to solve when you are running services, and gave an overview of the architecture of the Service Fabric Cluster Resource Manager.
In this post, I’ll cover more about how the Cluster Resource Manager works, how it’s implemented, and what jobs it really has in the cluster. But first, we have to deal with a hard Computer Science fact, which is that…
NP-Hard is Hard
…or at least incredibly complicated. Looking at the jobs of the Orchestrator from Part 1 and thinking about what it actually has to do, it’s clear that we have a classic optimization problem on our hands: something akin to the famous Knapsack Problem, where we have to figure out how much of something can fit inside a particular container. Really, it’s much more like a combination of the multiple-knapsack and multiple-dimension variants, but we’re getting ahead of ourselves…
How does the Service Fabric Cluster Resource Manager arrange all of the different services and ensure that the overall cluster is utilized efficiently? How does it ensure that in organizing the cluster it doesn’t accidentally put some service or workload at a disadvantage? If you care about resources like the amount of disk space or memory, how does the Service Fabric Cluster Resource Manager know how to arrange the cluster without overloading individual machines?
Conceptually, we are in one of the valleys shown in Figure 1 and we have to “climb the hill” to find the best arrangement for everything in the cluster. We’ll say for now that a better allocated cluster has a higher score than a less well allocated cluster. For example, an arrangement with all of the services stacked on one node while a bunch of other nodes are empty would correspond to a deep blue trough, while a better arrangement with everything distributed would be a red peak.
Note: If you’re versed in topology (or even just remember trigonometry) you’ll notice that the graphs below are combinations of simple trigonometric functions that I generated with js-surface-plot over at jsfiddle. They are presented for illustration only, as an aid to thinking in multiple dimensions.
Figure 1: A nice, simple environment - get to the top of the hill
But it’s actually worse than just a regular search problem: suppose we find the optimal arrangement for a particular metric, like CPU. That sounds good, but in reality services have many different resources they care about (CPU, Memory, Disk Space, “OutstandingRequests”, etc.), so there are really many different topologies to consider and many different goals to be met simultaneously. In production clusters at Microsoft, we’ve seen teams seek to manage tens if not hundreds of different resources in their cluster. With 100 different goals, how do we even know if we’ve got a good answer?
Figure 2: There's a "best" aggregate answer in here somewhere... maybe.
Sure, we could just “overlay” these three solution spaces, pick the highest point, and say we’re done. But it’s rarely that simple – with multiple types of resources being used by multiple services, we usually find that some resources are more important than others, so some of these peaks and valleys matter more than others. To some workloads a particular resource might not matter at all, or worse, could be detrimental! In addition to the different resources, there are many rules to deal with: certain workloads require placement on particular nodes, and may want to be near (or not near) other workloads. This takes our multidimensional search space and invalidates whole chunks of it.
Figure 3: Good luck hill climbing!
The above diagrams are very simple. They don’t show multiple metrics, they don’t show metrics with different priorities, and they don’t show what happens when rules and metric values change. In practice we’ve found that the search space is both huge (hundreds of different metrics and values, services with conflicting requirements, etc.) and occasionally quite constrained (services that can only run on specific nodes, services that have to run on the same nodes as other services, etc.). Worse, things change frequently: which resources are important, how many resources a particular workload is actually consuming, what the workload’s constraints are – all of these change during the runtime of the service. This means that decisions we make now may not be valid in the future, so we are always better off making small improvements now (if we can) and re-evaluating later. It also means that any static map of the environment is likely to be inefficient compared to what we could obtain if we keep trying.
Figure 4: A highly constrained cluster
To summarize, the Service Fabric Resource Manager needs to:
- Handle many combinations of resources
- Deal with real resources (like CPU) or conceptual ones (like “MyOutstandingRequests”)
- Always find solutions (i.e., don’t get stuck in a valley)
- Find solutions even if the environment is highly constrained
- Satisfy strict rules (like placing services on specific machines) as well as softer optimization goals
- Not eat up all of the resources in the cluster trying to find a perfect solution or trying every possible combination; there are just too many.
- Handle a highly dynamic environment (nodes failing and joining, services being created or deleted, load consumption changing frequently or never, rules changing, etc.).
- Ensure that we don’t inadvertently make things worse
We need a shortcut.
Simulated Annealing as an Approach
In Service Fabric, we reduce the problem of searching through the N-dimensional (metric) search space to several runs of a Simulated Annealing algorithm. A run of the search inside Service Fabric’s Resource Manager proceeds something like this:
1) Evaluate the state of the world
The Resource Manager reads any recorded changes since the last run and tabulates them. For example, it notes if there are nodes that have entered or left the cluster, services that were created or deleted, changes in load, etc.
2) If the cluster is imbalanced, start a timer that we’ll use to quit the search after a while
3) Then, until that timer runs out:
a) Generate a random, valid move
b) Score the resulting arrangement of the cluster
c) If the score is a sufficient improvement, accept the move, otherwise reject it
4) When the timer runs out, compare the existing score in the cluster to the score of the new result
5) If the new arrangement is better than the current state, propose that solution
At this point the other components of Service Fabric swing into motion in order to make the existing arrangement in the cluster match the proposal. This could mean creating new replicas, shutting down others, swapping primaries and secondaries, etc.
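Put together, a single run of steps 1–5 looks roughly like the following sketch. The function names, the step budget standing in for the timer, and the cooling schedule are all illustrative assumptions, not the CRM’s actual internals:

```python
import math
import random

def anneal(state, neighbor, score, steps, initial_temp=1.0, cooling=0.95):
    """One run of the search loop described above. `steps` stands in for
    the wall-clock timer; `neighbor` generates a random move (or None if
    the move it rolled was invalid); lower scores are better."""
    current, current_score = state, score(state)
    temp = initial_temp
    for _ in range(steps):
        candidate = neighbor(current)          # step 3a: random, valid move
        if candidate is None:
            continue                           # invalid roll; try again
        cand_score = score(candidate)          # step 3b: score the result
        delta = cand_score - current_score
        # Step 3c: always keep improvements; keep regressions with a
        # probability that shrinks as the temperature cools.
        if delta <= 0 or random.random() < math.exp(-delta / temp):
            current, current_score = candidate, cand_score
        temp *= cooling
    return current, current_score
```

Having `neighbor` return `None` for an invalid roll mirrors the rejection of invalid moves described below: the run simply rolls the dice again.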
Let’s go through these stages with an example. Let’s say your application is composed of two services: a stateful service with 3 partitions and a target replica set size of 3, and a stateless service instance with a target instance count of 3. This layout is shown in Figure 5.
Figure 5: An example application
Imagine you just added a new machine (Node6) to scale out. When we look inside the cluster, we see this:
Figure 6: An imbalanced cluster
Clearly this isn’t the best possible layout – some nodes have way more services than others and Node6 is completely empty! Within seconds, the Cluster Resource Manager notices this too and starts going through the steps outlined above. Let’s take them one at a time:
1) “Evaluate the state of the world”
The Cluster Resource Manager notes the current state of the cluster (the positions of all the services), and figures out any changes since the last time it ran (in this case, the addition of Node6).
2) “If the cluster is imbalanced…”
An imbalanced cluster is one where the load is higher on some nodes than on others. We’ll talk more later about how to influence this, but by default “load” is simply the count of replicas and instances. So we’d say that the “Load” on Node3 is: “PrimaryCount:1, ReplicaCount:2, TotalCount:3”. By contrast, Node6 has zero load, whereas Node4 has “PrimaryCount:0, ReplicaCount:3, TotalCount:3”. Since the cluster is clearly unbalanced (Node4 has more secondaries than it should and Node6 is empty), we start the timer and begin the race to find a better arrangement.
Of course, it is not always so obvious. As the number of metrics and nodes increases it becomes very hard to eyeball a cluster and determine if things could be better. Thankfully, the Cluster Resource Manager can do the tally for us.
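The tally is mechanical enough to sketch. The placement encoding and the imbalance threshold below are assumptions for illustration, not the CRM’s actual representation:

```python
from collections import Counter

def node_loads(placements):
    """placements: list of (node, role) pairs, where role is
    "Primary", "Secondary", or "Instance". Returns per-node counts
    shaped like the default metrics described above."""
    loads = {}
    for node, role in placements:
        counts = loads.setdefault(node, Counter())
        if role == "Primary":
            counts["PrimaryCount"] += 1
        elif role == "Secondary":
            counts["ReplicaCount"] += 1
        counts["TotalCount"] += 1
    return loads

def is_imbalanced(loads, nodes):
    """Crude illustrative check: the cluster is imbalanced when total
    counts differ by more than one replica between any two nodes."""
    totals = [loads.get(n, Counter())["TotalCount"] for n in nodes]
    return max(totals) - min(totals) > 1
```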
3) “Then, until that timer runs out…”
For the sake of discussion, let’s say we have 5 seconds and that each proposal and evaluation of a move takes one second whether we end up using it or not. In reality it’s much faster, of course, but let’s see where this gets us…
a) “Generate a random, valid move”
Just like in Simulated Annealing, this move has to be a “neighbor” of the current state, so we can’t generate “moves” like “Create a Node!” because creating nodes isn’t something the Cluster Resource Manager can do. Similarly, we can’t say “Move the Primary to a different node with no replicas on it” because in order to get a Primary replica you first need a Secondary replica. We might be able to generate “Move Secondary” and “Swap Primary” as two consecutive moves, but we can’t generate a single move that captures both, because that cluster state isn’t adjacent to where we are right now.
Ok, let’s get started. We “roll the dice” and generate a move which is:
“Move Partition3:Secondary1 from Node2 to Node5”
While this change is adjacent to our current state, it is unfortunately also invalid since it would put two replicas from the same partition on the same node, a violation of one of our default rules. Note that the idea of a “valid” move is extensible – you can add more rules to be enforced.
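The duplicate-replica rule this move violates is easy to check mechanically. A hypothetical sketch, with a placement map invented for illustration:

```python
def is_valid_move(placement, partition, replica, target_node):
    """placement maps (partition, replica) -> node. A move is invalid if
    the target node already hosts another replica of the same partition."""
    for (p, r), node in placement.items():
        if p == partition and r != replica and node == target_node:
            return False
    return True
```

Additional rules (node constraints, affinity, and so on) would slot in as further checks in the same function, which is the extensibility point the text mentions.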
Since this move was rejected, we roll the dice again and get:
“Swap Partition2:Primary (Node3) and Partition2:Secondary2 (Node2)”
This is a valid move, so let’s try incorporating it into the proposal. This gives us a new proposed solution from which we’ll generate all further moves.
Figure 7: Proposed cluster
b) “Score the resulting arrangement of the cluster”
We now take the current proposal (the accumulation of all accepted moves plus the most recent one), and score it. The scoring function is tunable but in our simplified example, we’ll define it as the average standard deviation of all metrics in the cluster, and say that the lower the score is, the better. Figure 8 shows the two cluster layouts and their respective scores:
Figure 8: Cluster scoring
There are other factors that go into a score, such as overall cost of each change, but we’ll ignore those for now. Looking at these two scores, we see that they’re the same, which is fine! Since the move is random, it could do almost anything (so long as it is valid). We’ll see what to do with the scores next.
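The example’s scoring function – the average standard deviation of each metric’s per-node loads, with lower being better – is nearly a one-liner. The dictionary shape here is an assumption for illustration:

```python
import statistics

def cluster_score(loads_by_metric):
    """loads_by_metric: metric name -> list of per-node loads.
    Returns the average (population) standard deviation across metrics;
    a perfectly even cluster scores 0.0, and lower is better."""
    return statistics.mean(
        statistics.pstdev(node_loads)
        for node_loads in loads_by_metric.values()
    )
```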
c) “If the score is a sufficient improvement, accept the move, otherwise reject it”
For each change, we compare the score before and after a given proposed change and then decide if we want to keep it. What determines the threshold for when we accept it or not? Put differently, what is a “sufficient” improvement? It turns out, the acceptance is based mostly on the notion of “Temperature” in Simulated Annealing.
Recall that this is just the second proposed change we’ve generated, and only the first that’s been valid in the overall run. At the beginning of a run we accept just about any valid change, even if it doesn’t result in an improved score (and even if it temporarily makes things worse). Later, the bar for changes goes up, and each change has to significantly improve the score for us to keep it. Thus, changes made early in the search are effectively random, while those at the end are basically greedy improvements. The random steps help us jump over the invalid gaps and avoid getting stuck in local optima, while the hill-climbing steps at the end ensure that we’re actually choosing changes that improve the layout of the cluster. At this point, for our second change, we accept the movement since we’re still fairly early in our search, so the bar for score improvements is still low.
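This behavior is the classic temperature-driven acceptance rule from Simulated Annealing: at high temperature nearly any change survives, and as the temperature cools only genuine improvements do. A sketch of just that rule (the CRM’s actual acceptance logic is more involved):

```python
import math
import random

def accept(delta, temperature):
    """delta = new_score - old_score, where lower scores are better.
    Improvements are always kept; regressions are kept with a
    probability that decays as the temperature cools."""
    if delta <= 0:
        return True
    return random.random() < math.exp(-delta / temperature)
```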
4) “When the timer runs out, compare the existing score in the cluster to the score of the new result”
We do this for a few reasons. The first is to ensure that the result we came up with is actually better than what is currently in the cluster. The second is because (if we have time) we perform multiple runs of these steps and keep multiple possible end solution candidates. We can compare several of these candidates to see which is best vs. the initial cluster state, and which offers the best improvement relative to the cost of the changes. Let’s extend the example by generating a few more proposed changes. Assume that these three changes were accepted before we ran out of time:
“Move Partition2 Secondary1 from Node4 to Node6”
“Move Partition3 Secondary1 from Node2 to Node6”
“Move Stateless Instance3 from Node5 to Node1”
Figure 9 shows the final proposed cluster layout and the resulting score.
Figure 9: Final cluster proposal
5) “If the new arrangement is better than the current state, propose that solution”
The score of the final proposal is better than the initial cluster state, so the changes are sent to the Failover Manager (or FM, another system service) for execution. The proposal is simple relative to the rest of the steps, since it is just the total difference between the current cluster state and the solution. So if we came up with a valid change and then reversed it later, the FM sees no changes; whereas if a replica is moved from N1 to N2, and then from N2 to N3, the FM sees only one step: moving the replica from N1 to N3.
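Collapsing a chain of accepted moves into the net difference the FM sees amounts to comparing start and end placements; the replica-id encoding here is invented for illustration:

```python
def net_moves(start, end):
    """start/end map replica id -> node. Only replicas whose final node
    differs from their starting node produce a movement for the FM;
    a move that was later reversed produces nothing."""
    return {
        replica: (start[replica], node)
        for replica, node in end.items()
        if start.get(replica) != node
    }
```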
This design ticks all the boxes. It lets us:
- Explore an extremely large search space with many different goals combined in the simple notion of a “score” and whether a move is “valid”.
- Temporarily skip over invalid areas of the search space, and still find answers that are as good as or better than where we started in most cases
- Ignore bad proposals and bad solutions while running frequently enough to make small improvements regularly. Things get better when it is possible and worth it to do so, and at worst they stay the same.
Wrapping It Up
In this post, we reviewed a few computer science concepts and then peeked into the brain of the Cluster Resource Manager to see how it uses Simulated Annealing to solve very complicated problems without consuming significant resources. The Cluster Resource Manager is an important part of Service Fabric, essential to running your services smoothly.