Service Fabric Under the Hood: The Cluster Resource Manager (Part 1)


This post was authored by Matt Snider, a Program Manager on the Azure Service Fabric team.

This is the inaugural post in a series where we’ll go deep on the capabilities of Azure Service Fabric, one area at a time. In this first post we’ll talk about the Cluster Resource Manager.

The Resource Manager is a core component of Azure Service Fabric and is responsible for managing a lot of the complexity in your cluster. Fundamentally it is the Resource Manager (along with the Failover Manager, another system service we’ll talk about in another series of posts) that handles most of the day-to-day events in the cluster. The job of the Resource Manager is to respond to things like new nodes joining or leaving and new services getting created, and to help move services during upgrades so we can provide those zero-downtime guarantees. The Resource Manager also ensures that the cluster as a whole stays optimized in order to provide the best environment for your services to run in.

It enforces the rules you set on each service (we call these constraints), and lets you determine how hot or cold nodes should get and how to react when they do. In this series of blog posts we’re going to walk through each of the major features of the Resource Manager and show you how to configure it. But for this first post, let’s start by talking about why we built it and a little bit about how it works.
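To give a quick taste of what a constraint is before we dig into the why and the how, here is a tiny sketch, written in Python purely for illustration (it is not the Service Fabric API), of what it means to enforce a rule like “this service may only run on a certain kind of node.” The node names and properties are invented.

```python
# Conceptual sketch only: not the Service Fabric API, just an illustration of
# what "enforcing a constraint" means. Node properties and the rule are made up.

def eligible_nodes(nodes, required_properties):
    """Return the nodes whose properties satisfy every required key/value."""
    return [
        node for node in nodes
        if all(node["properties"].get(key) == value
               for key, value in required_properties.items())
    ]

nodes = [
    {"name": "N1", "properties": {"NodeType": "FrontEnd", "HasSSD": "true"}},
    {"name": "N2", "properties": {"NodeType": "BackEnd",  "HasSSD": "true"}},
    {"name": "N3", "properties": {"NodeType": "FrontEnd", "HasSSD": "false"}},
]

# A rule like "this service may only run on FrontEnd nodes that have SSDs"
constraint = {"NodeType": "FrontEnd", "HasSSD": "true"}
print([n["name"] for n in eligible_nodes(nodes, constraint)])  # ['N1']
```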

 

Setting the Stage – A Brief Introduction to a Complex World

Traditionally, managing IT systems or a set of services meant dedicating a few machines to those specific services or systems. Many major services were broken down into a “web” tier and a “data” or “storage” tier, maybe with a few other specialized components like a cache. Other types of applications would have a messaging tier where requests flowed in and out, connected to a work tier for any analysis or transformation necessary as part of the messaging. Each part got a specific machine or a couple of machines dedicated to it: the database got a couple of machines, the web servers a few. If a particular type of workload caused its machines to run too hot, you added more machines configured to run that workload, or replaced a few of them with larger machines. Easy. If a machine failed, that part of the overall application ran at lower capacity until the machine could be restored. Still fairly easy (if not necessarily fun).

Now, however, let’s say you’ve found a need to scale out and have taken the container or microservice plunge. Suddenly you find yourself with hundreds or even thousands of machines, dozens of different types of services, and perhaps hundreds of different named instances of those services, each with one or more instances or replicas for High Availability (HA).

Suddenly managing your environment is not as simple as managing a few machines dedicated to single types of workloads. Your servers are virtual and no longer have names (you have switched mindsets from pets to cattle, after all), and machines aren’t dedicated to single types of workloads. Configuration is less about the machines and more about the services themselves.

As a consequence of this decoupling and breaking your formerly monolithic tiered app into separate services, you now have many more combinations to deal with. Who decides what types of workloads can run on specific hardware, or next to each other? When a machine goes down… what was even running there? Who is in charge of making sure it starts running again? Do you wait for the (virtual) machine to come back or do your workloads keep running? Is human intervention required to get things running again? And what about upgrades in this sort of environment? We’re going to need some help, and you get the sense that a hiring binge and trying to paper over the complexity with people is not the right answer.

What to do?

 

Introducing Orchestrators

An “Orchestrator” is the general term for a piece of software that helps administrators manage these types of deployments. Orchestrators are the components that take in requests like “I would like 5 copies of this service running in my environment”, make that true, and then try to keep it that way.

Orchestrators (not humans) are what swing into action when a machine fails or a workload terminates for some unexpected reason. Most Orchestrators do more than just deal with failure (they also help with new deployments, handle upgrades, and manage resource consumption), but all of them are fundamentally about maintaining some desired state of configuration in the environment. You want to be able to tell an Orchestrator what you want and have it do the heavy lifting. Mesosphere (via components like Chronos, Aurora, or Marathon), CoreOS’s Fleet, Docker Swarm, Kubernetes, and Service Fabric are all Orchestrators or have them built in. More are being created all the time as the complexities of managing real-world deployments in different types of environments and conditions grow and change.
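To make that idea concrete, here is a toy sketch of a desired-state loop, again in Python and purely for illustration. It is not how any of the Orchestrators above are actually implemented; it just shows the shape of “tell me what you want and I’ll keep it that way.” The names and counts are made up.

```python
# A toy "desired state" loop: declare a target, then converge on it.
# Purely illustrative; not any real Orchestrator's implementation.

def reconcile(desired_count, running_instances, start, stop):
    """Bring the number of running instances to the desired count."""
    while len(running_instances) < desired_count:
        running_instances.append(start())
    while len(running_instances) > desired_count:
        stop(running_instances.pop())
    return running_instances

# "I would like 5 copies of this service running in my environment."
instances = ["copy-1", "copy-2"]               # only two survived a failure
counter = iter(range(3, 1000))
instances = reconcile(
    desired_count=5,
    running_instances=instances,
    start=lambda: f"copy-{next(counter)}",     # stand-in for starting a copy
    stop=lambda name: print(f"stopping {name}"),
)
print(instances)  # ['copy-1', 'copy-2', 'copy-3', 'copy-4', 'copy-5']
```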

 

Orchestration as a Service

The job of the Orchestrator within a Service Fabric cluster is handled primarily by the Resource Manager. The Service Fabric Cluster Resource Manager is one of the System Services within Service Fabric and is automatically started up within each cluster.  Generally, the Resource Manager’s job is broken down into three parts:

1) Enforcing Rules

2) Optimizing Your Environment

3) Assisting in Other Processes

Before we look at all of the capabilities that the Resource Manager has, let’s take a brief look at how it works.

 

The Azure Service Fabric Cluster Resource Manager

General Architecture and Implementation

Before we get too far in describing the Resource Manager and all of its features, let’s first talk a bit about what it really is and how it works.

What It Isn’t

In traditional N-tier web apps there was always some notion of a “Load Balancer”, usually referred to as a Network Load Balancer (NLB) or an Application Load Balancer (ALB) depending on where it sat in the networking stack. Some load balancers are hardware-based, like F5’s BigIP offering; others are software-based, such as Microsoft’s NLB. In these architectures, the job of load balancing is to make sure that all of the different stateless front-end machines or the different machines in the cluster receive (roughly) the same amount of work. Strategies for this varied, from sending each call to a different server, to session pinning/stickiness, to actual estimation and call allocation based on each call’s expected cost and the current machine load.

Note that this was at best the mechanism for ensuring that the web tier remained roughly balanced. Strategies for balancing the data tier were completely different and depended on the data storage mechanism, usually centering around data sharding, caching, database-managed views, stored procedures, and so on.

While some of these strategies are interesting, the Service Fabric Resource Manager is not anything like a network load balancer or a cache. While an NLB ensures that the front ends are balanced by moving traffic to where the services are running, the Service Fabric Resource Manager takes a completely different tack: fundamentally, Service Fabric moves services to where there is room for them, or to where they make sense based on other conditions.

For example, Service Fabric may move a set of services from nodes which are busy to nodes which are currently underutilized. Service Fabric could also move services away from a node which is about to be upgraded or which is overloaded due to a spike in consumption of resources by the services which were running on it. Because the Service Fabric Resource Manager is responsible for moving services around (not delivering network traffic to where services already are), it is more versatile and also contains additional capabilities for controlling where and how services are moved. We’ll talk about these capabilities shortly.

Architecture

In order to perform these functions, the Resource Manager must have several pieces of information. It has to know which services currently exist and the amount of resources each service is currently consuming. It has to know the actual capacity of the nodes in the cluster, and thus the amount of resources available both in the cluster as a whole and remaining on any particular node. It also has to deal with the fact that the resource consumption of a given service can change over time, and that services, in reality, usually care about more than one resource. Across many different services there may be both “real” resources like memory and disk consumption that are common to many different types of services, and resources that only a particular service cares about, for example “MyCurrentOutstandingRequests”. While memory on the node is real and common, it is unlikely that “MyCurrentOutstandingRequests” is anything like “YourCurrentOutstandingRequests” in terms of the resources that get consumed handling one of them on a given machine.
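As a rough sketch of the bookkeeping this implies, here is an illustration in Python. The data shapes and numbers are invented, and this is not the actual Service Fabric load-reporting API; “MemoryInMB” stands in for a “real” resource, and “MyCurrentOutstandingRequests” for a custom metric only one service cares about.

```python
# Illustrative data shapes only, not the Service Fabric reporting API.
from collections import defaultdict

# Load that each service reports for the metrics it cares about, per node.
reported_load = {
    ("ServiceA", "N1"): {"MemoryInMB": 512, "MyCurrentOutstandingRequests": 40},
    ("ServiceB", "N1"): {"MemoryInMB": 256},
    ("ServiceC", "N2"): {"MemoryInMB": 1024, "DiskInMB": 2048},
}

# What each node can hold, per metric (capacities are made-up numbers).
node_capacity = {
    "N1": {"MemoryInMB": 2048},
    "N2": {"MemoryInMB": 4096, "DiskInMB": 10240},
}

def remaining_capacity(node, reported_load, node_capacity):
    """Capacity left on a node, for each metric the node defines a capacity for."""
    used = defaultdict(int)
    for (service, placed_on), metrics in reported_load.items():
        if placed_on == node:
            for metric, value in metrics.items():
                used[metric] += value
    return {m: cap - used[m] for m, cap in node_capacity[node].items()}

print(remaining_capacity("N1", reported_load, node_capacity))
# {'MemoryInMB': 1280}
```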

Further complicating things is the fact that the owners and operators of the cluster are occasionally different from the service authors, or at a minimum are the same people wearing different hats: when developing your service you know a few things about what it requires in terms of resources and how its different components should ideally be deployed, but as the person handling a live-site incident for that service in production you have a different job to do and require different tools. In addition, neither the cluster nor the services themselves are statically configured: the number of nodes in the cluster can grow and shrink, nodes of different sizes can come and go, and services can change their resource allocation and be created and removed. Upgrades or other management operations can roll through the cluster, and of course things can fail at any time.

Our Resource Manager will have to know many things about the overall cluster itself, as well as the requirements of particular services. To accomplish this, Service Fabric has both agents of the Resource Manager that run on individual nodes in order to aggregate local resource consumption information, and a centralized, fault-tolerant Resource Manager service that aggregates all of the information about the services and the cluster and reacts to changes based on the desired state configuration of the cluster and services. The fault tolerance is achieved via exactly the same mechanism that we use for your services, namely replication of the service’s state to some number of replicas (in our larger internal clusters, this is usually 7).

Figure 1: General Resource Manager Functions

Let’s take a look at this diagram (Fig. 1) as an example. During runtime there are a whole bunch of changes that could happen: the amount of resources services consume changes, some services fail, nodes join and leave the cluster, and so on. All the changes on a specific node are aggregated and periodically sent to the central Resource Manager service (1, 2), where they are aggregated again, analyzed, and stored. Every few seconds that central service looks at all of the changes and determines whether any actions are necessary (3). For example, it could notice that nodes have been added to the cluster and are empty, and decide to move some services to those nodes. It could also notice that a particular node is overloaded, or that certain services have failed (or been deleted), freeing up resources on other nodes.
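As a very rough illustration of step (3), here is a sketch of a periodic decision pass over the aggregated per-node picture. The loads and the threshold are made up, and the real Resource Manager considers far more than a single utilization number per node.

```python
# Sketch of the periodic "does anything need to happen?" pass.
# All numbers and thresholds are hypothetical.

aggregated = {          # per-node load as a fraction of capacity
    "N1": 0.55,
    "N2": 0.70,
    "N3": 0.00,         # newly joined, still empty
    "N4": 0.20,
    "N5": 0.95,         # running hot
}

OVERLOADED = 0.90       # hypothetical "too hot" threshold

def find_actions(aggregated):
    actions = []
    for node, load in aggregated.items():
        if load >= OVERLOADED:
            actions.append(("relieve", node))
        elif load == 0.0:
            actions.append(("consider-filling", node))
    return actions

print(find_actions(aggregated))
# [('consider-filling', 'N3'), ('relieve', 'N5')]
```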

Now let’s take a look at the next diagram (Fig. 2) and see what happens when the Resource Manager determines that changes are necessary. It coordinates with other system services (in particular the Failover Manager) to make those changes, and the change requests are then sent to the appropriate nodes (4). In this case, presume that the Resource Manager noticed that Node 5 was overloaded and so decided to move service B from N5 to N4. At the end of the reconfiguration (5), the cluster looks like this.

Figure 2: The Resource Manager reconfigures the cluster
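Here is a minimal sketch of the decision behind Figure 2, again with invented numbers, just to show the shape of picking a target node with headroom. The real placement logic weighs much more than this (the constraints mentioned earlier, in-flight upgrades, the cost of moving a service, and so on).

```python
# Toy version of the Figure 2 move: N5 is hot, so pick the coolest other node
# as the target for one of its services. Numbers and names are invented.

node_load = {"N1": 0.55, "N2": 0.70, "N3": 0.35, "N4": 0.20, "N5": 0.95}
services_on = {"N5": ["A", "B"]}

def plan_move(source, node_load, services_on):
    """Pick a service on the hot node and the coolest other node as its target."""
    target = min((n for n in node_load if n != source), key=node_load.get)
    service = services_on[source][-1]      # naive choice: last service listed
    return service, source, target

service, source, target = plan_move("N5", node_load, services_on)
print(f"move service {service} from {source} to {target}")
# move service B from N5 to N4
```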

Wrapping It Up

In this post we took a brief look at why the Service Fabric Resource Manager exists, explained how it is different from other “balancing” solutions you may have heard of or used before, and took a peek under the hood to see how its components are organized. In the next post we’ll talk more about the challenges that a complex orchestration component faces and some of the solutions we’ve come up with, and we’ll start digging into the features of the Resource Manager. This is just the beginning, so stay tuned!

Comments (2)

  1. Gary says:

    Hi Matt,

    It seems that Service Fabric takes care of fault tolerance and load sharing of services running on various nodes. They all communicate through "fabric:/myApplication/MyService", nameless "cattle" as you put it. However, consumers outside the cluster (say, a mobile app calling a Web API in the cluster) need a machine name/IP address (a pet). How does Service Fabric provide fault tolerance and load balancing in that scenario without the client needing to track which machine is hosting the Web API service? Thanks.

  2. Matt Snider says:

    Hi Gary,

    The answer to your question is twofold. The first part is that the name you refer to above, "fabric:/x/y", is really just a pointer to the real current address of the service as it is running somewhere in the cluster. There's a different service in the cluster called Naming that takes care of maintaining that mapping, and clients typically ask Naming for the real address corresponding to a given name. We call this "resolution": clients can talk to Naming and get the real address, in terms of IP address and port, for a running service, and then connect to the service over whatever transport or communication mechanism they get back (e.g. someazurecdnsname.eastus.cloudapp.net/index.html might be returned when you resolve fabric:/MyApp/WebService). The second part is that if you did this, you'd get passed through the Azure Load Balancer. Alternatively, we also see people writing gateways that live within their cluster (so they're HA) which accept any traffic and do the resolution for the clients instead. You can find more information starting on this page in our documentation: azure.microsoft.com/…/service-fabric-connect-and-communicate-with-services

    HTH

    Thanks -Matt
