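Azure Auto Scaling -- Designing Highly Scalable Cloud Solutions

One of the defining features of a cloud solution is strong scalability. If the workload of a hosted service spikes, drops off sharply, or is only needed for a limited period, Microsoft Azure offers Auto Scaling. Auto Scaling monitors the hosted service running in the cloud against the scale-up (or scale-down) rules defined by the developer. As soon as the service meets a rule's trigger condition, Microsoft Azure starts the corresponding scaling action automatically, so that the online service keeps its quality while consuming as few cloud resources as possible.

Using Azure Cloud Service as the example, this post summarizes and analyzes common Auto Scaling designs and implementations, as a reference for Azure developers and architects. It covers the following parts:

1. Designing and implementing Auto Scaling with the Azure management portal.

2. Advanced use of Auto Scaling.

3. Auto Scaling performance tuning and key parameters.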

 

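Implementing Auto Scaling with the Azure Management Portal

This approach is very simple and needs no code at all. Sign in to the Microsoft Azure management portal at https://manage.windowsazure.com, select the target hosted service, and configure Auto Scaling on the Scale page. Auto Scaling is off by default (“SCALE BY METRIC” is set to “NONE”), but you can still change “INSTANCE COUNT” by hand to temporarily increase or decrease the number of role instances.

To scale automatically based on CPU or Queue Storage, set Scale By Metric to CPU or Queue and then configure the scaling details, as in the following reference configuration:

For more about the scaling parameters on this page, see the tooltip behind the ? icon next to each setting, or the following article and how-to guide: https://azure.microsoft.com/en-us/documentation/articles/cloud-services-how-to-scale/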

 

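Advanced Use of Auto Scaling

After setting up Auto Scaling as described above, developers often notice the following problem: when external traffic is heavy and CPU usage on every VM in the Cloud Service stays above 80%, scaling up only happens about one hour later, which may not match expectations. To understand this, we need a closer look at how Auto Scaling works and at a few key points.

 

Key points:

1. A role counts as healthy only when every VM in the role is working normally.

2. By default, the CPU statistic is the average CPU usage of all VMs in the role over the previous 45 minutes.

3. On the Azure platform, performance metrics from inside the VMs reach the Scaling module with some delay (about 15 minutes).

Given this, we can expect the following: if a CPU-based scaling plan is configured at 0:00 and the CPU usage of every VM is then quickly driven to 90%, the load has to be sustained for nearly 45 + 15 = 60 minutes before a scale-up event occurs.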
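The estimate above is plain addition, and a minimal sketch (in Python, for brevity) makes it easy to try other settings. The function name is hypothetical, and the 15-minute default latency is simply the figure observed in this post, not a value read from Azure.

```python
from datetime import timedelta

def estimated_scale_up_delay(time_window: timedelta,
                             metric_latency: timedelta = timedelta(minutes=15)) -> timedelta:
    """Rough delay before a sustained load triggers scaling:
    the evaluation window plus the metric transfer latency (assumed model)."""
    return time_window + metric_latency

# Default portal behavior discussed above: 45-minute window, ~15-minute latency.
print(estimated_scale_up_delay(timedelta(minutes=45)))  # 1:00:00
```

Passing a shorter window, for example `timedelta(minutes=15)`, shows how much of the delay is under the developer's control and how much is fixed by the metric latency.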

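To dig deeper, we also need to look closely at the configuration that Auto Scaling relies on. After setting up Auto Scaling in the portal as described above, you can find the newly added auto scaling entry under management services (the operation type is PutAutoscalingSetting).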

AutoscaleSettings:
[
  {
    "Profiles": [
      {
        "Name": "Scale.WebApi.Cpu",
        "Capacity": {
          "Minimum": "1",
          "Maximum": "2",
          "Default": "1"
        },
        "Rules": [
          {
            "MetricTrigger": {
              "Name": "Percentage CPU",
              "Namespace": "CloudService",
              "Resource": "CloudService:test1205:Staging:WebRole1",
              "TimeGrain": "PT15M",
              "Statistic": "Average",
              "TimeWindow": "PT45M",
              "TimeAggregation": "Average",
              "Operator": "GreaterThan",
              "Threshold": 80.0
            },
            "ScaleAction": {
              "Direction": "Increase",
              "Type": "ChangeCount",
              "Value": "1",
              "Cooldown": "PT10M"
            }
          },
          {
            "MetricTrigger": {
              "Name": "Percentage CPU",
              "Namespace": "CloudService",
              "Resource": "CloudService:test1205:Staging:WebRole1",
              "TimeGrain": "PT15M",
              "Statistic": "Average",
              "TimeWindow": "PT45M",
              "TimeAggregation": "Average",
              "Operator": "LessThan",
              "Threshold": 60.0
            },
            "ScaleAction": {
              "Direction": "Decrease",
              "Type": "ChangeCount",
              "Value": "1",
              "Cooldown": "PT10M"
            }
          }
        ]
      }
    ],
    "SubscriptionId": "959bf13d-17fc-4cc2-9669-594×××××2b",
    "Source": "CloudService:test1205:Staging:WebRole1",
    "Version": "1.0",
    "Enabled": true
  }
]

Resource Name: CloudService:test1205:Staging

Resource Type: Autoscale

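Most of this configuration comes from the portal settings and is easy to understand. Two important parameters, however, do not come from the portal: TimeGrain and TimeWindow. TimeGrain is the sampling interval over which Azure Auto Scaling computes an average. A value of 15 minutes means that one CPU average is produced for every 15-minute interval, for example one for 2:00~2:15 and one for 2:15~2:30. TimeWindow is the length of past behavior of the target role that Azure Auto Scaling examines. With TimeGrain = 15 and TimeWindow = 45, the role scales up if the CPU averages obtained over the past 45 minutes are all above 80%.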
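To make the two parameters concrete, here is a small simulation of that evaluation, written in Python for brevity. It is only a sketch under stated assumptions: the function names are hypothetical, the CPU readings are made up, and it assumes that "TimeAggregation": "Average" means the per-grain averages are averaged once more across the window. When every grain average is above the threshold, that combined average is above it as well.

```python
from statistics import mean

def grain_averages(samples, grain_minutes):
    """Average per-minute CPU samples into consecutive TimeGrain buckets."""
    return [mean(samples[i:i + grain_minutes])
            for i in range(0, len(samples), grain_minutes)]

def rule_fires(samples, grain_minutes, window_minutes, threshold):
    """Evaluate a GreaterThan rule over the most recent TimeWindow.
    Returns (fired, grain averages inside the window)."""
    window = samples[-window_minutes:]
    buckets = grain_averages(window, grain_minutes)
    return mean(buckets) > threshold, buckets

# 45 made-up per-minute CPU readings: 15 quiet minutes, then 30 busy ones.
cpu = [30.0] * 15 + [95.0] * 30

# Portal defaults: TimeGrain = 15, TimeWindow = 45, Threshold = 80.
print(rule_fires(cpu, grain_minutes=15, window_minutes=45, threshold=80.0))
# Shorter settings over the same readings: TimeGrain = 5, TimeWindow = 15.
print(rule_fires(cpu, grain_minutes=5, window_minutes=15, threshold=80.0))
```

With the 15/45 setting the quiet first bucket still holds the windowed average below 80%, so the rule does not fire, while the 5/15 setting looks only at the busy period and fires on the same readings.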

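Therefore, when a more responsive auto scaling plan is needed, developers have to tune the default scaling rules, and TimeGrain and TimeWindow can be configured in code. Note: the following code requires adding the Azure Management Service Library to the project via NuGet.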

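        // Namespaces this sample relies on (place at the top of the file).
        // The Microsoft.WindowsAzure.* namespaces are assumed from the management
        // library packages; check them against the version you install.
        using System;
        using System.Collections.Generic;
        using System.IO;
        using System.Security.Cryptography.X509Certificates;
        using Microsoft.WindowsAzure;
        using Microsoft.WindowsAzure.Management.Monitoring.Autoscale;
        using Microsoft.WindowsAzure.Management.Monitoring.Autoscale.Models;
        using Microsoft.WindowsAzure.Management.Monitoring.Utilities;

        static void Main(string[] args)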
        {
            // Cloud Service and Role to be auto scaled.
            var cloudServiceName = "test1205";
            var isProduction = false;
            var roleName = "WebRole1";

            // Azure Subscription ID (the management certificate is loaded below).
            string subscriptionId = "959bf13d-17fc-4cc2-9669-5&&××××b";

            // Load the management certificate from a local .pfx file (see GetCertificate).
            X509Certificate2 cert = GetCertificate();

            // Generate a resource Id for the cloud service and role.
            var resourceId = AutoscaleResourceIdBuilder.BuildCloudServiceResourceId(cloudServiceName, roleName, isProduction);

            // Create the autoscale client.
            AutoscaleClient autoscaleClient = new AutoscaleClient(new CertificateCloudCredentials(subscriptionId, cert));

            //Create an Autoscale Profile
            AutoscaleSettingCreateOrUpdateParameters createParams = new AutoscaleSettingCreateOrUpdateParameters()
            {
                Setting = new AutoscaleSetting()
                {
                    Enabled = true,
                    Profiles = new List<AutoscaleProfile>()
                    {
                        new AutoscaleProfile()
                        {
                            Name = "Scale.WebApi.Cpu",
                            Capacity = new ScaleCapacity()
                            {
                                Default = "1",
                                Minimum = "1",
                                Maximum = "2"
                            },
                            Rules = new List<ScaleRule>()
                        }
                    }
                }
            };

            //Create a ScaleRule to Scale Up
            var cpuScaleUpRule = new ScaleRule()
            {
                // Define the MetricTrigger Properties
                MetricTrigger = new MetricTrigger()
                {
                    MetricName = "Percentage CPU",
                    MetricNamespace = "",
                    MetricSource = AutoscaleMetricSourceBuilder.BuildCloudServiceMetricSource(cloudServiceName, roleName, isProduction),
                    TimeGrain = TimeSpan.FromMinutes(5),
                    TimeWindow = TimeSpan.FromMinutes(15),
                    TimeAggregation = TimeAggregationType.Average,
                    Statistic = MetricStatisticType.Average,
                    Operator = ComparisonOperationType.GreaterThan,
                    Threshold = 60.0
                },
                // Define the ScaleAction Properties
                ScaleAction = new ScaleAction()
                {
                    Direction = ScaleDirection.Increase,
                    Type = ScaleType.ChangeCount,
                    Value = "1",
                    Cooldown = TimeSpan.FromMinutes(10)
                }
            };

            //Create a ScaleRule to Scale Down
            var cpuScaleDownRule = new ScaleRule()
            {
                // Define the MetricTrigger Properties
                MetricTrigger = new MetricTrigger()
                {
                    MetricName = "Percentage CPU",
                    MetricNamespace = "",
                    MetricSource = AutoscaleMetricSourceBuilder.BuildCloudServiceMetricSource(cloudServiceName, roleName, isProduction),
                    TimeGrain = TimeSpan.FromMinutes(5),
                    TimeWindow = TimeSpan.FromMinutes(15),
                    TimeAggregation = TimeAggregationType.Average,
                    Statistic = MetricStatisticType.Average,
                    Operator = ComparisonOperationType.LessThan,
                    Threshold = 25.0
                },
                // Define the ScaleAction Properties
                ScaleAction = new ScaleAction()
                {
                    Direction = ScaleDirection.Decrease,
                    Type = ScaleType.ChangeCount,
                    Value = "1",
                    Cooldown = TimeSpan.FromMinutes(10)
                }
            };

            // Add the rules to the profile
            createParams.Setting.Profiles[0].Rules.Add(cpuScaleUpRule);
            createParams.Setting.Profiles[0].Rules.Add(cpuScaleDownRule);

            // Apply the settings in Azure to this resource
            autoscaleClient.Settings.CreateOrUpdate(resourceId, createParams);

            // Read the settings back to check what is currently applied.
            // Note: ToString() may print only the type name; inspect the
            // response object's properties to see the actual values.
            AutoscaleSettingGetResponse setting = autoscaleClient.Settings.Get(resourceId);
            Console.WriteLine("existing settings: " + setting.ToString());

        }

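        // Loads the management certificate from a local .pfx file.
        // The path and password below are placeholders; replace them with your own.
        public static X509Certificate2 GetCertificate()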
        {
            var certByteData = File.ReadAllBytes(@"E:\cie-jianw-managecert-pw1.pfx");
            X509Certificate2 certificate = new X509Certificate2(certByteData, "password");
            return certificate;
        }

After running the code above, you can confirm under management service in the portal that the custom configuration has taken effect:

AutoscaleSettings:
[
  {
    "Profiles": [
      {
        "Name": "Scale.WebApi.Cpu",
        "Capacity": {
          "Minimum": "1",
          "Maximum": "2",
          "Default": "1"
        },
        "Rules": [
          {
            "MetricTrigger": {
              "Name": "Percentage CPU",
              "Namespace": "CloudService",
              "Resource": "CloudService:test1205:Staging:WebRole1",
              "TimeGrain": "PT5M",
              "Statistic": "Average",
              "TimeWindow": "PT15M",
              "TimeAggregation": "Average",
              "Operator": "GreaterThan",
              "Threshold": 60.0
            },
            "ScaleAction": {
              "Direction": "Increase",
              "Type": "ChangeCount",
              "Value": "1",
              "Cooldown": "PT10M"
            }
          },
          {
            "MetricTrigger": {
              "Name": "Percentage CPU",
              "Namespace": "CloudService",
              "Resource": "CloudService:test1205:Staging:WebRole1",
              "TimeGrain": "PT5M",
              "Statistic": "Average",
              "TimeWindow": "PT15M",
              "TimeAggregation": "Average",
              "Operator": "LessThan",
              "Threshold": 25.0
            },
            "ScaleAction": {
              "Direction": "Decrease",
              "Type": "ChangeCount",
              "Value": "1",
              "Cooldown": "PT10M"
            }
          }
        ]
      }
    ],
    "SubscriptionId": "959bf13d-17fc-4cc2-9669-59××××ac2b",
    "Source": "CloudService:test1205:Staging:WebRole1",
    "Version": "1.0",
    "Enabled": true
  }
]

Resource Name: CloudService:test1205:Staging

Resource Type: Autoscale

With this configuration, scaling up occurs after roughly 15 minutes (TimeWindow) + 15 minutes (back-end performance data latency) = 30 minutes.

 

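Auto Scaling Performance Tuning and Key Parameters

For the custom scaling configuration applied in code above, the run-time behavior is shown in the following screenshot:

The Monitor data in Azure lags real time by almost 15 minutes.

Because it takes some time for the Azure back end to deliver performance samples from the target VMs to the Scaling module (about 15~25 minutes in repeated tests), a relatively safe recommended value for TimeWindow is 25 minutes. If TimeWindow = 15 while the data latency is 20 minutes, Auto Scaling cannot obtain the most recent data (computing the average fails), which degrades its behavior. The smallest TimeGrain is 5 minutes; for CPU-based scaling you can use this minimum to get the fastest response.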
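These recommendations can be written down as a small pre-flight check to run before applying a configuration. This is a sketch only, in Python for brevity: `check_autoscale_timing` is a hypothetical helper, and the 5-minute minimum and the latency figures come from the observations in this post rather than from any Azure API.

```python
from datetime import timedelta

MIN_TIME_GRAIN = timedelta(minutes=5)  # smallest TimeGrain, per this post

def check_autoscale_timing(time_grain, time_window, expected_latency):
    """Return a list of warnings for a TimeGrain/TimeWindow pair."""
    warnings = []
    if time_grain < MIN_TIME_GRAIN:
        warnings.append("TimeGrain is below the 5-minute minimum")
    if time_window < expected_latency:
        warnings.append("TimeWindow is shorter than the metric latency; "
                        "the most recent samples may be missing")
    return warnings

# TimeWindow = 15 with a 20-minute latency: the risky case described above.
risky = check_autoscale_timing(timedelta(minutes=5), timedelta(minutes=15),
                               timedelta(minutes=20))
# TimeWindow = 25 with the same latency: the recommended setting.
safe = check_autoscale_timing(timedelta(minutes=5), timedelta(minutes=25),
                              timedelta(minutes=20))
print(risky)
print(safe)
```

The first call reports one warning and the second reports none, which matches the reasoning for choosing 25 minutes as the safer TimeWindow.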

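Based on the discussion above and on real-world scenarios, a common scalability design looks like this:

In addition, if a project requires scaling to respond in real time, for example scaling up immediately once CPU exceeds 80%, developers can consider combining the custom WAD analysis recommended earlier https://blogs.msdn.com/b/jianwu/archive/2014/09/23/windows-azure-storage-wad.aspx with the REST operations https://msdn.microsoft.com/en-us/library/azure/ee460809.aspx to build a more advanced scaling solution.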