Monte Carlo Method to simulate Data Center Consolidation

A server consolidation project represents a difficult decision problem. The challenge is to provide the optimal amount of resources to host your existing workload. If you design your new environment too small, your users will blame you for bad performance. If you size it too large, you waste a significant part of your IT budget. Most CIOs tend to go for the latter and oversize their new system.

There is a third option: You can use a Monte Carlo Simulation to estimate the consolidated resource consumption.

It's important to highlight that the following method can't tell you your future resource demand, and you might not be able to change the size of your environment later. For better flexibility, you should consider cloud computing.

But if you have no other option, or you run hybrid scenarios, here is a method to estimate your resource demand:

Let's say you want to consolidate 18 servers onto a single new machine. How many CPU cores are required to keep performance steady?

Step 1) For each server in your current environment measure the CPU utilization and plot a discrete density function. In other words: count how often you need 1 core, 2 cores, 3 cores and so forth. Draw a histogram to get a feeling for the current workload.

Here is an example:

On the x-axis you have the number of logical cores as histogram buckets. On the y-axis you count the number of samples in each bucket.

(Hint: If this is a SQL Server consolidation you can use the performance collector warehouse to sample CPU utilization.)
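As a rough illustration of Step 1, here is a minimal Python sketch that turns utilization samples into a discrete density over "logical cores in use". The function name, the sample values and the conversion from percent to cores (rounding up) are assumptions made for this example; the article itself only says to count how often 1 core, 2 cores, 3 cores and so forth are needed.

```python
import math
from collections import Counter


def cores_used_density(utilization_pct, logical_cores):
    """Turn CPU utilization samples (percent of the whole machine) into a
    discrete density: cores in use -> share of samples."""
    # Assumption: round up, so 30% of an 8-core box counts as 3 busy cores.
    cores = [math.ceil(u / 100.0 * logical_cores) for u in utilization_pct]
    counts = Counter(cores)
    total = len(cores)
    return {k: counts[k] / total for k in sorted(counts)}


# Hypothetical samples for one server with 8 logical cores.
samples = [5, 10, 12, 30, 30, 55, 80, 12, 10, 5]
density = cores_used_density(samples, logical_cores=8)

for cores, share in density.items():
    # Crude text histogram: one '#' per 10% of the samples.
    print(f"{cores:2d} cores  {'#' * round(share * 10)}  {share:.2f}")
```

The keys of `density` are the histogram buckets on the x-axis, and the values are the sample counts on the y-axis divided by the total number of samples.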

Step 2) Sum up the densities to a cumulative probability distribution for each instance:

You can use SQL window functions for that. The result looks like this:

Select * from CPUCumulativeMaterialized
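The running sum behind Step 2 can be sketched in a few lines of Python. This is only an illustration of the idea, not the query that fills CPUCumulativeMaterialized; the density values are hypothetical, and each resulting pair corresponds to the CoresUsed and CumulativeProb columns for one instance.

```python
from itertools import accumulate


def cumulative_distribution(density):
    """Turn {cores_used: probability} into a sorted list of
    (cores_used, cumulative_probability) pairs."""
    cores = sorted(density)
    cumulative = list(accumulate(density[c] for c in cores))
    # Pin the last value to 1.0 so floating-point drift cannot leave a gap.
    cumulative[-1] = 1.0
    return list(zip(cores, cumulative))


# Hypothetical density for one instance.
density = {1: 0.6, 3: 0.2, 5: 0.1, 7: 0.1}
cdf = cumulative_distribution(density)

for cores_used, cumulative_prob in cdf:
    print(cores_used, round(cumulative_prob, 3))
```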

Step 3) Create a function that allows you to draw a random CPU utilization from an instance distribution:

Create Function getSimulatedLogicalCPUUtilization(@instance varchar(255), @r float) returns int

AS begin

declare @coresused int

Select @coresused = c1.CoresUsed from CPUCumulativeMaterialized c1 left join CPUCumulativeMaterialized c2 on c1.instance_name = c2.instance_name and c1.CoresUsed = c2.CoresUsed + 1 where c1.instance_name = @instance and @r between Coalesce(c2.CumulativeProb,0) and c1.CumulativeProb

RETURN @coresused

end
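The technique behind this function is inverse transform sampling: take a uniform random number r between 0 and 1 and return the first bucket whose cumulative probability reaches r. Below is a minimal Python sketch of that lookup; the function name and the example distribution are hypothetical.

```python
import bisect


def draw_cores_used(cdf, r):
    """Return the smallest cores-used bucket whose cumulative probability
    is >= r, where cdf is a sorted list of (cores_used, cumulative_prob)."""
    cumulative = [p for _, p in cdf]
    i = bisect.bisect_left(cumulative, r)
    # Guard against r slightly above the last cumulative value.
    i = min(i, len(cdf) - 1)
    return cdf[i][0]


# Hypothetical cumulative distribution for one instance.
cdf = [(1, 0.6), (3, 0.8), (5, 0.9), (7, 1.0)]

print(draw_cores_used(cdf, 0.05))  # falls into the first bucket
print(draw_cores_used(cdf, 0.70))  # falls between 0.6 and 0.8
```

One difference worth noting: the binary search works on whatever buckets exist, whereas the self-join on CoresUsed + 1 appears to assume that every core count up to the maximum has a row in the table.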

Step 4) Call the function a million times for each instance:

declare @simulation int= 0

while @simulation < 1000000

begin

Insert Into MonteCarloSimulation Select @simulation,[Column 0],COALESCE(dbo.getSimulatedLogicalCPUUtilization(UPPER([Column 0]),RAND(CHECKSUM(NEWID())) ),0) from LogicalCores

set @simulation += 1

end

Step 5) Sum up the CPU utilization per simulation:

Select SimulationID,SUM(CPUCount) as 'CPUCount' from MonteCarloSimulation group by SimulationID order by SimulationID
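Steps 4 and 5 together can be sketched as follows in Python: for every simulation run, draw one value per instance and add the draws up. The instance names, distributions and the reduced number of runs are hypothetical. As in the SQL loop, each instance is drawn independently, so any correlation between servers (for example, all of them being busy at the same time) is not modelled.

```python
import bisect
import random


def simulate_total_cores(cdfs, n_simulations, seed=None):
    """Return one summed cores-used value per simulation run.

    cdfs maps an instance name to a sorted list of
    (cores_used, cumulative_prob) pairs.
    """
    rng = random.Random(seed)
    prepared = [
        ([p for _, p in cdf], [c for c, _ in cdf]) for cdf in cdfs.values()
    ]
    totals = []
    for _ in range(n_simulations):
        total = 0
        for cumulative, cores in prepared:
            # Independent uniform draw per instance, then inverse lookup.
            i = bisect.bisect_left(cumulative, rng.random())
            total += cores[min(i, len(cores) - 1)]
        totals.append(total)
    return totals


# Two hypothetical instances; the article consolidates 18.
cdfs = {
    "SRV01": [(1, 0.6), (3, 0.8), (5, 0.9), (7, 1.0)],
    "SRV02": [(2, 0.5), (4, 1.0)],
}
totals = simulate_total_cores(cdfs, n_simulations=10_000, seed=42)
print(min(totals), max(totals))
```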

Step 6) Create a new histogram to visualize the predicted consolidated CPU utilization:

Interpret the result:

The diagram tells you that if you measure the CPU utilization on the new hypothetical system a hundred times, you will find that a 16-core machine (= 32 logical processors) would cover your need in 98 percent of the cases. There are outliers where your system might require more than 55 logical processors, but that's very unlikely.
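A statement such as "N logical processors cover the need in X percent of the cases" can also be read directly from the simulated totals, without any distribution assumption. The following Python sketch shows one way to do that; the function names and the ten totals are hypothetical and far fewer than a real run would produce.

```python
def coverage(totals, cores):
    """Share of simulation runs whose summed demand fits into `cores`."""
    return sum(1 for t in totals if t <= cores) / len(totals)


def cores_for_coverage(totals, target):
    """Smallest simulated total that covers at least `target` of the runs."""
    for candidate in sorted(set(totals)):
        if coverage(totals, candidate) >= target:
            return candidate
    return max(totals)


# Hypothetical per-simulation totals (logical cores).
totals = [10, 12, 15, 15, 20, 31, 18, 14, 16, 33]

print(coverage(totals, 32))             # share of runs that fit into 32 cores
print(cores_for_coverage(totals, 0.9))  # cores needed to cover 90% of runs
```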

It seems that the combined CPU demand follows a normal distribution, so you can calculate the expected value (mu) and the standard deviation (sigma). Convert each candidate number of logical cores into a Z value, look up the cumulative probability, and present the following table to the decision maker:

Logical Cores    Probability of being able to handle the workload without slowdown
18               0.412161
32               0.987781
40               0.999755
48               0.999999
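For reference, a table of this kind can be computed with a normal approximation: with z = (cores - mu) / sigma, the probability is the standard normal cumulative distribution evaluated at z. The Python sketch below uses the standard library's `statistics.NormalDist`; the totals are hypothetical, so the output is not expected to reproduce the numbers above.

```python
from statistics import NormalDist, fmean, pstdev


def normal_coverage_table(totals, candidates):
    """Fit a normal distribution to the simulated totals and return
    (cores, P(demand <= cores)) for each candidate core count."""
    mu = fmean(totals)
    sigma = pstdev(totals, mu)  # population standard deviation
    dist = NormalDist(mu, sigma)
    return [(cores, dist.cdf(cores)) for cores in candidates]


# Hypothetical per-simulation totals (logical cores).
totals = [14, 17, 19, 20, 21, 22, 24, 25, 27, 31]

for cores, prob in normal_coverage_table(totals, [18, 32, 40, 48]):
    print(f"{cores:3d}  {prob:.6f}")
```

Comparing these probabilities with the empirical share of simulation runs that fit into each core count is a quick check of whether the normal assumption is reasonable for your data.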

The numbers represent the worst-case scenario. There are many more factors to consider. If you cluster your solution, your workload will be divided by the number of machines. The way we added the logical cores is pretty conservative, as it assumes that nothing will be shared. (It would require some experiments to figure out how much CPU would be left when you add two systems together.) Another aspect is that the new cores might not have the same clock frequency. In addition, it might make sense to run a separate simulation for business hours only: nightly batch and maintenance jobs drive resource utilization at times when it might be acceptable for things to take longer.

The model also assumes that other factors such as the disk subsystem, network and memory remain constant.