Throughput and Latency Considerations: Errata

07/25/2006

Well, hats off to Ian Griffiths, who pointed out that I had screwed up the math in my previous posting. When I went back to double-check, I found that I had made two fatal mistakes.

Now I went back and used a better technique, which I present below, but -- wince -- I haven't had this reviewed by anyone yet. So in the interest of getting something more correct I'm posting this now, but I reserve the right to have screwed this up again, in which case I'll post yet another update. Because I think it's more interesting, I'm leaving the original posting intact and just posting this errata.

The idea was to show that the queue length tends to magnify any improvements you make in straight CPU costs, because if there are many people in the queue you have to wait for many requests before your turn comes, and each one of those is getting the improvement.

Here's the example of a round-trip to a web server, restated with better math for the queue length as I explain below.

Cost Summary
Network latency from the client's machine	100ms
Time waiting in the queue on the front end	35ms
Request parsing/preparation (computation)	2ms
Network latency to back end services	1ms
Back end processing (computation)	1ms
Back end processing (disk)	7ms
Network latency for the response from the back end	1ms
Processing of results on the front end (computation)	3ms
Network latency back to the user	100ms
Total Latency	250ms

Now here's the part where I botched it. I should have gone with the regular queuing theory formulas but noooo... I tried to do simpler math on the quick (drat). I would have noticed my formulas were totally wrong except I also made a typo entering one of the time delays (double drat) and so it worked out almost like it's supposed to, but for all the wrong reasons.

So, here, I think, is how it's *really* supposed to work:

Using 3 threads this service could, in principle, run flat out at 200 requests per second. That will be the "service rate" (usually represented by the Greek letter mu). However, if we tried to run it at that speed, we'd end up getting further and further behind, because sometimes there would be random bursts of arrivals and we'd never have any surplus capacity to recover from them. So let's suppose we're willing to run at 90% at worst and we're planning for that 90% case. That means we'll allocate enough servers so that at most 180 requests arrive per second at each server. That's the so-called arrival rate (usually represented by the Greek letter lambda).
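Here is that capacity arithmetic as a small Python sketch. The function names are my own, and I'm assuming the 200 requests per second figure comes from 3 threads each spending 15ms per request; treat it as an illustration rather than a capacity planning tool.

```python
def service_rate(threads: float, processing_ms: float) -> float:
    """Requests per second if every thread is busy all the time (mu)."""
    return threads * 1000.0 / processing_ms


def arrival_rate(mu: float, utilization: float) -> float:
    """Requests per second we plan to admit at a given utilization (lambda)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1) or the queue never drains")
    return mu * utilization


# Assumption: 3 threads, 15ms of total processing per request, 90% target.
mu = service_rate(threads=3, processing_ms=15.0)
lam = arrival_rate(mu, utilization=0.90)

print(f"service rate mu     = {mu:.1f} req/sec")
print(f"arrival rate lambda = {lam:.1f} req/sec")
```

The check on utilization is there because at 100% or more there is no surplus capacity left to recover from bursts.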

All righty.

The classic formulae are as follows. The average number of items in the system is lambda/(mu - lambda), which in this case is 180/(200-180) or 9. The average service time (time in the system, waiting plus processing) is given by 1/(mu - lambda), which in this case is 1/(200-180) seconds or 50ms. Since it takes 15ms to process one item, we consider that 35ms of wait time and 15ms of processing on average. It isn't an exact multiple of the processing time (15ms) because the queuing model allows for spurts and lulls.
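The two formulae are easy to check in code. This is a minimal sketch with function names of my own choosing; the last line also confirms that the two numbers agree with Little's law (items in system = arrival rate x time in system), which is a handy sanity check on this kind of arithmetic.

```python
def items_in_system(lam: float, mu: float) -> float:
    """Average number of requests in the system: lambda / (mu - lambda)."""
    if lam >= mu:
        raise ValueError("arrival rate must be below the service rate")
    return lam / (mu - lam)


def time_in_system_ms(lam: float, mu: float) -> float:
    """Average time in the system, in milliseconds: 1 / (mu - lambda)."""
    if lam >= mu:
        raise ValueError("arrival rate must be below the service rate")
    return 1000.0 / (mu - lam)


LAM, MU = 180.0, 200.0      # requests per second, from the text
PROCESSING_MS = 15.0        # total processing time per request

in_flight = items_in_system(LAM, MU)
total_ms = time_in_system_ms(LAM, MU)
wait_ms = total_ms - PROCESSING_MS

print(f"requests in flight: {in_flight:.1f}")
print(f"time in system:     {total_ms:.1f}ms")
print(f"of which waiting:   {wait_ms:.1f}ms")

# Little's law cross-check: L = lambda * W (W converted back to seconds).
littles_law = LAM * total_ms / 1000.0
print(f"Little's law gives: {littles_law:.1f}")
```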

I'll make a slightly different change than I did in the initial posting but basically I'm showing the same effect. And I hope I've got the formulas right this time :)

Item	Cost
Total Processing time	(2+1+1+7+1+3) = 15ms
Work time	(2+3) = 5ms
Threads	3
Max Requests/second	200
Max Utilization	90%
Allowed Requests/sec	180
Requests in flight (all threads)	180/(200-180) = 9
Average wait time	35ms
Average service time	1/(200-180) = 50ms
Network latency to client	200ms
Total latency	250ms

So now let's go ahead and do the same sort of experiment as we did in the last posting. To avoid hitting a situation where there are not enough requests on average to feed the threads we have, I'll make a smaller perturbation (things get funny if you start having more threads than work and I don't want to complicate things with that case at this point). This time we're just going to shave half a millisecond off the compute time -- an even smaller improvement than in the original example. In the table below I'm showing two alternatives for how we could exploit that gain.

Item	Original	Improved 1	Improved 2
Total Processing time	15ms	14.5ms	14.5ms
Work time	5ms	4.5ms	4.5ms
Threads	3	3.2	3.2
Max Requests/second	200	222.2	222.2
Max Utilization	90%	90%	81%
Allowed Requests/sec	180	200	180
Requests in flight (all threads)	9	9	4.3
Average wait time	35ms	30.5ms	9.18ms
Average service time	50ms	45ms	23.68ms
Network latency to client	200ms	200ms	200ms
Total latency	250ms	245ms	223.68ms
Cost Savings	-	10%	0%
Observed Speedup	-	2%	12%

In the Improved 1 column we can see that if we keep utilization constant at 90% then our maximum throughput goes up to 222.2 req/sec. At 90% utilization the apparent throughput goes from 180 req/sec to 200 req/sec -- that translates to a 10% cost savings (we only need 9 machines for every 10 we used to have).
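The Improved 1 numbers can be reproduced with a short sketch. One assumption to flag loudly: I'm reading the table's "Max Requests/second" as 1000ms divided by the work time (5ms gives 200, 4.5ms gives 222.2), since that is what matches the figures shown. The names below are mine.

```python
def max_rate(work_ms: float) -> float:
    """Assumed model: max requests/sec = 1000ms / CPU work time per request."""
    return 1000.0 / work_ms


def time_in_system_ms(lam: float, mu: float) -> float:
    """Average time in the system in milliseconds: 1 / (mu - lambda)."""
    return 1000.0 / (mu - lam)


UTILIZATION = 0.90
NETWORK_MS = 200.0          # round trip to the client, from the cost summary

mu_old, mu_new = max_rate(5.0), max_rate(4.5)
lam_old, lam_new = UTILIZATION * mu_old, UTILIZATION * mu_new

# Fraction of the original machines needed to carry the same total load.
machines_ratio = lam_old / lam_new

latency_old = NETWORK_MS + time_in_system_ms(lam_old, mu_old)
latency_new = NETWORK_MS + time_in_system_ms(lam_new, mu_new)

print(f"allowed req/sec: {lam_old:.1f} -> {lam_new:.1f}")
print(f"machines needed: {machines_ratio:.0%} of the original")
print(f"total latency:   {latency_old:.1f}ms -> {latency_new:.1f}ms")
```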

Alternatively, in the Improved 2 column, if we want to keep the allowed requests per second per server constant (serving the same load), we could run the server a little cooler. We get the same allowable 180 requests per second at 81% utilization, and running the server cooler with a faster processing time means that the queue length goes down. In fact the average number of requests in the system falls to 4.3 and the average wait time goes to 9.18ms. Adding back the 200ms of latency to the client, users still observe a 12% speedup! Not bad for a measly half a millisecond improvement in processing time. Alternatively (not shown) we could have heated up the server a little bit, to 91%, and got the same observed latency as the original with an extra 1% cost savings beyond Improved 1.
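Here is a sketch of the Improved 2 arithmetic, plus the "not shown" 91% case. As before, the names are mine and I'm assuming the maximum rate is 1000ms divided by the 4.5ms work time. The speedup is computed as old latency divided by new latency, which is the reading that gives roughly 12%.

```python
MU = 1000.0 / 4.5           # assumed max req/sec after the improvement
PROCESSING_MS = 14.5        # total processing time after the improvement
NETWORK_MS = 200.0          # round trip to the client
ORIGINAL_LATENCY_MS = 250.0


def time_in_system_ms(lam: float, mu: float) -> float:
    """Average time in the system in milliseconds: 1 / (mu - lambda)."""
    return 1000.0 / (mu - lam)


# Improved 2: keep the load at 180 req/sec and let utilization drop.
lam_cool = 180.0
utilization = lam_cool / MU
in_flight = lam_cool / (MU - lam_cool)
system_ms = time_in_system_ms(lam_cool, MU)
wait_ms = system_ms - PROCESSING_MS
latency_ms = NETWORK_MS + system_ms
speedup = ORIGINAL_LATENCY_MS / latency_ms

print(f"utilization:        {utilization:.0%}")
print(f"requests in flight: {in_flight:.1f}")
print(f"wait time:          {wait_ms:.2f}ms")
print(f"total latency:      {latency_ms:.2f}ms")
print(f"observed speedup:   {speedup - 1:.0%}")

# The 91% alternative: more load per server, same latency as the original.
lam_hot = 0.91 * MU
latency_hot_ms = NETWORK_MS + time_in_system_ms(lam_hot, MU)
print(f"at 91%: {lam_hot:.1f} req/sec, total latency {latency_hot_ms:.1f}ms")
```

The 91% case works out because mu - lambda comes back to 20 requests per second, the same gap as in the original configuration.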

So Improved 2 shows another factor that was absent from the first analysis: if you can run at lower utilization your average queue length stays lower.

The irony is I didn't really care that much about the specifics of the formulae; I was just trying to show the effect but gosh darn it if I didn't trip over my own shoelaces a lot here.
