Throughput and Latency Considerations: Errata

Well, hats off to Ian Griffiths, who pointed out that I had screwed up my math in my previous posting. When I went back to double-check I found that I had made two fatal mistakes.

Now I went back and used a better technique, which I present below, but -- wince -- I haven't had it reviewed by anyone yet. So in the interest of getting something more correct I'm posting this now, but I reserve the right to have screwed this up again, in which case I'll post yet another update. Because I think it's more interesting, I'm leaving the original posting intact and just posting this errata.

The idea was to show that the queue length tends to magnify any improvements you make in straight CPU costs because, of course, if there are many requests in the queue ahead of yours you have to wait for all of them before your turn comes, and each one of them is getting the improvement.

Here's the example of a round-trip to a web server, restated with better math for the queue length as I explain below.

Cost Summary
Network latency from the client's machine              100ms
Time waiting in the queue on the front end             35ms
Request parsing/preparation (computation)              2ms
Network latency to back end services                   1ms
Back end processing (computation)                      1ms
Back end processing (disk)                             7ms
Network latency for the response from the back end     1ms
Processing of results on the front end (computation)   3ms
Network latency back to the user                       100ms
Total Latency                                          250ms
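
If you like to double check this kind of arithmetic in code, here's a quick Python sketch that just adds up those rows (the labels are my own shorthand):

```python
# Sanity check on the cost summary above: the individual steps
# should add up to the 250ms total.
costs_ms = [
    ("network latency from the client's machine", 100),
    ("wait in the queue on the front end",         35),
    ("request parsing/preparation (CPU)",           2),
    ("network latency to back end services",        1),
    ("back end processing (CPU)",                   1),
    ("back end processing (disk)",                  7),
    ("network latency for the back end response",   1),
    ("processing of results on the front end",      3),
    ("network latency back to the user",           100),
]
print(f"total latency: {sum(ms for _, ms in costs_ms)}ms")   # 250ms
```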

Now here's the part where I botched it. I should have gone with the regular queuing theory formulas but noooo... I tried to do simpler math on the quick (drat). I would have noticed my formulas were totally wrong except I also made a typo entering one of the time delays (double drat), and so it worked out almost like it was supposed to, but for all the wrong reasons.

So, here, I think, is how it's *really* supposed to work:

Using 3 threads this service could, in principle, run flat out at 200 requests per second (3 threads each tied up for 15ms per request works out to 3/0.015 = 200 requests per second). That will be the "service rate" (it's usually represented by the Greek letter mu). However, if we tried to run it at that speed, we'd end up getting further and further behind because sometimes there would be random bursts of arrivals and we'd never have any surplus capacity to recover from them. So let's suppose we're willing to run at 90% utilization at worst, and we're planning for that 90% case. That means we'll allocate enough servers so that at most 180 requests per second arrive at each server. That's the so-called arrival rate (usually represented by the Greek letter lambda).

All righty.

The classic formulae say that the average number of items in the system is lambda/(mu - lambda), which in this case is 180/(200-180), or 9. The average time in the system is given by 1/(mu - lambda), which in this case is 1/(200-180) seconds, or 50ms. Since it takes 15ms to process one item, we can split that 50ms into 35ms of wait time and 15ms of processing on average. The wait isn't an exact multiple of the processing time (15ms) because the queuing model allows for spurts and lulls.
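
Here's that arithmetic as a little Python sketch, assuming the same numbers as above (3 threads, 15ms of processing per request, a 90% utilization cap); the variable names are mine, and the wait/processing split at the end follows the same reasoning as the paragraph above:

```python
# The classic queuing formulas applied to the numbers in this posting.
threads = 3
processing_s = 0.015                    # 15ms of total processing per request

mu = threads / processing_s             # service rate: 200 requests/sec
lam = 0.90 * mu                         # arrival rate at 90% utilization: 180 requests/sec

in_system = lam / (mu - lam)            # average requests in the system: 9
time_in_system = 1 / (mu - lam)         # average time in the system: 50ms
wait = time_in_system - processing_s    # the split used above: 35ms of waiting

print(f"{in_system:.0f} requests in the system, "
      f"{time_in_system * 1000:.0f}ms in the system, {wait * 1000:.0f}ms waiting")
```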

I'll make a slightly different change than I did in the initial posting but basically I'm showing the same effect. And I hope I've got the formulas right this time :)

Item                                Cost
Total Processing time               (2+1+1+7+1+3) = 15ms
Work time                           (2+3) = 5ms
Threads                             3
Max Requests/second                 200
Max Utilization                     90%
Allowed Requests/sec                180
Requests in flight (all threads)    180/(200-180) = 9
Average wait time                   35ms
Average time in system              1/(200-180) s = 50ms
Network latency to client           200ms
Total latency                       250ms

So now let's go ahead and do the same sort of experiment as we did in the last posting. To avoid hitting a situation where there are not enough requests on average to feed the threads we have, I'll make a smaller perturbation (things get funny if you start having more threads than work and I don't want to complicate things with that case at this point). This time we're just going to shave half a millisecond off the compute time -- an even smaller improvement than in the original example. In the table below I'm showing two alternatives for how we could exploit that gain.

Item                                Original   Improved 1   Improved 2
Total Processing time               15ms       14.5ms       14.5ms
Work time                           5ms        4.5ms        4.5ms
Threads                             3          3.2          3.2
Max Requests/second                 200        222.2        222.2
Max Utilization                     90%        90%          81%
Allowed Requests/sec                180        200          180
Requests in flight (all threads)    9          9            4.3
Average wait time                   35ms       30.5ms       9.18ms
Average time in system              50ms       45ms         23.68ms
Network latency to client           200ms      200ms        200ms
Total latency                       250ms      245ms        223.68ms
Cost Savings                        -          10%          0%
Observed Speedup                    -          2%           12%
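
For anyone who wants to play with these numbers, here's a small Python sketch that regenerates the three columns from the same formulas; the column() helper and its parameters are just my own framing of the table, under the assumption that the work (CPU) time is what limits the maximum request rate:

```python
# Regenerate the Original / Improved 1 / Improved 2 columns.
def column(work_s, other_s, utilization=None, allowed=None):
    total = work_s + other_s              # total processing time per request
    mu = 1 / work_s                       # max requests/sec (limited by the CPU work)
    if allowed is None:
        allowed = utilization * mu        # run at the chosen utilization
    else:
        utilization = allowed / mu        # or back the utilization out of the rate
    in_flight = allowed / (mu - allowed)  # average requests in the system
    in_system = 1 / (mu - allowed)        # average time in the system (seconds)
    wait = in_system - total              # the queuing portion of that time
    latency = 0.200 + in_system           # add the 200ms round trip to the client
    return utilization, in_flight, wait * 1000, in_system * 1000, latency * 1000

cases = [("Original",   dict(work_s=0.005,  other_s=0.010, utilization=0.90)),
         ("Improved 1", dict(work_s=0.0045, other_s=0.010, utilization=0.90)),
         ("Improved 2", dict(work_s=0.0045, other_s=0.010, allowed=180))]

for name, args in cases:
    util, in_flight, wait_ms, in_system_ms, latency_ms = column(**args)
    print(f"{name}: {util:.0%} utilization, {in_flight:.1f} in flight, "
          f"{wait_ms:.2f}ms wait, {in_system_ms:.2f}ms in system, {latency_ms:.2f}ms total")
```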

In the Improved 1 column we can see that if we keep utilization constant at 90% then our maximum throughput goes up to 222.2 req/sec. At 90% utilization the allowed throughput goes from 180 req/sec to 200 req/sec -- that translates to a 10% cost savings (we only need 9 machines for every 10 we used to have).

Alternatively, in the Improved 2 column, if we want to keep the allowed requests per second constant per server (serving the same load), we can run the server a little cooler. We still get the allowed 180 requests per second, now at 81% utilization, and of course running the server cooler with a faster processing time means that the queue length goes down. In fact the average number of requests in the system falls to 4.3 and the average wait time drops to 9.18ms. Adding back the 200ms of latency to the client, users still observe a 12% speedup! Not bad for a measly half-millisecond improvement in compute time. Alternatively (not shown) we could have heated up the server a little bit, to 91%, and gotten the same observed latency as before with an extra 1% of cost savings beyond Improved 1.
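
And just to double check that last alternative, here's the same arithmetic for the "heat it up to 91%" case (again a sketch, with my own variable names and the same assumptions as before):

```python
# The "heat it up a little" variant: with 4.5ms of work time the server
# tops out at ~222.2 requests/sec, so at 91% utilization it can take
# ~202.2 requests/sec while the time in the system comes back to 50ms --
# the same 250ms the user saw originally -- and we need about 11% fewer
# machines than the original 180 requests/sec plan.
work_s = 0.0045
mu = 1 / work_s                     # ~222.2 requests/sec maximum
allowed = 0.91 * mu                 # ~202.2 requests/sec at 91% utilization
in_system = 1 / (mu - allowed)      # 0.050s, same as the original
machines = 180 / allowed            # fraction of the original fleet needed
print(f"latency {(0.200 + in_system) * 1000:.0f}ms, "
      f"{1 - machines:.0%} fewer machines")
```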

So Improved 2 shows another factor that was absent from the first analysis: if you can run at lower utilization your average queue length stays lower.

The irony is that I didn't really care that much about the specifics of the formulae; I was just trying to show the effect, but gosh darn it if I didn't trip over my own shoelaces a lot here.