Well hats off to Ian Griffiths who pointed out that I had screwed up my math in my previous posting. When I went back to double check I found that I had made two fatal mistakes.

Now I went back and used a better technique which I present below but — wince — I haven’t had this reviewed by anyone yet. So in the interest of getting something more correct I’m posting this now but I reserve the right to have screwed this up again in which case I’ll post yet another update. Because I think it’s more interesting I’m leaving the original posting intact and just posting this errata.

The idea was to show that the queue length tends to magnify any improvements you make in straight CPU costs because of course if there many people in the queue you have to wait for many requests before your turn comes and each one is getting the improvement.

Here’s the example of a round-trip to a web server, restated with better math for the queue length as I explain below.

Cost Summary | |

Network latency from the client’s machine | 100ms |

Time waiting in the queue on the front end | 35ms |

Request parsing/preparation (computation) | 2ms |

Network latency to back end services | 1ms |

Back end processing (computation) | 1ms |

Back end processing (disk) | 7ms |

Network latency for the response from the back end | 1ms |

Processing of results on the front end (computation) | 3ms |

Network latency back to the user | 100ms |

Total Latency |
250ms |

Now here’s the part where I botched it. I should have gone with the regular queuing theory formulas but noooo… I tried to do simpler math on the quick (drat). I would have noticed my formulas were totally wrong except I also made a typo entering one of the time delays (double drat) and so it worked out almost like it’s supposed to, but for all the wrong reasons.

So, here, I think, is how it’s *really* supposed to work:

Using 3 threads this service could, in principle, run flat out at 200 requests per second. That will be the “service rate” (it’s usually represented by the greek letter mu.) However if we tried to run it at that speed, we’d end up getting further and further behind because sometimes there would be random bursts of arrivals and we’d never have any surplus capacity to recover from them. So let’s suppose we’re willing to run at 90% at worst and we’re doing a plan for that 90% case. That means we’ll allocate enough servers so that at most 180 requests arrive per server. That’s so-called the arrival rate, (usually represented by the greek letter lambda).

All righty.

The classic formulae are that the average number of items in the system are: lamba/(mu – lambda) which in this case is 180/(200-180) or 9. The average service time is given by the formula 1/(mu-lambda) in this case that’s 1/(200-180) or 50ms. Since it takes 15ms to process one item we consider that 35ms of wait time and 15ms of processing on average. It isn’t an exactly multiple of the processing time (15ms) because the queuing model is allowing for spurts and lulls.

I’ll make a slightly different change than I did in the initial posting but basically I’m showing the same effect. And I hope I’ve got the formulas right this time ðŸ™‚

Item | Cost |

Total Processing time | (2+1+1+7+1+3) = 15ms |

Work time | (2+3) = 5ms |

Threads | 3 |

Max Requests/second | 200 |

Max Utilization | 90% |

Allowed Requests/sec | 180 |

Requests in flight (all threads) | 180/(200-180) = 9 |

Average wait time | 35ms |

Average service time | 1/(200-180) = 50ms |

Network latency to client | 200ms |

Total latency | 250ms |

So now lets go ahead and do the same sort of experiment as we did in the last posting. To avoid hitting a situation where there are not enough requests on average to feed the threads we have I’ll make a smaller perturbation (things get funny if you start having more threads than work and I don’t want to complicate things with that case at this point). This time we’re just going to shave half a millisecond off the compute time — an even smaller improvement than in the original example. In the table below I’m showing two alternatives for how we could exploit that gain.

Item |
Original |
Improved 1 |
Improved 2 |

Total Processing time | 15ms | 14.5ms | 14.5ms |

Work time | 5ms | 4.5ms | 4.5ms |

Threads | 3 | 3.2 | 3.2 |

Max Requests/second | 200 | 222.2 | 222.2 |

Max Utilization | 90% | 90% | 81% |

Allowed Requests/sec | 180 | 200 | 180 |

Requests in flight (all threads) | 9 | 9 | 4.3 |

Average wait time | 35ms | 30.5ms | 9.18ms |

Average service time | 50ms | 45ms | 23.68ms |

Network latency to client | 200ms | 200ms | 200ms |

Total latency |
250ms |
245ms |
223.68ms |

Cost Savings | – |
10% | 0% |

Observed Speedup | – |
2% | 12% |

In the **Improved 1** column we can see that if we keep utilization constant at 90% then our maximum throughput goes up to 222.2 req/sec. At 90% utilization the apparent throughput goes from 180 req/sec to 200 req/sec — that translates to a 10% cost savings (we only need 9 machines for every 10 we used to have).

Alternatively, in the **Improved 2** column, if we want to keep the allowed requests per second constant per server — serving the same load we could run the server a little cooler now. We get the same allowable 180 requests per second at 81% now but of course running the server cooler with faster processing time means that the *queue length goes down*. In fact the average number of requests in the system falls to 4.3 and the average wait time goes to 9.18ms. Adding back the 200ms of latency to the client users still observe a 12% speedup! Not bad for a measly half a millisecond throughput improvement. Alternatively (not shown) we could have heated up the server a little bit, to 91% and got the same observed latency with an extra 1% cost savings beyond **Improved 1**.

So **Improved 2 **shows another factor that was absent from the first analysis: if you can run at lower utilization your average queue length stays lower.

The irony is I didn’t really care that much about the specifics of the formulae; I was just trying to show the effect but gosh darn it if I didn’t trip over my own shoelaces a lot here.

The fact that the optimal number of threads to run is changing is further complicating things but I’m basically disregarding that for now. I mentioned one of the interactions — that there might not be enough items in the queue — but of course another is scheduling overhead which I’m disregarding entirely. Not to mention the presence of critical sections in the code which might prevent us from scaling linearly.

The only real point here is that small computational benefits can actually end up being bigger than they appear at first because of the side-benefits they bring.

I’ll post again after I’ve had someone double check the corrected model for accuracy such as it is.

OK I’ve had a couple people look at this and I’m pretty sure it’s right now. Plus or minus the simplification I made in modelling the multi-threading that’s going on. Really you could get rid of that and it wouldn’t affect the concept I’m trying to illustrate. Just disregard the delay to the back end in my example and it’s a straight queue (an M/M/1 queue).

If it’s wrong now at least I can say other smart people also got it wrong ðŸ™‚