Throughput and Latency Considerations

You know, a funny thing happened when I joined the CLR team a few years ago. After I'd worked at MSN for 7 years and come back to the developer division, they decided they wanted me to work on performance on the desktop. I thought for sure they were going to ask me to work on server side stuff. Go figure.

Anyway, recently I've been looking at some server-related pieces of the framework, so today I felt like writing about some of the stuff I learned while on MSN.

As always, I'm only going to try to be approximately correct -- mostly in the interest of remaining brief and getting the main notions across without getting buried in myriad exceptions.

OK, on to the business.

When you are building some kind of web application, if you care about customer perceptions of performance, latency is king. In fact this is pretty much true on the client as well, but in server applications there's just no question. Latency is El Supremo.

So what are the chief sources of latency? Well let's itemize some in a typical configuration -- two server tiers plus client.

  • Network latency from the client's machine to the front end of the service (considerable if the client is on another continent)
  • Time waiting in the queue on the front end
  • Request parsing/preparation (computation)
  • Network latency to back end services
  • Back end processing (computation)
  • Back end processing (disk)
  • Network latency for the response from the back end
  • Processing of results on the front end (computation)
  • Network latency back to the user

It can be more complicated than the above but that's good enough to study for now.

OK so lots of sources of latency. Let's put in some times and see what these might look like, in round numbers, just as a for-instance.

Network latency from the client's machine                100ms
Time waiting in the queue on the front end               135ms
Request parsing/preparation (computation)                  2ms
Network latency to back end services                       1ms
Back end processing (computation)                          1ms
Back end processing (disk)                                 7ms
Network latency for the response from the back end         1ms
Processing of results on the front end (computation)       3ms
Network latency back to the user                         100ms
Total latency                                            350ms
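
If you like poking at budgets like this in code, here's a quick back-of-envelope sketch in Python -- the figures are just the made-up ones from the table above:

    # Back-of-envelope latency budget for one request; all numbers in ms,
    # straight from the made-up example above.
    budget = [
        ("Network latency from the client's machine",          100),
        ("Time waiting in the queue on the front end",         135),
        ("Request parsing/preparation (computation)",            2),
        ("Network latency to back end services",                 1),
        ("Back end processing (computation)",                    1),
        ("Back end processing (disk)",                           7),
        ("Network latency for the response from the back end",   1),
        ("Processing of results on the front end (computation)", 3),
        ("Network latency back to the user",                   100),
    ]

    total = sum(ms for _, ms in budget)
    print(f"Total latency: {total}ms")  # 350ms
    for stage, ms in budget:
        print(f"  {stage:<53} {ms:>3}ms ({ms/total:5.1%})")

Run that and the two 100ms network hops plus the 135ms queue wait account for well over 90% of the total -- which is exactly what makes the despair below so tempting.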

You may despair looking at these numbers: "How can I possibly expect to affect the user's experience with my code?"

Request parsing/preparation (computation)                  2ms
Processing of results on the front end (computation)       3ms
Total front end processing                                 5ms

"Look, the part I wrote is only responsible for 5ms of the 350. If I did my part in zero time it would still take 345ms. So great I can improve the system by a total of about 1.5%. Why do I care about performance of my code again?"

Wait, not so fast Charlie :)

Let's further assume that this beast of ours can run flat out at 100% CPU usage because it scales nicely (we're usually not so lucky, but what the heck). It takes 5ms of compute time to process one request, and say we have only one CPU, so flat out that's 200 requests per second. Now, each request occupies its thread for a total of 15ms (2+1+1+7+1+3 -- isn't it amazing how the math is working out here? :) ), but only 5ms of that is computing, so to keep the CPU fully busy I need 15/5 = 3 threads working all the time.

And one last piece of fun before I sum it up in a table. Suppose the load has been brisk enough that a standing backlog has built up: each of our 3 threads has one request in progress and 9 more queued behind it, so there are 30 requests in the system at any given time. A new request waits behind the 9 queued ahead of it, so the average wait time is 9 * 15ms = 135ms; add the 15ms of processing and each request spends 150ms in the system. Little's law says that's perfectly self-consistent: 200 requests/second times 0.150 seconds is exactly 30 requests in flight.

You'd think I rigged this or something :)

(Thanks to Ian Griffiths for catching an arithmetic slip in the original version of this example.)

Requests/second                 200
Processing time                15ms
Work time                       5ms
Requests in flight               30
Threads                           3
Queued requests per thread        9
Average wait time             135ms
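
If you want to check that arithmetic yourself, here's the same reasoning as a tiny Python sketch. Remember that the 9-deep per-thread backlog is a given of the scenario, not something the model derives:

    # The queueing arithmetic from the table above, as a tiny model.
    # One CPU, and the 9-deep per-thread backlog is a given of the
    # scenario (built up by earlier bursts), not something we derive.
    rps        = 200        # requests per second
    work_ms    = 5          # front-end compute per request
    backend_ms = 10         # waiting on the back end (1+1+7+1)
    queued     = 9          # requests ahead of a new arrival, per thread

    processing_ms = work_ms + backend_ms       # 15ms thread occupancy per request
    threads       = processing_ms / work_ms    # 3 threads saturate the CPU
    wait_ms       = queued * processing_ms     # 135ms spent in the queue
    sojourn_ms    = wait_ms + processing_ms    # 150ms from arrival to response
    in_flight     = rps * sojourn_ms / 1000    # Little's law: L = lambda * W

    print(f"threads={threads:.0f} wait={wait_ms}ms in_flight={in_flight:.0f}")
    # threads=3 wait=135ms in_flight=30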

So suppose we found a way to reduce our processing time by just 1ms. This can change things a lot!

                              Before    After
Requests/second                  200      200
Processing time                 15ms     14ms
Work time                        5ms      4ms
Requests in flight                30       28
Threads                            3     2.8*
Queued requests per thread         9        9
Average wait time              135ms    126ms

* 2.8 average ready-to-run threads
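
Here's the before/after as a quick standalone check, under the same assumptions as above:

    # Before/after: what one saved millisecond of front-end compute buys.
    # Same assumptions as above: the 10ms of back-end waiting is unchanged
    # and the standing backlog of 9 queued requests per thread stays put.
    rps, backend_ms, queued = 200, 10, 9

    for work_ms in (5, 4):
        processing = work_ms + backend_ms              # thread occupancy per request
        wait       = queued * processing               # time in the queue
        in_flight  = rps * (wait + processing) / 1000  # Little's law again
        print(f"work={work_ms}ms processing={processing}ms "
              f"wait={wait}ms in_flight={in_flight:.0f}")
    # work=5ms processing=15ms wait=135ms in_flight=30
    # work=4ms processing=14ms wait=126ms in_flight=28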

So we reduced our processing time by one thin millisecond and we got a 9ms reduction in wait time. Overall we saved 10ms off of the original 350. Not bad, we got a 10x multiplier.

Importantly, another thing we did was actually increase the server capacity. Since the work time is now only 4ms instead of 5ms, we could choose to run the system at 250 requests/second. If we did that -- you can do the math -- the latency actually goes *up* due to greater queue length. But it's not so bad, only 14ms worse. Why do we care? Well, we're serving 25% more requests per server, so we'd need only 80% of the original number of machines to keep up with the load. That's a potential 20% cost savings... Not bad. Or we could increase the load to 215 requests per second on average per server and the users wouldn't notice; that's a good 7% cost savings right there.
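
The fleet-sizing math goes like this -- just a sketch, and the aggregate load number is made up purely for illustration:

    # Rough fleet-sizing math. Per-server capacity is CPU bound:
    # 1000ms of compute per second divided by compute per request.
    # The aggregate load figure is made up purely for illustration.
    total_load_rps = 10_000

    for label, work_ms in (("before", 5), ("after", 4)):
        capacity = 1000 / work_ms      # req/s per server at 100% CPU
        machines = total_load_rps / capacity
        print(f"{label}: {capacity:.0f} req/s per server -> {machines:.0f} machines")
    # before: 200 req/s per server -> 50 machines
    # after: 250 req/s per server -> 40 machines   (80% of the fleet)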

So these things matter a lot. Sometimes I have meetings where I hear things like "the time is dominated by latency to my database anyway, so it won't make any difference if I shave a few milliseconds off my compute time" -- well, it's true that it won't matter as much as it would if you were fully compute bound, but you will affect overall latency, which helps, and you can make a big dent in throughput, which hits operational costs. You do want to make money on this stuff, right? :)

So don't write off the little compute costs quite so quickly.

But I'd be remiss if I didn't mention the other big thing you need to do. Look at those transmission times to the user: 100ms in each direction. The best way to help with those is to look at what you're sending. Can you trim down your HTML? If you can, then you can reduce transmission delays. That will reduce your compute costs and your egress costs, and give your users a better experience. Nobody likes fat pages. There are big, directly targetable numbers there.
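
To put rough numbers on that, here's a sketch of transmission time versus page size; the bandwidth figure is an assumption I picked for illustration, not a measurement:

    # Rough page-weight math: bytes on the wire cost time over and above
    # the 100ms of propagation delay. The bandwidth here is an assumption
    # picked for illustration, and real connections add TCP effects.
    bandwidth_bps = 8_000_000          # assume ~8 Mbit/s to this user

    for page_kb in (100, 50):          # before and after trimming the HTML
        transmit_ms = page_kb * 1024 * 8 / bandwidth_bps * 1000
        print(f"{page_kb}KB page: ~{transmit_ms:.0f}ms of transmission time")
    # 100KB page: ~102ms of transmission time
    # 50KB page: ~51ms of transmission time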

And, it's actually worse in typical web cases because the response is often loaded with pictures.

So while the point of this particular article was that computational costs are more important than they first seem, don't lose sight of the fact that the biggest savings often come from directly targeting page weight. Although, in retrospect, I probably should have made the 100ms number for getting to the server smaller so it would be in line with the rest of the numbers. Well, you can make your own scenario up if you don't like mine.