Sometimes it's all in the numbers

Here's a little story about a problem I diagnosed back when I was working on sidewalk.com that I thought would be interesting to share.

It turned out that when we released the Yellow Pages features of Sidewalk, we had this weird thing going on where about once every 10 minutes we'd see a huge backlog of incoming queries, which would then subside, only to show up again about 10 minutes later.  Not like clockwork; sometimes it was 9 minutes and sometimes it was 11.

I was looking at the data wondering what might cause such a hiccup...  Investigation showed that the hiccups happened when you were unlucky enough to hit a server that had all of its threads working on “hard” queries.  The hard queries could take upwards of 45 seconds to finish, so you definitely notice the wait if there are no free threads available to work on “easy” queries, which could be resolved in more like 200ms.

So then I thought, is anything going wrong here at all, or is it just the math of it?  The hard queries seemed to be arriving pretty much at random.  That's when the lightbulb went on...  Random independent arrivals?  This sounds like a Poisson process.  How many of these could I expect in a given minute anyway?  Well, all I need to know is the number that happen in an average minute, and that was easy to get from the logs: 5.5 hard queries per minute.
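
(Quick refresher, since it's what drives the table below: if arrivals are Poisson with an average of λ per minute, the chance of seeing exactly k of them in any given minute is e^(-λ)·λ^k/k!, and the chance of seeing k or more is one minus the sum of those terms for 0 through k-1.  That “k or more” number is what the table shows.)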

Now it's time to break out the Excel POISSON function and make a quick table:

 

avg/min       0       1       2       3       4       5       6       7       8       9      10      11      12      13      14
    4.0  100.0%   98.2%   90.8%   76.2%   56.7%   37.1%   21.5%   11.1%    5.1%    2.1%    0.8%    0.3%    0.1%    0.0%    0.0%
    4.5  100.0%   98.9%   93.9%   82.6%   65.8%   46.8%   29.7%   16.9%    8.7%    4.0%    1.7%    0.7%    0.2%    0.1%    0.0%
    5.0  100.0%   99.3%   96.0%   87.5%   73.5%   56.0%   38.4%   23.8%   13.3%    6.8%    3.2%    1.4%    0.5%    0.2%    0.1%
    5.5  100.0%   99.6%   97.3%   91.2%   79.8%   64.2%   47.1%   31.4%   19.1%   10.6%    5.4%    2.5%    1.1%    0.4%    0.2%
    6.0  100.0%   99.8%   98.3%   93.8%   84.9%   71.5%   55.4%   39.4%   25.6%   15.3%    8.4%    4.3%    2.0%    0.9%    0.4%
    6.5  100.0%   99.8%   98.9%   95.7%   88.8%   77.6%   63.1%   47.3%   32.7%   20.8%   12.3%    6.7%    3.4%    1.6%    0.7%
    7.0  100.0%   99.9%   99.3%   97.0%   91.8%   82.7%   69.9%   55.0%   40.1%   27.1%   17.0%    9.9%    5.3%    2.7%    1.3%

The row we care about is the 5.5 one.  The columns are the number of events occurring in a minute and the rows are the average number of events over a long period of time.  If we look at row 5.5, column 9, we'll see that it says 10.6%, meaning that there is a 10.6% chance of having 9 or more events in any given minute, given that the average number is 5.5.  If, for instance, the average had been only 5, then we'd go up a row to 6.8%.
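
If you don't have Excel handy, the same numbers fall out of a few lines of Python.  This is just a sketch of the calculation (poisson_tail is my own helper name, not anything we had back then):

    import math

    def poisson_tail(mean, k):
        """Chance of seeing k or more arrivals in an interval that averages 'mean' arrivals."""
        # P(k or more) = 1 - P(0) - P(1) - ... - P(k-1), where P(i) = e^-mean * mean^i / i!
        return 1.0 - sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))

    # Reproduce the 5.5 row of the table above.
    for k in range(15):
        print(f"{k:2d}  {poisson_tail(5.5, k):6.1%}")

    # poisson_tail(5.5, 9)  -> ~10.6%  (9 or more hard queries in a minute)
    # poisson_tail(5.5, 14) -> ~0.2%   (14 or more)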

Great. So why do I care? 

Well... I know another number.  I know that we had done scalability testing on our system and we got the maximum net throughput at 9 threads, so that's how we had configured things.  Now with 9 threads running, we get into trouble if all 9 of them are doing something hard, because then anything that comes along is going to be waiting behind a hard problem.  Ouch.  How likely is that to happen in any given minute?  It says right there: 10.6% of minutes will have this problem -- or about 1 in 10, which lines up nicely with the hiccups we were seeing roughly every 10 minutes.  Double Ouch.

Now the good news.  Scalability testing had also shown that total throughput fell off only gently as we increased the number of threads.  Going as high as 14 threads would only cost about a 15% degradation in total throughput due to contention in our system.

How does our table look at 14 threads?

Wow, look at how nicely the probability falls off...  If the mean is 5.5, the likelihood of 14 or more hard queries in one minute is down at 0.2%.
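
Put differently (a rough conversion on my part, treating each minute as an independent coin flip):

    # How often should we expect a minute where every thread could be tied up on a hard query?
    p_bad_at_9_threads  = 0.106   # 9 or more hard queries in a minute, mean of 5.5
    p_bad_at_14_threads = 0.002   # 14 or more hard queries in a minute, mean of 5.5

    print(1 / p_bad_at_9_threads)    # ~9.4  -> a bad minute roughly every 10 minutes
    print(1 / p_bad_at_14_threads)   # ~500  -> a bad minute roughly every 8 hours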

One registry tweak later, we had our servers running at 14 threads and, presto magico, no more 10-minute hiccups.

This is a classic case of trading off raw throughput for smoother delivery.  Running more threads was slower overall, but users liked it better.