A friend of mine who works in a big Bay Area company recently asked me this question. “There is an infinite stream of numbers and you have to provide a median of those numbers a any given time.” At first it sounds impossible (BTW, do not confuse median with mean or average). When asking further about the problem domain he provided me some more information that helped me solve the problem in couple of minutes. I am not going to give you the answer, but I will provide you the domain information. The infinite stream of numbers is actually the response time of pages served by the web server. At any time you are supposed to find median response time. The obvious limited resource constraint applies.

First, my understanding of a median of a sample (X_i) for i=1..n – it’s the number that separates half of the samples (i..e half of the numbers are smaller, and the other half are bigger). On infinite sequences, the things become more interesting – it really depends on how the sequence converges to a limit.

Let’s assume that the sequence (X_i) converges to a single point X. Let’s note the median as the number M. In that case, M=X because (1) we have an infinity of points and (2) if X < M then M separates two intervals – one containing an infinity of points, and another one containing a finite set of points, which is absurd.

If there is no defined convegence, then it’s easy. Let’s define X_sup the maximum convergence limit, and X_inf the minimum. (X_sup can be even +infinite or finite, and X_inf can be -infinite or finite). Then, any point between X_sup and X_inf is a median because it separate two infinite subsequences of the original sequence.

Given that computing a median requires a holistic view of the data, wouldn’t the solution require clustering or sampling?

>>> Let’s assume that the sequence (X_i) converges to a single point X. Let’s note the median as the number M. In that case, M=X because (1) we have an infinity of points and (2) if X < M then M separates two intervals – one containing an infinity of points, and another one containing a finite set of points, which is absurd.

Actually, I was wrong. The median makes sense only if (X_n) contains both:

1) a monotonically increasing subsequence (X_n_i) that converges to X, and

2) a monotonically dcreasing subsequence (X_m_i) that converges to X.

In general, if we define:

X_inf = inf{ lim(X_n_i), where (X_n_i) is a monotonically increasing convergent subsequence of (X_n)}

and

X_sup = sup{ lim(X_n_i), where (X_n_i) is a monotonically decreasing convergent subsequence of (X_n)}.

Now, if both X_inf and X_sup are well-defined, then the median can be any number in the [X_inf, X_sup] interval.

Thanks, Adi

The general idea is to recognize that the distribution of these numbers will be a bell curve. We are trying to approximate the point on the x axis such that the left and right side have equal area/distribution. Since there could be infinite points between the min and max range you would need to sample them, divide them in intervals and approximate the area.

The simple solution is to divide the min-max range in to equal intervals and then count the numbers in these intervals. At any time to find the median start from the min-max range and do the left-right balances approach to approximate the median. The problem reduces to finding the range, the min-max limits. Let’s talk about it in a moment.

You can improve the accuracy of the approximation by using the smaller interval near the center of the curve. Since you do not know the center initially you can observe the distribution and adjust your intervals over time. Every time you change the interval limits you distribute the counts in proportion of the interval division.

One other important thing in solving this problem is to find out the min-max range and given the domain knowledge this is not hard. The web server response can never be negative hence the min limit is 0. Almost all of the web applications will have max limit on the response time typically 7-8 seconds, if they do not have one, the browser will time out. So we also know the max limit, now it is possible that you might get number that is greater than max, but that’s ok. It shows that something is wrong and you must raise an alert. So you have one bucket for everything that is greater that max limit.

The other interesting solution to the problem is to start with assuming no limits and then adjusts them as you see the numbers.

Vipul,

Your algorithm won’t work for most of the cases. Consider for example the infinite sequence of alternating numbers: 0,1,0,1,0,1,0,1,0,1…

It is obvious that any number between 0 and 1 is a median, correct? (in the sense that "splits" the sequence in two infinite subsequences)

Thanks, Adi

Adi,

As I pointed out in the question the crucial thing to solving any problem practically is recognizing the problem domain. The stream you pointed does not represent the response time of the web server.

Hope that helps.

Hi Vipul. Modi

Would you mind mentioning the answer to this problem. ?

Hi,

I like the approach you propose (coarse grain view) and I would like to understand a little better what you mean when you say:

"""The simple solution is to divide the min-max range in to equal intervals and then count the numbers in these intervals. At any time to find the median start from the min-max range and do the left-right balances approach to approximate the median. The problem reduces to finding the range, the min-max limits. Let’s talk about it in a moment."""

I did not understand very well the subsequent explanation, if you have time could you re-detail it a little bit. Thanks.

I think the problem was very unscientific in it’s form 🙂 It’s stated that calculating the median of an infinite series has been solved, when in fact a whole bunch of (hidden) assumptions are made, and end up solving something that is not even close to what you started with. What you end up with, which should be stated in the "assignment" is an approximation of the median, with a certain accuracy, the accuracy requirement is not stated either, of a range limited series, which is not infinte (since you obviously have to do something about the count for the bins when they approach "infinity", dividing will work for a while, but not infinetly, depending on the distribution. And how do you KNOW the distribution will be normal? What is the reasoning behind that assumption 🙂

Can any body give a solution of this problem?If I do a random sampling on the numbers and find the median then what is probability that it will be the median of the original data stream? IS there any standard randomized algorithm for this problem?

class MedianFinder

{

map<int, int> count;

int total;

void process(int n)

{

count[n]++;

total++;

}

int getMedian()

{

int pos = (total – 1) / 2;

foreach(int key, count.keys)

{

if (pos < count[key])

return key;

pos -= count[key];

}

}

}

Enjoy,

Andrew.

—

Andrew Tomazos <andrew@tomazos.com> <http://www.tomazos.com>

Share a blog on this topic: codercareer.blogspot.com/…/no-30-median-in-stream.html