A simple random sampling strategy to produce a sample without replacement from a stream of data - that is, in one pass: O(N)

Want to sample s** **instances - uniformly at random without replacement - from a population size of n records, where n is not known.

Figuring out n would require 2 passes. Reservoir sampling achieves this in 1 pass.

A reservoir **R** here is simply an array of size **s**. Let **D** be data stream of size **n**

Algorithm:

- Store first s elements into R.
- for each element in position k = s+1 to n ,

- accept it with probability s/k
- if accepted, choose a random element from R to replace.

Partial analysis:

Base case is trivial. For the k+1st case, the probability a given element i with position <= k is in R is s/k. The prob. i is replaced is the probability k+1st element is chosen multiplied by i being chosen to be replaced, which is: s/(k+1) * 1/s = 1/(k+1), and prob that i is not replaced is k/k+1.

So any given element's probability of lasting after k+1 rounds is: (chosen in k steps, and not removed in k steps)

= s/k * k/(k+1), which is s/(k+1).

So, when k+1 = n, any element is present with probability **s/n.**

#### Distributing Reservoir Sampling

It is very simple to distribute the reservoir sampling algorithm to n nodes.

Split the data stream into n partitions, one for each node. Apply reservoir sampling with reservoir size s, the final reservoir size, on each of the partitions. Finally, aggregate each reservoir into a final reservoir sample by carrying out reservoir sampling on them.

Lets say you split data of size n into 2 nodes, where each partition is of size n/2. Sub-reservoirs R1 and R2 are each of size s.

Probability that a record will be in sub-reservoir is:

s / (n/2) = 2s/n

The Probability that a record will end up in the final reservoir given it is in a sub-reservoir is: s/(2s) = 1/2.

It follows that the probability any given record will end up in the final reservoir is:

2s/n * 1/2 = **s/n**

PingBack from http://msdnrss.thecoderblogs.com/2008/02/05/reservoir-sampling/

In the June 2008 CTP, PLINQ aggregations are more powerful than they were in the December 2007 CTP. The

The distribution of reservoir sampling is not fair to all of the items. If you want a sample of k = 2 you will only be able to pick an item from each partition which is not alright. There must be a chance for, lets say, the two last items in the list to be picked.