Share via


Converting Between Random Sampling Methods

 

Sampling f fraction out of n records:

  1. Sampling with replacement
    1. Sample is a multi-set of fn records. Any record could be samples multiple times.
  2. Sampling without replacement
    1. Each successive sample is uniformly at random from the remaining records
  3. Independent Coin flips: choose a record with probability f. The sample has s distinct records where s is a random variable with Binomial distribution B(n, f), and has expectation of fn elements in the set.
Conversions:
  • Sample with replacement --> Sample without replacement
    • Check upon each new selection and reject duplicates
  • Coin flips --> Sample without replacement
    • Sample a larger fraction so that at least f fractoin records are present in the sample (Chernoff bounds). Reject required number of samples to ensure we get exactly f samples.
  • Sample without replacement --> Sample with replacement
    • Sample with replacement from "without replacement" sample by using corresponding probabilities for duplication
  • Sample with replacement --> Coin flips sample
    • Impossible
  • Sample without replacement --> Coin flips sample
    • Impossible

Last two can be seen by considering an adversarial case. Under the binomial distribution, there is a non-zero probability of sampling the entire population. Since any sampled subset can never recreate the entire population, S with R and S without R can never be converted to the coin flips semantics.

Source: S. Chaudhuri, R. Motwani, V. Narasayya; On random Sampling over Joins.