Source: On Random Sampling over Joins. Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya, Sigmod 1999.

What?

- Random sampling as a primitive relational operator: SAMPLE(R, f) where R is the relation and f the sample fraction.
- SAMPLE(Q, f) is a tougher problem, where Q is a relation produced by a query
- In particular, focus on sampling over a join operator
- Can be generalize to arbitrarily deep join trees

Motivation:

- Data mining scenarios – CUBE, OLAP, stream queries- need to sample a query rather than evaluate it
- Statistical analysis when dealing with massive data
- Massively distributed computing (information storage/retrieval) scenarios

#### Details:

- Sampling Methods:

- With Replacement (WR),
- Without Replacement (WOR),
- Coin Flips (CF)

- Conversion between methods is straightforward, as per my previous note.
- Dimensions:

- Sequential stream (critical for efficiently) vs random access on a materialized relation
- Indexes, vs Stats vs no information
- Weighted vs Unweighted sample

- Note: Weighted, Sequential sample is the most general case

Sequential, unweighted: CF semantics – easy.

Sequential, unweighted, WOR – easy: Reservoir Sampling

Sequential, unweighted, WR – Algorithms:

__Black Box U1:__ Relation R with n tuples, get WR sample of size r

Need to know the size of n 🙁

Produces samples while processing, preserves input order, O(n) time, O(1) extra memory

x = r

i = 0

while(t = stream.Next())

{

Generate random variable X from BinomialDist(x, 1/(n-i))

result.Add( X copies of t)

x = x – X

i = i++

}

__Black Box U2: __

No need to know n . With some modification can preserve the order

Does not produce result till the end. O(n) time, O(r) space

N=0

Result[1..r]

while(t = stream.Next())

{

N++

for(j = 1 to r)

{

if(Rand.New(0,1) < 1/N)

Result[j] = t

}

}

return Result

.csharpcode, .csharpcode pre

{

font-size: small;

color: black;

font-family: consolas, “Courier New”, courier, monospace;

background-color: #ffffff;

/*white-space: pre;*/

}

.csharpcode pre { margin: 0em; }

.csharpcode .rem { color: #008000; }

.csharpcode .kwrd { color: #0000ff; }

.csharpcode .str { color: #006080; }

.csharpcode .op { color: #0000c0; }

.csharpcode .preproc { color: #cc6633; }

.csharpcode .asp { background-color: #ffff00; }

.csharpcode .html { color: #800000; }

.csharpcode .attr { color: #ff0000; }

.csharpcode .alt

{

background-color: #f4f4f4;

width: 100%;

margin: 0em;

}

.csharpcode .lnum { color: #606060; }

##### Weighted Sampling

The above two algorithms can be easily modified for the weighted case:

__Weighted U1__

x = r, i = 0

W = sum of w(t), the weights for each input tuple t

while(t = stream.Next() && x>0)

{

Generate random variable X from BinomialDist(x, w(t)/(W-i))

result.Add( X copies of t)

x = x – X

i = i+ w(t)

}

__Weighted U2__

W=0

Result[1..r]

while(t = stream.Next())

{

W = W + w(t)

for(j = 1 to r)

{

if(Rand.New(0,1) < w(t)/W)

Result[j] = t

}

}

return Result

##### The difficulty in Join Sampling

Example: R1 = {1, 2, …, 1000}, R2 = {1, 1, 1, ….. 1}. Unlikely that R1(1) will be sampled, and SAMPLE(R1) SAMPLE(R2) will contain no result

- SAMPLE does not commute with join
- Sample tuple t from R1 with probability proportional to |R2(t)|

__Algorithms__

__Algorithms__

Let m1(v) denote the number of tuples in R1 that contain value v in the attribute to be used in equi-join.

__Strategy Naive Sampling: produces WR samples__

__Strategy Olken Sample: produces WR samples __Requires indexes for R1 and R2

Let M be upper bound on m2(v) for all values A can take, which is essentially all rows in R2 (?)

while r tuples have not been produced

{

Randomly pick a tuple t1 from R1

Randomly pick a tuple t2 from A=t1.A ( R2 )

With probability m2(t2.A)/M, output t1 t2

}

__Strategy Stream Sample__ [Chaudhury, Motwani, Narasayya]

No information for R1, R2 has indexes/stats.

- Use a with-replacement strategy and get a sample WR (S1) from R1 WHERE tuple t (from R1) has weight m2(t.A)
- while(t1 = S1.next())

{

t2 = random sample from (SELECT t from R2 where t.A = t1.A)

output t1 t2

}

- ‘non-oblivious sampling’, where the
**distribution of R2 is used to bias the sample from R1** - What about R1 R2 R3?
- Pick non-uniform random sample for R1 R2 whose distribution depends on R3
- Sample from R1 using stats for R2 and R3.
- Using the same biasing idea to push down both operand relations
- Cross-dependent sampling strategy difficult

PingBack from http://www.biosensorab.org/2008/02/11/random-sampling-over-joins/