Many distance-related questions (nearest neighbor, closest x, ...) can be answered more efficiently using locality-sensitive hashing (LSH), where the main idea is that similar objects hash to the same bucket.
Probability of collision is higher for similar objects.
Hash the data using several LSH functions.
At query time, fetch all objects in the buckets the query object hashes to, then compute actual similarity over those candidates and return the closest.
Main point: LSH reduces the number of candidates for similarity comparison.
Example of LSH function:
A set S of d integers.
h(S) = the integers of S found at k random (zero-based) indexes r.
S = {12, 9, 13, 98, 2, 4, 43, 21, 53, 99}
k = 3, r = {3, 5, 9}
h(S) = 98499 (the elements at indexes 3, 5, and 9, namely 98, 4, and 99, concatenated)
Similar sets have a high probability of also hashing to 98499.
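A minimal Python sketch of this sampling hash (the helper name make_lsh is mine; it fixes the example's index set r = {3, 5, 9} and concatenates the sampled elements, as in the example above):

```python
def make_lsh(r):
    """Build one LSH function that reads off the elements at fixed indexes r
    and concatenates them into a bucket key."""
    def h(s):
        return "".join(str(s[i]) for i in r)
    return h

S = [12, 9, 13, 98, 2, 4, 43, 21, 53, 99]
h = make_lsh([3, 5, 9])
print(h(S))  # "98499"

# A similar list that differs only at an unsampled index hashes to the same bucket.
S2 = [7, 9, 13, 98, 2, 4, 43, 21, 53, 99]
print(h(S2))  # "98499"
```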
Use multiple different LSH functions, get a union of results as candidates.
P(h(p) = h(q)) as a function of the similarity of p and q is an S-shaped curve (plot omitted).
By controlling the parameters k and n (the number of hash functions), we can control the steepness and position of the curve.
Simple trick: the hash functions of MinHash can serve as LSH functions.
Recall that the MinHash signature matrix has sets as columns and hash functions as rows.
Partition the k hash functions into b bands of r hash functions each (b * r = k).
b bands --> the n LSH functions above
r --> plays the role of k
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace SetSimilarity
{
    class LSH<T>
    {
        int m_numHashFunctions;
        int m_numBands;
        Dictionary<int, HashSet<int>> m_lshBuckets = new Dictionary<int, HashSet<int>>();
        HashSet<T>[] m_sets;
        int[,] m_minHashMatrix;
        const int ROWSINBAND = 5;

        //first index is Set, second index contains hashValue (so index is hash function)
        public LSH(int[,] minHashMatrix, HashSet<T>[] sets)
        {
            m_numHashFunctions = minHashMatrix.Length / sets.Length;
            m_numBands = m_numHashFunctions / ROWSINBAND;
            m_sets = sets;
            m_minHashMatrix = minHashMatrix;

            for (int s = 0; s < sets.Length; s++)
            {
                for (int b = 0; b < m_numBands; b++)
                {
                    //combine all 5 MH values in the band and use that as the bucket key
                    //(need not be a sum)
                    int sum = 0;
                    for (int i = 0; i < ROWSINBAND; i++)
                    {
                        sum += minHashMatrix[s, b * ROWSINBAND + i];
                    }
                    if (m_lshBuckets.ContainsKey(sum))
                    {
                        m_lshBuckets[sum].Add(s);
                    }
                    else
                    {
                        var set = new HashSet<int>();
                        set.Add(s);
                        m_lshBuckets.Add(sum, set);
                    }
                }
            }
        }

        public int FindClosest(int setIndex, MinHash minHasher)
        {
            //First find potential "close" candidates
            HashSet<int> potentialSetIndexes = new HashSet<int>();
            for (int b = 0; b < m_numBands; b++)
            {
                //combine all 5 MH values in the band to get the bucket key
                int sum = 0;
                for (int i = 0; i < ROWSINBAND; i++)
                {
                    sum += m_minHashMatrix[setIndex, b * ROWSINBAND + i];
                }
                foreach (var i in m_lshBuckets[sum])
                {
                    potentialSetIndexes.Add(i);
                }
            }

            //From the candidates, compute similarity using min-hash and find the index of the closest set
            int minIndex = -1;
            double similarityOfMinIndex = 0.0;
            foreach (int candidateIndex in potentialSetIndexes.Where(i => i != setIndex))
            {
                double similarity = minHasher.ComputeSimilarity(m_minHashMatrix, setIndex, candidateIndex);
                if (similarity > similarityOfMinIndex)
                {
                    similarityOfMinIndex = similarity;
                    minIndex = candidateIndex;
                }
            }
            return minIndex;
        }
    }
}
For some (large) number of sets with some (large) number of values, computing the MinHash matrix is expensive, but after that, finding the closest set takes roughly 10,000x less time than a brute-force scan using the Jaccard measure. Of course, one could precompute all n^2 pairwise distances and do only n comparisons at query time. Depending on the LSH design, that may or may not take less time than comparing only the candidate fraction (because the constant overhead factor is larger).
The other advantage of MinHash + LSH is reducing the size of the signature, i.e. the size of the data that must be maintained (or communicated) in order to answer distance-related questions.
View sets as a bit-array: indexes represent each possible element, and 1/0 represents the presence/absence of that element in the set.
Then the Jaccard measure = |S1 ∩ S2| / |S1 ∪ S2|
What happens when there are n elements in each set, from a possible universe u, such that n << u?
OK, as long as |S1 ∪ S2| is not too large.
Implementation is straightforward (In C#)
class JaccardSimilarity
{
    public static double Similarity<T>(HashSet<T> set1, HashSet<T> set2)
    {
        int intersectionCount = set1.Intersect(set2).Count();
        int unionCount = set1.Union(set2).Count();
        return (1.0 * intersectionCount) / unionCount;
    }
}
Intersection: O(n log n) with a sort-merge join, or O(n) with a large constant using a hash join.
Union: O(n), again with some overhead.
Space is also O(n) at best.
Find a hash function sig (signature) such that sim(S1, S2) is approximated by sim(sig(S1), sig(S2)).
Idea 1: Sample P elements from the universe, and define sig(S) as the bits for those P elements (i.e. reduce each set to a random sample of the universe).
But this has problems with sparsity (n << u): most sampled positions will be 0 in both sets, telling us nothing.
Idea 2: So don't count entries that are absent in both sets. E.g.:
Four combinations:
A = 1, 1 (Element present in both sets)
B = 0, 1
C = 1, 0
D = 0, 0
Count: A / (A + B + C)
   | E1 | E2 | E3 | E4 | E5 | E6 |
S1 |  1 |  1 |  0 |  0 |  0 |  0 |
S2 |  1 |  0 |  1 |  0 |  0 |  0 |
sim(S1, S2) = 1 / 3
Combine ideas 1 and 2.
Randomly permute the element order (columns). The hash h(S) is the index of the first element (in the permuted order) that is present in S's bit-array.
Key insight:
P(h(S1) = h(S2)) = JaccardSimilarity(S1, S2)
Why? Both are A / (A + B + C): scanning columns in the permuted order, columns of type D (absent from both sets) affect neither hash, so the first non-D column determines both hashes, and they agree exactly when that column is of type A (present in both).
Think about this probabilistically...
Too fragile with a single permutation. Create a min-hash signature (instead of a single integer) using n random permutations.
Then mhsig(S) = the list of n first-present indexes, one h(S) per permutation.
Now, similarity(S1, S2) using min-hash is: the fraction of permutations where mhsig(S1) and mhsig(S2) agree.
The expectation of this similarity equals the Jaccard similarity measure.
We still cannot implement this efficiently: explicitly permuting the rows n times is expensive! Luckily there are some nifty tricks...
Instead of permuting rows n times, use n different hash functions, and let the hash of each row index provide the random permutation order.
But where is the min part in min-hash?
foreach (set S)
foreach (hash function h)
Find the elements that are present in S.
Compute the hash of each present element's index.
Keep the minimum hash value.
This gives you the index of the first 1 under a random permutation, hence the "min" in min-hash.
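That loop can be sketched in Python (a sketch only: the (a*x + b) mod p hash family, the prime p, and the function names are my assumptions, not the article's C# QHash below):

```python
import random

def minhash_signature(s, num_hashes=100, seed=7):
    """For each hash function, keep the minimum hash value over the elements of s.
    Each (a, b) pair defines one pseudo-random permutation of element indexes."""
    rng = random.Random(seed)
    p = 65537  # prime modulus; element indexes must be < p
    coeffs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    return [min((a * x + b) % p for x in s) for a, b in coeffs]

def estimate_similarity(sig1, sig2):
    """Fraction of hash functions on which the two signatures agree."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)
```

Identical sets agree on every hash function; disjoint sets agree on none with this family, since each (a, b) hash is injective on distinct indexes.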
The computational cost of min-hash is clearly higher, so why bother?
Answer: MinHash gives you a small signature of a (potentially large) set which can be used for similarity testing.
Summary: Signature generation time may be large, but querying time is smaller and takes less space.
A common use is in nearest-neighbor type problems, where there are many sets and you wish to find the nearest one (or the k nearest).
Compare all signature-pairs? Hint: use Locality Sensitive Hashing to eliminate pairs with dissimilar signatures -- Maybe my next post.
Here is the code..
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;

namespace SetSimilarity
{
    class MinHash
    {
        private const int m_numHashFunctions = 100; //Modify this parameter
        private delegate int Hash(int index);
        private Hash[] m_hashFunctions;

        public MinHash(int universeSize)
        {
            Debug.Assert(universeSize > 0);
            m_hashFunctions = new Hash[m_numHashFunctions];
            Random r = new Random(11);
            for (int i = 0; i < m_numHashFunctions; i++)
            {
                uint a = (uint)r.Next(universeSize);
                uint b = (uint)r.Next(universeSize);
                uint c = (uint)r.Next(universeSize);
                m_hashFunctions[i] = x => QHash((uint)x, a, b, c, (uint)universeSize);
            }
        }

        public double Similarity<T>(HashSet<T> set1, HashSet<T> set2)
        {
            Debug.Assert(set1.Count > 0 && set2.Count > 0);
            int numSets = 2;
            Dictionary<T, bool[]> bitMap = BuildBitMap(set1, set2);
            int[,] minHashValues = GetMinHashSlots(numSets, m_numHashFunctions);
            ComputeMinHashForSet(set1, 0, minHashValues, bitMap);
            ComputeMinHashForSet(set2, 1, minHashValues, bitMap);
            return ComputeSimilarityFromSignatures(minHashValues, m_numHashFunctions);
        }

        private void ComputeMinHashForSet<T>(HashSet<T> set, short setIndex, int[,] minHashValues, Dictionary<T, bool[]> bitArray)
        {
            int index = 0;
            foreach (T element in bitArray.Keys)
            {
                for (int i = 0; i < m_numHashFunctions; i++)
                {
                    if (set.Contains(element))
                    {
                        int hindex = m_hashFunctions[i](index);
                        if (hindex < minHashValues[setIndex, i])
                        {
                            minHashValues[setIndex, i] = hindex;
                        }
                    }
                }
                index++;
            }
        }

        private static int[,] GetMinHashSlots(int numSets, int numHashFunctions)
        {
            int[,] minHashValues = new int[numSets, numHashFunctions];
            for (int i = 0; i < numSets; i++)
            {
                for (int j = 0; j < numHashFunctions; j++)
                {
                    minHashValues[i, j] = Int32.MaxValue;
                }
            }
            return minHashValues;
        }

        private static int QHash(uint x, uint a, uint b, uint c, uint bound)
        {
            //Modify the hash family as per the size of possible elements in a Set
            int hashValue = (int)((a * (x >> 4) + b * x + c) & 131071);
            return Math.Abs(hashValue);
        }

        private static Dictionary<T, bool[]> BuildBitMap<T>(HashSet<T> set1, HashSet<T> set2)
        {
            Dictionary<T, bool[]> bitArray = new Dictionary<T, bool[]>();
            foreach (T item in set1)
            {
                bitArray.Add(item, new bool[2] { true, false });
            }
            foreach (T item in set2)
            {
                bool[] value;
                if (bitArray.TryGetValue(item, out value))
                {
                    //item is present in set1
                    bitArray[item] = new bool[2] { true, true };
                }
                else
                {
                    //item is not present in set1
                    bitArray.Add(item, new bool[2] { false, true });
                }
            }
            return bitArray;
        }

        private static double ComputeSimilarityFromSignatures(int[,] minHashValues, int numHashFunctions)
        {
            int identicalMinHashes = 0;
            for (int i = 0; i < numHashFunctions; i++)
            {
                if (minHashValues[0, i] == minHashValues[1, i])
                {
                    identicalMinHashes++;
                }
            }
            return (1.0 * identicalMinHashes) / numHashFunctions;
        }
    }
}
My code is public, feel free to copy and optimize.
(The rest is based on lecture notes by Rajeev Motwani, which in turn are based on notes by Jeff Ullman.)
Our focus in the database world has primarily been on retrieving information from a structured source through a structured query.
Relational DBs --> structured data, structured query
XML DBs --> semi-structured data, structured query
search --> unstructured data, unstructured query
Actually, search --> data with hidden structure, query with possibly hidden structure
Recent efforts in IR consider data with no explicit structure:
Traditional Preprocessing steps
Basic Retrieval Models
Document Representation: Set of terms
Queries: Exact matches on presence of terms using logical operators AND, NOT, OR,..
No partial matches or ranking
Reasonably efficient implementations - think set operations in relational DBs queries
Rigid: hard to control the number of docs retrieved, ranking, or relevance feedback.
Document Representation: Bag of terms (with frequency)
Queries: Set of desired terms with optional weights
t distinct terms remain after preprocessing step, forming a vector space:
Dimension = t = |vocabulary|
A weight w_ij is assigned to term i in document (or query) j.
d_j = (w_1j, w_2j, …, w_tj)
Graphically, this is the term-document matrix (rows = terms, columns = documents):
Each real-valued weight corresponds to the degree of significance of that term in the document. 0 = no significance (stop words like I, we, the, ...) or the term does not occur in the document.
Term frequency (tf): tf_ij = frequency of term i in document j, divided by the frequency of the most common term in j.
Document frequency (df_i): the number of documents containing term i.
Inverse document frequency of i: idf_i = log_2(total # of documents / df_i). The log dampens the effect relative to tf.
tf-idf weighting is commonly used to determine term weight: w_ij = tf_ij * idf_i = tf_ij * log_2(N / df_i)
A term common across all documents is given less importance, and rare terms occurring in a document are given higher weight
The query is also treated as a document vector. Now the problem is to find documents similar to it.
Need a way to compute the degree of similarity between two vectors: dot product!
Problems: it favors long documents, and it does not measure how many terms go unmatched.
Compute similarity based on Cosine of the angle between two vectors.
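A hedged Python sketch of tf-idf weighting plus cosine similarity (the toy corpus and function names are mine; tf and idf follow the formulas above):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: w_ij} dict per document,
    with w_ij = tf_ij * log2(N / df_i)."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        max_count = max(counts.values())  # tf is normalized by the most common term
        vectors.append({t: (c / max_count) * math.log2(N / df[t])
                        for t, c in counts.items()})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

v1, v2 = tfidf_vectors([["apple", "banana"], ["apple", "cherry"]])
```

Note that a term like "apple" appearing in every document gets idf = log2(1) = 0, illustrating the point that terms common across all documents are given less importance.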
With vocabulary V and document collection D, the complexity is O(|V| * |D|)!
Key points:
Overall problems with Vector Space:
Falls in the arena of Fuzzy set theory
Similar to the Boolean retrieval model, except the strictness of the logical operators is adjusted with a p-value.
p = infinity --> Boolean model
p = 1 --> Vector space
Spectrum of operator strictness in between..
A good list of comparisons: (http://www.scils.rutgers.edu/~aspoerri/InfoCrystal/Ch_2.html)
Upcoming posts on IR:
More on preprocessing, indexing, approximate matches, index compression
Optimizations for VS model
ML algorithms for IR: clustering and classification, EM, dimensionality reduction, probabilistic models, HMMs, BayesNet
Maybe get into NLP
The entropy of a random variable X is denoted H(X)
= E[log(1/p(X))] = -Σ_x p(x) log p(x)
Intuition behind the formula:
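The expectation formula is easy to check numerically; a quick sketch (the coin examples are mine):

```python
import math

def entropy(probs):
    """H(X) = E[log2(1/p(X))] = -sum p(x) * log2 p(x), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin: less surprise, lower entropy
print(entropy([0.25] * 4))   # uniform over 4 outcomes: 2.0 bits
```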
Not so basic results and some applications of IT in Part 2 ..
Source: On Random Sampling over Joins. Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya, SIGMOD 1999.
What?
Motivation:
Sequential, unweighted, coin-flip (CF) semantics: easy.
Sequential, unweighted, WOR (without replacement): easy, via Reservoir Sampling.
Sequential, unweighted, WR (with replacement), algorithms:
Black Box U1: given a relation R with n tuples, produce a WR sample of size r.
Needs to know n in advance.
Produces samples while processing, preserves input order, O(n) time, O(1) extra memory.
x = r
i = 0
while(t = stream.Next())
{
Generate random variable X from BinomialDist(x, 1/(n-i))
result.Add( X copies of t)
x = x - X
i = i + 1
}
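A runnable Python sketch of Black Box U1 (function names are mine; the binomial draw is implemented as a sum of Bernoulli trials to stay self-contained):

```python
import random

def binomial(n_trials, p, rng):
    """Draw from BinomialDist(n_trials, p) as a sum of Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n_trials))

def wr_sample_u1(stream, n, r, seed=0):
    """With-replacement sample of size r from a stream whose length n is known."""
    rng = random.Random(seed)
    result = []
    x = r  # sample slots still to be assigned
    for i, t in enumerate(stream):
        X = binomial(x, 1.0 / (n - i), rng)  # how many slots tuple t receives
        result.extend([t] * X)
        x -= X
    return result
```

At the last tuple the success probability reaches 1, so all remaining slots are assigned: the result always has exactly r entries, in input order.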
Black Box U2:
No need to know n. With some modification it can preserve the input order.
Does not produce results until the end. O(n) time, O(r) space.
N=0
Result[1..r]
while(t = stream.Next())
{
N++
for(j = 1 to r)
{
if(Rand.New(0,1) < 1/N)
Result[j] = t
}
}
return Result
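The same in Python (a sketch; each of the r result slots independently keeps the N-th tuple with probability 1/N, so each slot ends up holding an independent uniform draw):

```python
import random

def wr_sample_u2(stream, r, seed=0):
    """With-replacement sample of size r; the stream length need not be known."""
    rng = random.Random(seed)
    result = [None] * r
    N = 0
    for t in stream:
        N += 1
        for j in range(r):
            if rng.random() < 1.0 / N:  # slot j switches to the N-th tuple w.p. 1/N
                result[j] = t
    return result
```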
The above two algorithms can be easily modified for the weighted case:
Weighted U1
x = r, i = 0
W = sum of w(t), the weights for each input tuple t
while(t = stream.Next() && x>0)
{
Generate random variable X from BinomialDist(x, w(t)/(W-i))
result.Add( X copies of t)
x = x - X
i = i + w(t)
}
Weighted U2
W=0
Result[1..r]
while(t = stream.Next())
{
W = W + w(t)
for(j = 1 to r)
{
if(Rand.New(0,1) < w(t)/W)
Result[j] = t
}
}
return Result
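Weighted U2 in Python (a sketch with my own names; W is the running total of the weights seen so far, and each slot keeps tuple t with probability w(t)/W):

```python
import random

def weighted_wr_sample(pairs, r, seed=0):
    """Weighted with-replacement sample of size r.
    pairs yields (tuple, weight); weights must be positive."""
    rng = random.Random(seed)
    result = [None] * r
    W = 0.0
    for t, w in pairs:
        W += w
        for j in range(r):
            if rng.random() < w / W:  # keep t in slot j w.p. w(t) / running total
                result[j] = t
    return result
```

With all weights equal this reduces to Black Box U2; a tuple with overwhelming weight dominates the sample.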
Example: R1 = {1, 2, ..., 1000}, R2 = {1, 1, 1, ..., 1}. It is unlikely that R1's tuple with value 1 will be sampled, so SAMPLE(R1) ⋈ SAMPLE(R2) will likely contain no results.
Let m1(v) denote the number of tuples in R1 that contain value v in the equi-join attribute (and similarly m2(v) for R2).
Strategy Naive Sampling: produces WR samples
Strategy Olken Sample: produces WR samples
Requires indexes for R1 and R2
Let M be an upper bound on m2(v) over all values v that A can take, i.e. at least the maximum number of R2 rows sharing one join value.
while r tuples have not been produced
{
Randomly pick a tuple t1 from R1
Randomly pick a tuple t2 from σ_{A = t1.A}(R2), i.e. the tuples of R2 that match t1 on A
With probability m2(t2.A)/M, output t1 ⋈ t2
}
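A Python sketch of the Olken strategy (the dict-of-lists index2 stands in for R2's index; all names are mine):

```python
import random
from collections import defaultdict

def olken_join_sample(r1, r2, key1, key2, r, seed=0):
    """Sample r tuples (with replacement) from the equi-join of r1 and r2.
    key1/key2 extract the join attribute; M bounds m2(v) for the rejection step."""
    rng = random.Random(seed)
    index2 = defaultdict(list)  # simulates an index on R2's join attribute
    for t2 in r2:
        index2[key2(t2)].append(t2)
    M = max(len(ts) for ts in index2.values())  # upper bound on m2(v)
    sample = []
    while len(sample) < r:
        t1 = rng.choice(r1)                  # random tuple from R1
        matches = index2.get(key1(t1))
        if not matches:
            continue                         # t1 joins with nothing; reject and retry
        t2 = rng.choice(matches)             # random matching tuple from R2
        if rng.random() < len(matches) / M:  # accept with probability m2(v)/M
            sample.append((t1, t2))
    return sample
```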
Strategy Stream Sample [Chaudhuri, Motwani, Narasayya]
No indexes or statistics for R1; R2 has indexes/statistics.
Sampling f fraction out of n records:
The last two can be seen by considering an adversarial case. Under the binomial (coin-flip) distribution there is a non-zero probability of sampling the entire population. Since a fixed-size sample can never recreate the entire population, sampling with replacement and sampling without replacement can never be converted to coin-flip semantics.
A simple random sampling strategy to produce a sample without replacement from a stream of data, in one pass: O(n).
Want to sample s instances - uniformly at random without replacement - from a population size of n records, where n is not known.
Figuring out n would require 2 passes. Reservoir sampling achieves this in 1 pass.
A reservoir R here is simply an array of size s. Let D be data stream of size n
Algorithm:
Insert the first s elements of D into R.
For the k+1st element (k >= s): with probability s/(k+1), replace a uniformly random slot of R with it.
Partial analysis:
Base case is trivial. For the k+1st step, the probability that a given element i at position <= k is in R is s/k. The probability that i is replaced is the probability that the k+1st element is chosen, times the probability that i's slot is picked: s/(k+1) * 1/s = 1/(k+1). So the probability that i is not replaced is k/(k+1).
So any given element's probability of surviving k+1 rounds (chosen within the first k steps, and not removed at step k+1) is:
s/k * k/(k+1) = s/(k+1).
So, when k+1 = n, any element is present with probability s/n.
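The algorithm and analysis above translate to a few lines of Python (names are mine):

```python
import random

def reservoir_sample(stream, s, seed=0):
    """One-pass uniform sample of size s, without replacement, from a stream
    of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for k, item in enumerate(stream):
        if k < s:
            reservoir.append(item)              # fill R with the first s elements
        elif rng.random() < s / (k + 1):        # keep element k+1 w.p. s/(k+1)
            reservoir[rng.randrange(s)] = item  # evict a uniformly random slot
    return reservoir
```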
It is very simple to distribute the reservoir sampling algorithm to n nodes.
Split the data stream into n partitions, one per node, and apply reservoir sampling with reservoir size s (the final reservoir size) on each partition. Finally, aggregate the sub-reservoirs into the final reservoir sample by carrying out reservoir sampling on them.
Let's say you split data of size n across 2 nodes, where each partition is of size n/2. Sub-reservoirs R1 and R2 are each of size s.
Probability that a record will be in sub-reservoir is:
s / (n/2) = 2s/n
The probability that a record ends up in the final reservoir, given it is in a sub-reservoir, is s/(2s) = 1/2.
It follows that the probability any given record will end up in the final reservoir is:
2s/n * 1/2 = s/n
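The two-node argument above can be sketched as follows (a sketch; reservoir_sample is the standard algorithm restated so the snippet is self-contained, and the merge step reservoir-samples the union of the sub-reservoirs, keeping the equal-sized-partition assumption of the analysis):

```python
import random

def reservoir_sample(stream, s, rng):
    """Standard one-pass reservoir sample of size s."""
    reservoir = []
    for k, item in enumerate(stream):
        if k < s:
            reservoir.append(item)
        elif rng.random() < s / (k + 1):
            reservoir[rng.randrange(s)] = item
    return reservoir

def distributed_reservoir(partitions, s, seed=0):
    """Reservoir-sample each partition to size s, then reservoir-sample the
    concatenated sub-reservoirs down to the final size s."""
    rng = random.Random(seed)
    subs = [reservoir_sample(p, s, rng) for p in partitions]
    merged = [item for sub in subs for item in sub]
    return reservoir_sample(merged, s, rng)

data = list(range(1000))
final = distributed_reservoir([data[:500], data[500:]], 10)
```

With two equal partitions, the merge step keeps each of the 2s sub-reservoir records with probability s/(2s) = 1/2, matching the calculation above.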