Set Similarity and Min Hash

Given two sets S1, S2, find similarity(S1, S2) - based not hamming distance (not Euclidean).

Jaccard Measure

View sets at a bit-array. Indexes representing each possible element, and 1/0 representing presence/absence of the element in the set.

Then Jaccard measure = eq= \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}

What happens when: n element in each set from a possible universe u, s.t. n << u?

Ok, as long as just |S1 U S2| is not too large.

Implementation is straightforward (In C#)

 class JaccardSimilarity 
{ 
    public static double Similarity<T>(HashSet<T> set1, HashSet<T> set2) 
    { 
        int intersectionCount = set1.Intersect(set2).Count(); 
        int unionCount = set1.Union(set2).Count(); 

        return (1.0 * intersectionCount) / unionCount; 
    } 
}

Intersection: O(nlogn) with sort-merge join, or O(n) with a big constant using hash join.

Union: O(n), again with some overhead.

Space is also O(n) at best.

 

Hash similarity

Find a hash function sig (signature) such that sim(S1, S2) is approximated by sim(sig(s1), sig(s2))

Idea 1: Sample P elements from the universe, and define sig(S1) as bits for P elements (i.e reduce the sets to a random sample of the universe).

But problems with sparsity (n << u)

Idea 2: So don't count entries that are absent in both sets. E.g:

Four combinations:

A = 1, 1 (Element present in both sets)

B = 0, 1

C = 1, 0

D = 0, 0

Count: A / A+B+C

 

  E1 E2 E3 E4 E5 E6
S1 1 1 0 0 0 0
S1 1 0 1 0 0 0

sim(S1, S2) = 1 / 3

 

Min Hash

Combine ideas 1 and 2.

Randomly permute element order (columns). Hash of S is element number of first element that is present in the bit-array.

Key insight: P(h(s1) = h(s2)) = jaccardsimilarity(s1, s2)

Why? Both measures are A/A+B+C

This about this probabilistically..

 

Too fragile with a single permutation. Create min-hash signature (instead of a single integer) using N random permutations.

Then mhsig(S) = list of n indexes where element is present h(S)

Now, similarity(S1, S2) using min-hash is: Fraction of permutations where mhsig(S1) = mhsig(S2)

Expectation of similarity now is same as jaccard similarity measure.

 

We still can not implement this efficiently! Luckily there are some nifty tricks..

Instead of permuting rows n times, use n different hash functions and the let hash index order provide the random permutation.

But where is the min part in min-hash?

Foreach(set S)

  Foreach(hash function h)

       Find elements with that are present in S.

           Compute hash of the element index if element is present

           Keep the hash with minimum index value

This will give you the index of the first 1 from a random permutation

 

Computation cost of min hash is clearly higher, then why bother?

Answer: MinHash gives you a small signature of a (potentially large) set which can be used for similarity testing.

Summary: Signature generation time may be large, but querying time is smaller and takes less space.

A common use is in Nearest Neighbor type problem, where there are S Sets, and you wish to find the nearest one (or k nearest).

Compare all signature-pairs? Hint: use Locality Sensitive Hashing to eliminate pairs with dissimilar signatures -- Maybe my next post.

Here is the code..

 using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;

namespace SetSimilarity
{
    class MinHash
    {
        private const int m_numHashFunctions = 100; //Modify this parameter
        private delegate int Hash(int index);
        private Hash[] m_hashFunctions;

        public MinHash(int universeSize)
        {
            Debug.Assert(universeSize > 0);

            m_hashFunctions = new Hash[m_numHashFunctions];

            Random r = new Random(11);
            for (int i = 0; i < m_numHashFunctions; i++)
            {
                uint a = (uint)r.Next(universeSize);
                uint b = (uint)r.Next(universeSize);
                uint c = (uint)r.Next(universeSize);
                m_hashFunctions[i] = x => QHash((uint)x, a, b, c, (uint)universeSize);
            } 
        }

        public double Similarity<T>(HashSet<T> set1, HashSet<T> set2)
        {
            Debug.Assert(set1.Count > 0 && set2.Count > 0);

            int numSets = 2;
            Dictionary<T, bool[]> bitMap = BuildBitMap(set1, set2);
            
            int[,] minHashValues = GetMinHashSlots(numSets, m_numHashFunctions);

            ComputeMinHashForSet(set1, 0, minHashValues, bitMap);
            ComputeMinHashForSet(set2, 1, minHashValues, bitMap);

            return ComputeSimilarityFromSignatures(minHashValues, m_numHashFunctions);
        }

        private void ComputeMinHashForSet<T>(HashSet<T> set, short setIndex, int[,] minHashValues, Dictionary<T, bool[]> bitArray)
        {
            int index = 0;
            foreach (T element in bitArray.Keys)
            {
                for (int i = 0; i < m_numHashFunctions; i++)
                {
                    if(set.Contains(element))
                    {
                        int hindex = m_hashFunctions[i](index);

                        if (hindex < minHashValues[setIndex, i])
                        {
                            minHashValues[setIndex, i] = hindex;
                        }
                    }
                }
                index++;
            }
        }

        private static int[,] GetMinHashSlots(int numSets, int numHashFunctions)
        {
            int[,] minHashValues = new int[numSets, numHashFunctions];

            for (int i = 0; i < numSets; i++)
            {
                for (int j = 0; j < numHashFunctions; j++)
                {
                    minHashValues[i, j] = Int32.MaxValue;
                }
            }
            return minHashValues;
        }

        private static int QHash(uint x, uint a, uint b, uint c, uint bound)
        {
            //Modify the hash family as per the size of possible elements in a Set
            int hashValue = (int)((a * (x >> 4) + b * x + c) & 131071);
            return Math.Abs(hashValue);
        }

        private static Dictionary<T, bool[]> BuildBitMap<T>(HashSet<T> set1, HashSet<T> set2)
        {
            Dictionary<T, bool[]> bitArray = new Dictionary<T, bool[]>();
            foreach (T item in set1)
            {
                bitArray.Add(item, new bool[2] { true, false });
            }

            foreach (T item in set2)
            {
                bool[] value;
                if (bitArray.TryGetValue(item, out value))
                {
                    //item is present in set1
                    bitArray[item] = new bool[2] { true, true };
                }
                else
                {
                    //item is not present in set1
                    bitArray.Add(item, new bool[2] { false, true });
                }
            }
            return bitArray;
        }

        private static double ComputeSimilarityFromSignatures(int[,] minHashValues, int numHashFunctions)
        {
            int identicalMinHashes = 0;
            for (int i = 0; i < numHashFunctions; i++)
            {
                if (minHashValues[0, i] == minHashValues[1, i])
                {
                    identicalMinHashes++;
                }
            }
            return (1.0 * identicalMinHashes) / numHashFunctions;
        }

    }
}

My code is public, feel free to copy and optimize.

(Rest based on lecture notes of Rajeev Motwani, which in turn is based on notes from Jeff Ulman)