Bloom filters

Imagine you're writing a simple spell checker in C. You've already collected a dictionary of 100,000 words, and you want to process the document a word at a time and find any words that aren't in the dictionary. You don't care about providing suggestions for wrong words. How would you go about this?

Clearly, comparing each word in the document to every word in the dictionary won't do. You could use some kind of trie or binary search tree, but these have quite a bit of space overhead for pointers. The most obvious solution is to use a hash table containing all the words in the dictionary. Since hash tables with good hash functions and linear probing are fairly efficient up to about 80% utilization, you only need about 125,000 entries in the table. It seems hard to beat this.
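(To see where 125,000 comes from: at 80% utilization, 100,000 words need about 100,000 / 0.8 = 125,000 slots.)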

Now, imagine you're writing this spell checker in 1985. You have 640K of memory. There's no way the hash table above could fit; the words alone would take at least 500K. Are we sunk? Actually, we're not, and I'll explain why - but first let's review hash tables a bit.

Recall how lookup works in a hash table. You use the hash function to transform the key into an index into the table. You go there, and if the table entry is empty, the key is not present. Otherwise, you must compare the key at that location to the desired key to make sure it's the right one, and not just another key that happened to map to the same location. This is too bad, because this comparison is expensive. For example, if I look up the misspelled "opulant", I may get the hash table entry containing "cheeseburger", by sheer coincidence. But really, what are the chances?
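
For concreteness, here's roughly what that lookup looks like in C with linear probing. The table layout and the djb2-style hash below are placeholder choices for illustration, not anyone's actual spell checker:

#include <string.h>

#define TABLE_SIZE 125000                  /* ~80% full with 100,000 words */

static char *table[TABLE_SIZE];            /* NULL marks an empty slot */

/* Placeholder string hash (djb2); any good hash function would do. */
static unsigned long hash(const char *key) {
    unsigned long h = 5381;
    while (*key)
        h = h * 33 + (unsigned char)*key++;
    return h % TABLE_SIZE;
}

int lookup(const char *key) {
    unsigned long i = hash(key);
    while (table[i] != NULL) {
        if (strcmp(table[i], key) == 0)    /* the expensive comparison */
            return 1;
        i = (i + 1) % TABLE_SIZE;          /* linear probing: try the next slot */
    }
    return 0;                              /* hit an empty slot: not in the dictionary */
}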

It can be shown that if the hash function is good, the chance of such a collision is roughly 1 - e^(-n/m), where n is the number of words in the dictionary and m is the size of the table. If we choose m=32n, this is about a 3% chance, which is pretty low. How can we use this? It allows us to remove the key comparison. If we look up a key and find that the table entry is in use, we simply claim that the word is spelled correctly. About 3% of misspelled words will be missed, but it won't catch all errors anyway - the user may be able to tolerate this.
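(For the table above, n = 100,000 and m = 32n = 3,200,000, so that collision chance is 1 - e^(-1/32) = 1 - 0.969, or about 0.031 - that's where the 3% comes from.)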

Unfortunately, trying to fit a hash table with 3,200,000 entries in 640K doesn't seem any easier. The key is to notice that we no longer need to have the keys in the table at all - we only care whether or not each entry is in use. We can store this information in a single bit, and 3,200,000 bits is 400K. We fit!
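
Here's a rough sketch of that bit-per-entry table in C. Again the hash function is a placeholder; the point is that nothing but the bits is ever stored:

#define NUM_BITS 3200000UL                   /* 32 bits per dictionary word */

static unsigned char bits[NUM_BITS / 8];     /* 400,000 bytes - the entire table */

/* Placeholder string hash (djb2 again); any good hash function would do. */
static unsigned long hash(const char *key) {
    unsigned long h = 5381;
    while (*key)
        h = h * 33 + (unsigned char)*key++;
    return h % NUM_BITS;
}

void insert(const char *key) {               /* set the word's bit */
    unsigned long i = hash(key);
    bits[i / 8] |= (unsigned char)(1 << (i % 8));
}

int contains(const char *key) {              /* no keys stored, so no comparison */
    unsigned long i = hash(key);
    return (bits[i / 8] >> (i % 8)) & 1;     /* wrong about 3% of the time for unknown words */
}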

This idea of exploiting the rarity of collisions to decrease time and space requirements, at the expense of occasionally being wrong, is the same idea behind Bloom filters, a data structure invented by Burton H. Bloom in 1970 (back when space really mattered). In fact, what we have described above is already a simple Bloom filter, and is said to have been used by real spell-checkers in the bad old days. However, Bloom filters also generalize this idea in a way that shrinks space requirements even further.

At NASA, one of the ways they avoid having (too many) horrible accidents is by having many independent backup systems. Even if some of the systems fail, the chance that they'll all simultaneously fail is minuscule. We can use the same idea here: instead of mapping a key into a single table entry, use k independent hash functions to map it to k locations. Set all these locations' bits to 1. Then, when we look up a word, we say that it's spelled correctly only if all k of its locations are set to 1. Here's some pseudocode:

function insert(key) {
    for each hash function f
        table[f(key)] = 1;
}

function contains(key) {
    for each hash function f
        if table[f(key)] = 0 then return false;
    return true;
}
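
To make that concrete, here is a small self-contained C version of the same two operations. Deriving the k indices from two base hashes (so-called double hashing) and the particular hash functions are my own choices for illustration:

#define NUM_BITS   800000UL    /* m = 8n bits for a 100,000-word dictionary */
#define NUM_HASHES 6           /* k */

static unsigned char bits[NUM_BITS / 8];

/* Two simple string hashes (djb2 and sdbm); hash number i is a + i*b. */
static unsigned long hash1(const char *key) {
    unsigned long h = 5381;
    while (*key) h = h * 33 + (unsigned char)*key++;
    return h;
}

static unsigned long hash2(const char *key) {
    unsigned long h = 0;
    while (*key) h = (unsigned char)*key++ + (h << 6) + (h << 16) - h;
    return h;
}

void bloom_insert(const char *key) {
    unsigned long a = hash1(key), b = hash2(key);
    int i;
    for (i = 0; i < NUM_HASHES; i++) {
        unsigned long bit = (a + (unsigned long)i * b) % NUM_BITS;
        bits[bit / 8] |= (unsigned char)(1 << (bit % 8));
    }
}

int bloom_contains(const char *key) {
    unsigned long a = hash1(key), b = hash2(key);
    int i;
    for (i = 0; i < NUM_HASHES; i++) {
        unsigned long bit = (a + (unsigned long)i * b) % NUM_BITS;
        if (((bits[bit / 8] >> (bit % 8)) & 1) == 0)
            return 0;          /* definitely not in the set */
    }
    return 1;                  /* probably in the set */
}

To use it as a spell checker, call bloom_insert once per dictionary word at startup, then call bloom_contains for every word in the document.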

If the word is not in the dictionary, the chance of contains returning true is only about:

(1 - e^(-kn/m))^k

Recall that n is the number of words in the dictionary, and m is the table size. For a given m and n, the optimal k is about (ln 2)m/n. If we set k=22, roughly the optimal value for our table above, the chance of a misspelled word getting through shrinks to a ridiculously tiny 2 x 10^(-7), making the filter about as trustworthy in practice as the full hash table approach. If we set k=6 and m=8n, this gives about a 2.1% chance, even less than before, yet our table only requires a byte per word, or 100K. There are various other enhancements that can squish this down even further.
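
If you want to play with other parameter choices, a few lines of C evaluate the formula directly (the function names here are just for illustration):

#include <math.h>
#include <stdio.h>

/* False positive rate (1 - e^(-kn/m))^k, and the optimal k = (ln 2)m/n. */
double false_positive_rate(double k, double n, double m) {
    return pow(1.0 - exp(-k * n / m), k);
}

double optimal_k(double n, double m) {
    return log(2.0) * m / n;
}

int main(void) {
    printf("%g\n", optimal_k(100000, 3200000));               /* about 22.2 */
    printf("%g\n", false_positive_rate(22, 100000, 3200000)); /* about 2e-7 */
    printf("%g\n", false_positive_rate(6, 100000, 800000));   /* about 0.022 */
    return 0;
}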

Of course, nowadays we don't have this sort of memory constraint, and spellcheckers can afford to load their massive dictionaries into main memory. So what are Bloom filters still good for? The answer is that there are still many problems where the sets that we deal with are far too large to fit in any memory. For example, one way of detecting infinite loops in a function is to simulate it while keeping a set of all its previous states, including local variable values and the current location. If it ever reaches a state it's already been in, it's in a loop. Unfortunately, maintaining a set of all its states after every instruction could quickly fill any memory. Instead, we use Bloom filters to store the previous states; we may get a false positive for an infinite loop, but the chance is small, and multiple runs can exponentially shrink this chance. Ideas like this are used in some modern probabilistic model checking algorithms for verifying systems; for example, see Bloom Filters in Probabilistic Verification.
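
As a rough sketch of the loop-detection idea: the one-integer "state" and the step function below are stand-ins for a real simulator's state and single-instruction step, and bloom_insert/bloom_contains are the functions from the sketch above.

#include <stdio.h>

void bloom_insert(const char *key);    /* from the Bloom filter sketch above */
int bloom_contains(const char *key);

static unsigned int step(unsigned int state) {   /* "execute one instruction" */
    return (state * 7u + 3u) % 1000u;            /* this toy program must eventually repeat a state */
}

int main(void) {
    unsigned int state = 1;
    char key[32];
    for (;;) {
        sprintf(key, "%u", state);               /* serialize the current state */
        if (bloom_contains(key))                 /* probably seen before (small false positive risk) */
            break;
        bloom_insert(key);
        state = step(state);
    }
    printf("probable loop at state %u\n", state);
    return 0;
}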

More generally, any time you want to remember a large number of chunks of data, particularly large chunks, Bloom filters are handy to have around. The error is not really an issue, since you can tweak the parameters to get the error as low as your application requires. They're particularly important on PDAs and embedded systems, which still have orders of magnitude less memory than desktops.

Unfortunately, implementations are lacking. No language that I know of has Bloom filters in its standard library, and sample code is hard to find. When I get a bit of time I'll write up a little implementation in C++ and/or C# for demonstration and reuse. Please ask if you have any questions, and I'll try to explain things further. Thanks for reading.

Derrick Coetzee