So, you hear all the time – Hashtable, hash value, etc. You’ve seen GetHashCode method in Object type. But what is hashing?

Hashing is a process of applying one of many hash algorithms. It’s frequently used to create/use hash tables, which significantly increase search efficiency making it a constant time O(1) (compare that binary tree search, which is a function of O(logn)). A hash table is simply an array of data (the actual table where the data to be searched is stored) and a mapping function, known as a hash function.

http://en.wikipedia.org/wiki/Hash_function site gives the following definition:

A hash function or hash algorithm is a function for examining the input data and producing an output of a fixed length, called a hash value. Two different inputs are unlikely to hash to the same hash value.

There are many hash functions. You might’ve heard of secure hash algorithms such as SHA, or other like MD2, MD4, MD5, etc. But to understand it, let’s look at two simple examples below (as Damien Morton correctly stated in the comment below, these are not good examples for “real-life” usage, but they do convey the point in an easy to understand way):

public int DivisionHashCode(string data)

{

// Division method hashing

int result = 0;

foreach (char c in data.ToCharArray())

{

result += c;

}

result %= 2053; // let’s assume our max string is 2048 chars… find the next prime number

return result;

}

public int MiddleSquareHashCode(string data)

{

// Middle square algorithm, where M = 128, k = 7, w = 1

int result = 0;

foreach (char c in data.ToCharArray())

{

result += c;

result = (result*result)>>6; // Thanks to Damien Morton for the correction of a bug in the original post

}

return result;

}

The returned hash value is, in essence, the index into the array of data. So, instead of searching for a match by comparing each key (linear search), or by leveraging the powers of the binary tree search, we simply convert the key (string) to a hash value (integer), which gives us an index into the data table. Then, it’s a simple data[index] operation to get the value…

While simplified, this, in essence, is how hashing works…

*** Special thanks to Damien Morton for the comment below ***

If you’re going to write about hash functions, you should probably not use these spectacularily bad ones.

The first hash function is the cannonical bad hash function. "abc" hashes to the same place as "bac" and "cba" and so on and so forth.

The second hash function is also similarily bad. Its not even a correct implementation of mid-square. (c*c)>>6 loses information.

You probably meant something like:

result += c;

result = (result*result)>>6;

even this is lossy.

What makes a good hash function? Well, a hash function generally has a state, and an operation by which new data is mixed with the state.

state = 0

foreach (c in data)

state = mix(state,c)

return state

The goal of a good hash function should be to make it so that any given bit of any given part of the input has an 50-50 chance of affecting any of the bits of the output. Under some circimstances, you can relax this a bit, for example where only a small number of bits of the hash are used.

With hash functions, theres lots of ways to skin the cat.

Probably the most widely used hash function is the FNV hash, which goes like this:

hash = init

foreach (c in data)

hash = hash * prime

hash = hash ^ c // this is an XOR operation

return hash

You have to be careful with this hash, as the low bits arent as random as the high bits. Best to shift down rather than mask off when calculating a hash bucket index.

You have not mentioned collisions where two different values have the same hash code. With huge amount of data (let’s say more than 2^32), hashing to an int like GetHashCode will result in collision even if the hash algorithm is perfect.

G