Understanding hash codes


On more than one occasion, I’ve seen someone ask a question like this:

I have some procedure that generates strings dynamically, and I want a formula that takes a string and produces a small unique identifer for that string (a hash code), such that two identical strings have the same identifier, and that if two strings are different, then they will have different identifiers. I tried String.GetHashCode(), but there were occasional collisions. Is there a way to generate a hash code that guarantees uniqueness?

If you can restrict the domain of the strings you’re hashing, you can sometimes squeak out uniqueness. For example, if the domain is a finite set, you can develop a so-called perfect hash which guarantees no collisions among the domain strings.

In the general case, where you do not know what strings you are going to be hashing, this is not possible. Suppose your hash code is a 32-bit value. This means that there are 232 possible hash values. But there are more than 232 possible strings. Therefore, by the pigeonhole principle, there must exist at least two strings with the same hash code.

One little-known fact about the pigeonhole principle is that it has nothing to do with pigeons. The term “pigeonhole” refers to a small compartment in a desk into which items such as papers or letters are distributed. (Hence the verb “to pigeonhole”: To assign a category, often based on gross generalization.) The pigeonhole principle, then, refers to the process of sorting papers into pigeonholes, and not the nesting habits of members of the family Columbidae.

Comments (23)
  1. DrPizza says:

    "One little-known fact about the pigeonhole principle is that it has nothing to do with pigeons"

    Erm. Yes it does. The desk compartments are named for their similarity to the compartments in pigeon lofts.

  2. DavidK says:

    Pigeonhole may mean a small compartment, but it is derived from postal uses, in which posts is deposited in a small compartment during sorting.

    This itself is derived from pigeons, as pigeon messenger services is the link to postal services. Messenger pigeons belonging to various districts would be housed in one system of compartments, and you would deposit your message in the compartment, and the pigeon keeper would attach it to the bird and release it.

    So it is derived from pigeons, just indirectly.

    Your statement remains me of one made at the Office 12 TAP recently. In regards to Dog Food #4 of the build of Office it was remarked by the presenter (Head of Office, used to be Head of Project, forgotten his name… Chris Cappasella perhaps?) that he had no idea where "Dogfood" came from, and he appears to have mis-applied it anyway.

    Dogfooding comes from Mars Confectionary in the UK. Whereby everyday the management and board members would go to the factory floor and sample each and every product as a quality test. At the time they also made dog food, and they would sample that too… thus was coined the phrase "Eat your own dog food", and from that comes "Dogfood" as a general term meaning to consume your own output, in coding it should ideally mean that Mr Gates would also use Office 12, freak about things that broke and get them fixed before they’re sold.

    Some etymology for you :)

  3. oldnewthing says:

    Duh on me. It comes ultimately from pigeons but indirectly…

  4. DavidK says:

    No worries… how’s your Swedish coming along btw?

    That was actually how I first stumbled onto your blog even though I read a fair few of the MSDN blogs, work for a partner, etc.

    I’m just trying the Level 1 Rosetta Stone course, I’ll let you know how it goes if you like.

    My attempts at evening class were terrible, I could blame the teacher for not knowing the grammar or etymology, but that’s just making excuses.

    Have you ever checked out Francis Strand? http://www.francisstrand.blogspot.com/ That helps, but maybe not as much as exposure does.

    My big problem now is how to combine cycling (which I do in London) with living in Gothenburg (which I hopefully will do very shortly). It’s a problem because unlike sidewalks that attack momentarily, they suffer from severe winters that attack for about 4 months of the year. Might end up hacking a heated blanket into some sort of biking suit and powere by dymo!

  5. Pseudo Masochist says:

    My God, somebody break out the bubbly! This is the first Raymond Chen technical topic that I not only understood completely, but knew everything that you wrote before reading the article.

    I guess even a blind squirrel finds a nut every once in a while. :)

  6. The English language is much more poetic than the French language for this matter. In French, this is known as ‘le principe des tiroirs’, which translates to ‘the drawers principle’. Much less poetic than this pigeon story !

    BTW, Pseudo Masochist, did you also know the etimology of pigeonhole ? ;-)

  7. michkap says:

    Reminds me if a post of mine explaining why GetHashCode() in the .NET Framework was not really the right solution for string indexes….

    http://blogs.msdn.com/michkap/archive/2005/02/18/376009.aspx

  8. Back to original task of creating

    sting-to-numeric-id functions/maps.

    Ternary search trees are generally better approach for that (compact and fast) and do no require non-deterministic hash functions.

    Tested by myself on real tasks (HTML/DOM parsing/rendering). Indeed, it shows significant gains in some cases. Especially in places where hash function were not designed properly.

    In general it is hard to create good hash functions – it’s design is a serious research as a rule.

    Canonical TST paper:

    http://www.cs.princeton.edu/~rs/strings/

    And about perfect hashes. There is a well known utility named gperf.exe (google:gperf.exe) – it allows to generate automaticly perfect and minimal string-to-id functions for C/C++.

  9. Ytram says:

    I’m posting this here just because it’s your last tech entry. Here the past 3 or 4 weeks, I have been getting duplicate entries from you, 15 at a time. Your 15 most recent entries get resent to my aggregator, and a minute or two later I get them again, 30 in all.

    This may have something to do with my reader of choice(Thunderbird), but if it’s not, I wanted to be sure and let you know in case it has something to do with your auto-pilot program.

  10. User says:

    i think you got i wrong about the origin of "pigeonhole". its origin is in the shape of storage compartment for *real pigeons* used in antiquity when pigeons were used for message delivery carrier pigeons). they were stored in small square holes in a similar shape to the desk compartment you linked. it’s a very old thing – there are carrier-pigeon rooms carved in stone dated from the roman empire period (for example here in israel there are few of these, i’m sure you can find them in other places too)

  11. Mr. Ed says:

    It’s always awesome when you can fit "by the pigeonhole principle" into a conversation.

  12. Miral says:

    Well, there does in fact exist a hash for unrestricted strings that satisfies uniqueness — the string itself. Nothing else will do. (ducks behind a table to avoid thrown objects.)

    And also, it’s spelled "etymology", not "etimology".

  13. Michael Fitzpatrick says:

    Can’t you do a CRC on the string to get a pseudo unique hash. To ensure its unique, add 2 bytes for the length.

  14. Norman Diamond says:

    Wednesday, August 31, 2005 7:19 PM by Miral

    > Well, there does in fact exist a hash for

    > unrestricted strings that satisfies

    > uniqueness — the string itself. Nothing

    > else will do.

    That is in fact true, in the general case. Everything else (including CRCs, including CRCs plus length fields, etc.) will have collisions sometimes. The purpose of CRCs was so that the most common kinds of transmission errors would become most reliably detectable. Not perfectly detectable, only as close as practicable with fixed-length check fields.

    In the original note:

    > For example, if the domain is a finite set,

    > you can develop a so-called perfect hash

    > which guarantees no collisions among the

    > domain strings.

    It is more complicated than that. Ordinarily you want a fixed-length hash key, and then you can’t use the string itself as a key, and then Miral’s hashing method stops working for infinite sets. But whether that leaves you with a possibility of a perfect hash or not depends on another factor. It is not enough for the domain to be finite, but the entire domain must be known exactly. When you know exactly what the strings are, there will be a way of hashing them into hash keys that are just long enough. For example a perfect hash can be produced for the keywords of the C programming language, but not for all of the identifiers that will be used in all of the C programs that will be written during the next 50 years.

  15. binaryc says:

    If you want to generate a unique hash code for every string, you have to do something like this:

    long long GetHashCode( const char *str ) {

    return *(long long *)str;

    }

    of course, that only works for strings that are 16 bytes long.

  16. Tim Smith says:

    As already stated, HASH + Length does not generated a unique value. Also, there is a downside to trying to generate perfect HASH values. As the time complexity of the HASH algorithm increases, the usefulness of using them goes down. In other words, it can take longer to generate the HASH value than to do the search.

    Also, in some of my compiler research, I found that adding the string length to the HASH value compare doesn’t do much for reducing the string compares.

    BTW: What I mean by adding the string length compare is this:

    if (Hash == Test.Hash && strcmp (string, Test.String) == 0)

    compared to:

    if (Hash == Test.Hash && Length == Test.Length && memcmp (string, Test.String, Length) == 0)

  17. Drew says:

    This isn’t as entertaining as the origins of the word "pigeonhole", but I’ll toss it in because it might help someone:

    String.GetHashCode() is peobably fine if you’re using it to populate whatever hash table you’ve implemented, because it already expects collisions and has code to deal with them, right? It’s also fine as a quick check for inequality as long as the code doesn’t take equal hashes to mean equal strings. On finding equal hashes, a check of true string equivalence would have to happen, but think of all the cycles you save not doing that most of the time.

    To best avoid collisons use a known cryptographic hash algorithm. And if you’re a dev in Windows use one of the stronger ones (not MD5, RC4, et al.) or my PM may have to beat you up (depending on the security implications, mitigation against collisions, etc.).

  18. Phaeron says:

    What people don’t realize about hashing is that it doesn’t take very many entries before the chance of a hash collision starts climbing alarmingly. For instance, 100K entries in a 32-bit hash space is a density of less than 0.01%, but the chances of at least one hash collision are nearly 70%. Even worse, that’s assuming a totally unbiased hash function, so actual chances are higher.

  19. TC says:

    Drew said:

    > To best avoid collisons use a known

    > cryptographic hash algorithm. And if

    > you’re a dev in Windows use one of the

    > stronger ones (not MD5, RC4, et al.)

    Um, RC4 is not a hash function :-(

    As a total crypto amateur, it really surprises me how many "schoolboy howler" mistakes MS has made with its crypto. For example, enciphering passwords with symmetric ciphers & secret keys, instead of using the passwords /as/ the keys, to encipher a fixed constant. The first method is trivially breakable & returns all the plaintext passwords! (oops) :-((

  20. ericgu says:

    There’s a bit of history on the use of "pigeonhole" in mathematics here:

    http://members.aol.com/jeff570/p.html

  21. Dimitre Novatchev says:

    Raymond,

    This is actually known as the Dirichlet’s Box Principle.

    http://mathworld.wolfram.com/DirichletsBoxPrinciple.html

  22. Yaytay says:

    There are lots of perfect hashing algorithms that don’t result in the original string and do produce something considerably smaller than the original string.

    They are called (lossless) compression algorithms.

    That might be useful to someone, somewhere.

  23. RobL says:

    There are lots of perfect hashing algorithms

    <snip/>

    >They are called (lossless) compression algorithms.

    Yes – but they take TIME.

Comments are closed.