Socks, birthdays and hash collisions



Nath knits awesome socks Suppose you’ve got a huge mixed-up pile of white, black, green and red socks, with roughly equal numbers of each. You randomly choose two of them. What is the probability that they are a matched pair?


There are sixteen ways of choosing a pair of socks: WW, WB, WG, WR, BW, BB, … Of those sixteen pairs, four of them are matched pairs. So chances are 25% that you get a matched pair.


Suppose you choose three of them. What is the probability that amongst the socks you chose, there exists at least one matched pair?


Well, we already know that chances are 25% after you pick out just the first two. If you get a matched pair right off, great. If you don’t, then there are two colours in hand you might match. So the odds are going to be a lot better.


There are 64 ways of choosing three socks: WWW, WWB, … and so on. Of those 64 possible combinations, 40 of them have at least one matched pair, so that’s about a 63% chance.


Suppose you choose four. There are 256 possible combinations, 232 of which have at least one matched pairs, so that’s a 91% chance.


Of course by the time we get to five socks, we have a 100% chance of getting a pair; five socks, four colours, there have got to be two alike.


It might appear that we’ve slightly messed up the probabilities here because once you choose one white sock, odds are slightly better that the next sock you pick will not be white, since there are now fewer white socks in the pile. But if the pile is big enough then we can neglect this minor problem.


From now on we’ll call getting a matched pair a “collision”.


It seems clear that as we increase the number of possible sock colours, we decrease the probability of getting a collision in some sample size. And as we increase the size of the sample, we increase the probability of the sample containing a collision.


Suppose you have 365 different colours of socks – perhaps each sock has a number on it giving its colour number, so that we can tell them apart – and a pile of about six billion socks, with roughly equal numbers of each sock colour. What is the probability that we’ll get a collision if we pull out two socks at random? One in 365, clearly. Three socks? A little bit better than double that.  And so on. To work out the exact probabilities we’d work out the number of possible combinations, and the number of those combinations that contain at least one collision.


Turns out that the point where you have a better than 50% chance of having a collision is 23 socks. This is the famous “birthday paradox”; if instead of 365 colours of socks we have 365 possible birthdays (ignoring leap years, the fact that more people are born on certain days than others, and so on) and we have a large group of people to choose from at random, then once you get to 23 people the odds are about fifty-fifty that two of them have the same birthday. By 50 people, chances are about 97% that two have the same birthday.


Which is maybe a nice party trick next time you’re at a party with 30 to 50 people – if you go around the room and ask everyone to say their birthday, odds are very good that two people will say the same day. But what’s my point?


Suppose you have just over four billion possible sock colours and a truly enormous supply of socks of each colour, such that each one is about equally likely. You start pulling socks out of the pile. What is the probability that you get a collision based on the number of socks you pull out? Four billion is an awfully big number compared to 4 or 365. What’s your intuition about the likelihood of a collision? How long until you have to start worrying about it?


Not nearly as long as you might think. I’ve worked out the math and summarized it in this handy log-log chart:


Collision


Man, is there anything better than getting a straight line on a log-log chart?


Anyway, you end up with a 1% chance of a collision after about 9300 tries, and a 50% chance after only 77000 tries. By the time you get into the mid six-digit numbers chances are for practical purposes 100% that there is a collision in there somewhere.


This is why it is a really bad idea to use 32 bit hash codes as “unique” identifiers. Hash values aren’t random per se, but if they’re well-distributed then they might as well be for our purposes. You might think “well, sure, obviously they are not truly unique since there are more than four billion possible values, but only four billion hash codes available. But there are so many possible hash values, odds are really good that I’m going to get unique values for my hashes”. But are the chances really that good? 9300 objects is not that many and 1% is a pretty high probability of collision.

Comments (34)

  1. Stuart says:

    Doesn’t the same logic apply to GUIDs? But then what’s the point of the GU part of the name?

    (Basically, I’m hoping that you can tell me that there’s something clever that GUIDs do to mitigate this issue)

    Yes, there are two clever things. First, as Micah notes below, GUIDs are not exactly random; they are based in part on the current time and the MAC address of the network card in the machine. Since no two machines in the world have the same MAC address, and since time keeps on slippin’ into the future, this for practical purposes guarantees that the GUID is unique in both time and space. (Machines that do not have network cards generate special GUIDs which are in a “known to be potentially not unique” range.)

    See Raymond’s article on this for more details:

    http://blogs.msdn.com/oldnewthing/archive/2008/06/27/8659071.aspx

    Second, my article was about 32 bit hashes. GUIDs are 128 bits. Instead of four billion kinds of socks in the pile, there are over 250 billion billion billion billion kinds of socks. That space is enormous. So the logic continues to apply; as the size of the space gets bigger — in this case, 64 billion billion billion times bigger — the number of samples needed to make a collision likely gets a few billion billion times bigger too. — Eric

  2. Micah says:

    GUID generators usually use time and MAC address to prevent collisions, but beyond that you could generate random 128 bit values until the heat death of the universe and be certain (enough) that you wouldn’t get a collision.

  3. Matt says:

    The same problem would apply to GUIDs… technically; but the probability space for them is large enough that they are (probably) (close enough to) globally unique. It depends on how they are generated, and a host of other conditions.

    Aside: From http://en.wikipedia.org/wiki/Birthday_attack , a 1% chance of collision on a (random) 128 bit value requires 2.6*10^18 samples… or 2.6 Exa-samples.

  4. jsrfc58 says:

    Eric wrote:

    “There are sixteen ways of choosing a pair of socks: WW, WB, WG, WR, BW, BB, … Of those sixteen pairs, four of them are matched pairs. So chances are 25% that you get a matched pair.”

    Perhaps, but aren’t you assuming that all the socks are the same type? What if some are children’s socks, women’s socks, etc.? What about wool socks and dress socks?

    To complicate matters, there are also other types of socks that are not worn–wind socks, drift socks, etc.

    Clearly more research needs to be done on this problem — Eric

  5. Erik says:

    Was this considered when the GetHashCode method on System.Object was designed?

    GetHashCode is not designed to provide a unique identifier; it’s designed to make it fast and easy to balance a hash table. — Eric

  6. Erik says:

    Also it’s amazing that all of us commenters have the same silhouette!

    <<<

  7. Femaref says:

    I liked how you lead to your point. Way better imagining the problem by an example than the usual "yeah, be careful because you can get a hash-collision". Good writeup concerning an important topic.

  8. Remus Rusanu says:

    The very same issue creeps up in database locking as well: http://rusanu.com/2009/05/29/lockres-collision-probability-magic-marker-16777215/. Because any lock is hashed into a simple ‘resource’ string, collisions are virtually guaranteed on large batches.

  9. Phil Nash says:

    I made a hash of my sock draw too. The chances of me wearing the same coloured socks on any given day gets lower as time slips into the future.

  10. Phil Nash says:

    Of course I meant "drawer" above. I made a hash of the joke too.

    (no edit button – grumble grumble)

  11. csharptest says:

    Interesting post and well articulated; however, I can’t for the life of me figure out why someone would use a hash as an identifier.  Frankly it’s much more convenient to use a naturally occurring key or, if none exists, a generated GUID.

    You can’t imagine why anyone would do something cheap, trivial, fragile and wrong when a correct and robust solution could be implemented by doing work and applying rational thought? Apparently you are new to this planet; we have this thing called “human nature”. :-)  — Eric

    I’m just curious what piece of code led to this blog topic ;)

    As noted below, this kind of thing comes up frequently on StackOverflow. Also, as one of my technical interview questions I pose a problem which requires the generation of a unique 32 bit integer; many people immediately go to random numbers or hash codes and are then unable to compute the probability of collision. — Eric

  12. DH says:

    Great post!

    Eric, "time keeps on slippin’ into the future" sounds a lot like Steve Miller :-)

  13. ShuggyCoUk says:

    @csharptest.

    I suspect this post is motivated by a lot of people asking Eric "how do you make a good unique id with an int"

    these same people are unlikely to be thinking of perfect hashes or some controlled sequence number (the only realistic solutions) they instead think GetHashcode() will do it for them.

    You see quite a few of these on StackOverflow.

  14. John says:

    Add don’t forget about the sock thief troll that has a fetish for blue socks…

  15. This is a great article but doesn’t necessarily help you out with a viable alternative. Some people often mistakenly turn to a hashcode (such as String.GetHashcode()) to generate an ID because they see it as random, alphanumeric, and they need something which isn’t just a number. Integer IDs are a nightmare to replicate / merge with other database machines or staging environments in your infrastructure, for example.

    For anyone who comes to this article from Twitter or a web search looking into the problem and who wants something high-performance that still supports replication, check out COMB GUIDs. It’s based on an article written way back in 2002 by Jimmy Nillson (link as of March 2010 is here http://www.informit.com/articles/article.aspx?p=25862&seqNum=7) and luckily has been implemented into NHibernate as Guid.Comb generation method.

    Cheers,

    Alex Norcliffe

  16. jrsfc58 says:

    Eric wrote:

    “As noted below, this kind of thing comes up frequently on StackOverflow. Also, as one of my technical interview questions I pose a problem which requires the generation of a unique 32 bit integer; many people immediately go to random numbers or hash codes and are then unable to compute the probability of collision.”

    But wouldn’t it be difficult to compute the probability of a collision on a random number if the person doesn’t know the algorithm used to generate that number?

    Exactly, yes. A truly random number should have the probability of collision outlined in this article. If its not truly random, then yes, the probabily of collision depends strongly upon the details of the algorithm used. — Eric

    Yes, more bits buys you a lowered risk of avoiding a collision, but I suppose to some extent it depends what one is going to do with that “unique integer”. If it is going to be used for cryptography purposes, it may work in the short term, but in the long term with the increase in computing power that is available all it probably really buys you is more time.

    Absolutely. One of the reasons I ask this as part of an interview question is I get to see whether the candidate will ask clarifying questions about how the code is being used. What you want to know is (1) what is a reasonably expected time to first collision? If the candidate can ask questions about how many of these “unique” numbers need to be generated per second then they can work out whether there’s going to be a collision every single time the program runs, or if we could go for millions of years with only a tiny chance of a collision. And (2) what are the negative consequences of a collision? If a collision means crashing the program or sending sensitive user data to an untrusted caller, then a solution which produces a collision in any reasonable amount of time is likely to be unacceptable. If the negative consequences are merely that a hash algorithm becomes a little sub-optimal, then no big deal. Good candidates clarify these things before they try to write code. — Eric

     

    Thanks for the chart/explanation…very informative.

  17. Schmuli says:

    Based on your previous commented reply, Eric, are you implying that when using a HashTable, with its algorithm for collisions, this is not a serious issue, even with a million or so entries, updated every couple of seconds? Even if the ID of the entry is based on a string value (say, a GUID), in the hash table this ID will still be hashed first, so what have I gained from using a string value?

    Also, as mentioned in @ShuggyCoUk’s comment, what is a perfect hash? I have recently been involved with Java development, and there the direction is to always override the ‘hashCode’ and ‘equals’ methods, with the hash-code usually involving prime-numbers and bitwise operations on 64-bit values.

  18. Pavel Minaev says:

    > Also, as mentioned in @ShuggyCoUk’s comment, what is a perfect hash?

    A perfect hash is a function hash(x) such that, for any x != y, hash(x) != hash(y). Simply put, it’s a hash which has no collisions.

    >  I have recently been involved with Java development, and there the direction is to always override the ‘hashCode’ and ‘equals’ methods, with the hash-code usually involving prime-numbers and bitwise operations on 64-bit values.

    It is certainly not common practice to override equals/hashCode in Java (nor Equals/GetHashCode in .NET) for every random class. It is normally done for those classes which either 1) have value semantics (e.g. String), or 2) have their own identity distinct from runtime object identity (e.g. ORM entities with explicit primary key fields – as seen in Hibernate etc).

    If your entity class has a unique ID which is within the domain of hash values (i.e. 32-bit integer for hashCode & GetHashCode), then that ID is the perfect hash for that class. Otherwise the usual dance with "prime numbers and bitwise operations" is done on the primary key to trim it to desired size in a more-or-less evenly distributed way, but then the result is not a perfect hash anymore.

  19. tvanfosson says:

    I had a friend whose wife tied his pairs of socks together so that he’d have 100% probability of getting a matched pair on the first try.  This was because he would stop after picking the second sock whether it matched or not, no matter how many were in the drawer.

    Hey, if I turned on the lights to get a matched pair of socks, I’d wake her up. Oh, wait, you weren’t talking about me and my wife… as you were. — Eric

  20. Paul says:

    I just want to clarify is it ok to use GetHashCode for a Quick Check of the content of a String for example, and if you do get a match couple that with a Real Check?

  21. Anon says:

    @Paul, My intuitive answer is no, it’s not ok. Intuitively, I think it could take longer (having to generate a hash code). The lazy programmer in me would stick with String.Equals and it’s plethora culture-aware overrides. If it is actually faster, I reckon string.Equals already does it for me.

    But what do your benchmarks say? Does your profiler indicate you’re spending 80% of your time in String.Equals()? What about case-(in)sensitive and/or culture (in)sensitive comparisons?

  22. Harold says:

    Phew, looks like buying all those GUIDs on Ebay was a deal after all.

  23. Tim Barrass says:

    "Man, is there anything better than getting a straight line on a log-log chart?"

    Two very big thumbs up.

    Great article, thanks.

  24. @csharptest: I am guilty! I have used the hashcode as an identifier!

    Well, I need to weaken this statement, let me explain: In very seldom events, our sever may get some kind of hiccup when dealing with a request: The exception’s hashcode will also be written into the log file. This id can be picked up at the client which made the request.

    My reason of thought was that the event of an unexpected exception is seldom (the chart appears to prove me right) and that we do actually have additional correlation clues, like the day and possibly even the time. Plus, if there would be a collision it would be far from the end of the world. So far, this lil’ correlation id has served us well.

  25. jsrfc58 said: To complicate matters, there are also other types of socks that are not worn–wind socks, drift socks, etc.

    Eric said: Clearly more research needs to be done on this problem

    I’ve been trying to think of a way of punning on garter and Gartner for too many minutes now. It just ain’t gonna happen.

  26. Anthony P says:

    There are also multicolored socks. For example, former-president Clinton’s cat.

    The former president’s former cat. Socks died in 2009. — Eric

  27. Gabe says:

    As a quick first-order approximation, cut the number of bits in half to get a 50% chance of a collision and shift right 3 bits to get a 1% chance. E.g. if you have 2^32 sock colors, choose approximately 2^16 socks to get a 50% chance of a duplicate or 2^13 socks to get about a 1% chance of a duplicate color.

  28. Drew says:

    Everyone talks about version 1 GUIDs (time & mac address), but almost all software generates version 4 GUIDs (random numbers).  I’ve experienced at LEAST 10 duplicate GUIDs in the last 8 years.

  29. Jack says:

    Its useful to point out that 23 = sqrt(365), and that the point at which it is likely (>=50%) that you will have a collision is at the squareroot of the size of your keyspace.

    Following from this, you can trivially calculate the tipping point for any hash given its length, as it is 2**(hashlength/2)

    Futhermore, you chances are that good if you assume random uniform distribution. When you’re hashing a bunch of items, they probably have things in common that make them less than perfectly random.

    Moral of the story: Be careful using hashes as UIDs, but if you’re using a 128bit hash you will *probably* be fine.

  30. Random832 says:

    >> Everyone talks about version 1 GUIDs (time & mac address), but almost all software generates version 4 GUIDs (random numbers).  I’ve experienced at LEAST 10 duplicate GUIDs in the last 8 years.

    That is statistically very unlikely (even if it’s not impossible) unless they’re being generated in a particularly poor manner.

    note that using System.Random or C’s rand() function or anything else that only starts with a 32-bit seed qualifies – you’ve basically got a 32-bit random* number (the original seed) that’s been stretched out to 122 bits without adding any more real randomness, since all your bits are dependent on that original seed so you can only generate 2^32 possible GUIDs. You’ll have that same high probability of collisions with semi-random GUIDs generated in the same way by the same implementation, and an unknown (somewhere between that high probability and zero) probability between it and different implementations.

    *or not even really so, since the default seed for System.Random is time-dependent… and rand()’s _default_ seed is fixed, but its _typical_ seed is going to also be time-dependent

  31. JM says:

    @Paul: no, this is pointless. Computing the hash of a string requires looking at every character, and then you might as well compare those pairwise while you’re at it. If you have to do expensive culture-sensitive equality comparisons it *might* pay off, but it’s the kind of optimization you’d have to be very sure about (per Anon, measuring carefully would not be optional). It’s highly unlikely culture-sensitive string comparisons are a bottleneck in your application — and if they are, it’s more likely you want them to be culture-insensitive instead in any case.

    As to why hashing uses every character: it doesn’t need to, but if you don’t look at every character, collisions become extremely likely for some input sets (namely the ones that have identical characters in whatever positions you’re checking, which is very common with generated strings). In general that’s not a good way of getting a performance gain, and String.GetHashCode() dutifully looks at every character.

    People sometimes suggest that the framework should pre-computer or cache hash values of strings to speed up comparisons (or just hashing in general), but this is silly for the simple reason that this would penalize *every* string creation for the sake of speeding up repeated comparisons. This doesn’t pay off — string creation is vastly more common than repeated string comparison against the same string instance (mind you, it needs to be the same *instance*, we can’t know it’s the same *characters* as that’s exactly the problem we’re trying to solve). Of course, when stored in a dictionary, the hash value *is* kept along with the key, because we *do* have repeated comparisons. In C#, a string-based "switch" statement is implemented as a dictionary lookup by the compiler, again a situation where you do have repeated comparisons which is dutifully optimized.

    To summarize, computing hashes makes sense when you’re hashing, it doesn’t make sense in general. Sounds obvious, but in my experience it’s remarkably hard to convince people of this — they get fixated on the "comparing hash values is faster" angle without being able to consider the total cost of all operations.

  32. stan says:

    Say you want to compile a list of unique names from a phone book. Obviously some type of hash table would be the way to do this, but given the comments above, collisions seem likely. So the question would be which hashing algorithm would be optimal for avoiding collisions?

  33. MikeH says:

    "… straight line on a log-log" of course that only goes so far (since of course it’s never going to go above 100%) but is there a graph that doesn’t tail-off?  

    Could it be that if we compare socks pulled out with expected number of pairs that the straight line will continue. It seems reasonable that expected number approximates to percentage with small percentages but (switching examples) if you ask 364 people for their birthday you will still not get 100% but you would expect to get dozens(?) of collisions. (sadly I’m not sure how to calculate this and how many pairs of socks do you have when you have 3 matching socks anyhow?)

    …or how would we draw the chances of NOT getting a pair as you pull out large number of socks (there’s a factorial in there somewhere!)?

  34. Harold says:

    Hey guys, you are aware that .NET generates "V4" GUIDs by default, right? So it’ll be a pseudorandom number, not your MAC address.