# Are Hash Codes Unique?

While you will hear people state that hash codes generate a unique value for a given input, the fact is that, while difficult to accomplish, it is technically feasible to find two different data inputs that hash to the same value. However, the true determining factors regarding the effectiveness of a hash algorithm lie in the length of the generated hash code and the complexity of the data being hashed.

Let's first talk about the hash algorithms themselves. Hashing algorithms generate a fixed-length hash code regardless of the length of the input. For example, the MD5 hash function always generates hash codes that are 32 bytes in length, the SHA1 hash function generates 20-byte hash codes, SHA256 generates 256-bit (32 byte) hash codes, and so on. Therefore, since there are a limited range of possible values for a given hash code and an unlimited range of values to hash, it stands to reason that the length of the hash code generated with a given hash algorithm is directly related to the difficulty of finding two inputs that will generate the same hash code.

This is easy to prove. If n is the number of possible hash codes, we only need n + 1 distinct input values to prove that there is an overlap. Granted that for most hash functions, n is rather large so we can safely assume that for any meaningful input values, it would be very difficult to find another meaningful input that would give the same hash.

That brings me to the second point—the input value itself. It is assured that if the input is anything more involved than random data, then even if you were to find two inputs that generated the same hash value, the two inputs would have no semantic relationship to one another. This is especially evident if you are hashing text, because there are many more ways to produce gibberish than there are ways to produce meaningful words. In other words, it's extremely difficult to create random words, even harder to form sentences, and very unlikely that those sentences will form a paragraph or a document and still have it make sense.

As an example, let's say you're generating a hash code for a top-secret document that details the route of nuclear warheads for disposal. The chances of someone randomly generating words until they matched the hash code generated for the original document and producing a document that makes any sense are so small that you could accurately say it would be impossible.

Therefore, the fact that each possible hash code doesn't uniquely match an input value only comes into play when you are dealing with random (nonstructured) data. An example of this would be spyware-removal applications that determine if a file is spyware by hashing each file in a specified folder or drive and then comparing that hash value to a list of known spyware-file hash codes. In this case, relying solely on the hash code would be a mistake as the files being hashed are of varying lengths with many files having no semantic meaning (as the application is not determining the meaning of the data; only the hash code values). As a result, it would be highly recommended that these applications either match on file name before hashing the file's contents (which dramatically reduces the possibility of "false positive" results) or use a hash code with a very long output value - such as SHA256.

Therefore, while there isn't a one-to-one correlation between every possible hash code and every possible input value such that all combinations are guaranteed to be unique, hash codes are an extremely reliable means of protecting data integrity.

Related Reads: Keith Brown did a full article on securing the username token with WSE 2.0 that is a good read even if you're not doing Web Services as it talks about general security issues regarding the protection of user authentication information that is being sent over the wire.

1. Tarang Waghela says:

Interesting article and a nice read

Cheers

2. Paul Watson says:

Interesting article, thanks Tom.

3. I wonder why then, why many popular sites require a fixed password length instead of allowing a user to create a password of any length they want. It seems that it would be safer to remove the length rule as this way someone who is trying to brute force their way in will have a tougher time than if he knew that the password was always 6-10 characters long.

4. Austin Lamb says:

A minor correction – SHA-256 is not a 256-byte hash, but rather a 256-bit hash (32 bytes).

5. Tom Archer says:

Paul: Thanks! I’m flattered that you read my stuff 🙂 I’m hoping now that I’m a little more situated here at Microsoft that I can begin blogging much more.

6. Tom Archer says:

Austin: DOH! Thanks for the catch!

7. Tom Archer says:

Bryan: I’m in total agreement with you. On a related topic, TechNet has a great article on "security myths" and one thing they bring up is that the better way to protect yourself is NOT with "strong" passwords, but with *long* passwords.

8. Wow, that was a really good read. Thanks for the link, Tom

hi  i downlaoded  a version of vista  (vista beta 1) but when I want to install it . i’d have an error  written in this kind

the windows can’t get the information of your disks

also i should mentione that i have  sata2 hard

what’s wrong with my system ?

11. Dan says:

Good reading Tom!.  Security starts with the developer.  It’s been years since I did serious crypto work and I had actually forgotten about using salt values in stored password (though thanks to micrsoft I had been hashing them for many years).  Great work!

12. Swapnil Kocheta says:

Wow

Geek Stuffs

Quite amazing to read from the employee of ur dream company

nice work Tom

13. Divya says:

Tom, Great piece of information.

Couldnt find it anywhere on the net so clear.

14. In the chess world, positions being searched are hashed as a chess program "thinks." If a position that has already been evaluated arises again, via a different move order for example, a program need only access its hash table (in RAM) to extract the score for that position, saving the redundant and time consuming call to the evaluation function.

Even in the 32-bit world, a hashed chess position (32 bits) and a hash key (another 32 bits) will be able to avoid "hash collisions" (a different chess position generating the same hash information) almost forever, or 1 time in 2 to the 64th trials. Even with programs searching 10 million positions per second, you would be able to "survive" without a hash collision for:

18,446,744,073,709,551,616 ÷ 10,000,000 ÷ 86,4000 ÷ 365 = over 5800 years, on average.

But, in a computer vs. computer chess tournament in 2005, one program did make a catastrophic blunder, and, as it turned out, this was a result of a hash collision that had 64-bits worth of safety.

This goes to show one of the consequences of Quantum Mechanics: If you wait long enough, anything that can happen, will happen.

15. Nathan says:

Unfortunately, with the easy of use and scalability of free databases, there are more than one site that actually have several computers churning non-stop generating and inserting MD5 hashes so you can just type in an MD5 hash and it spits out the values if it has been generated. MD5 hashes of 9 out of 10 of my past passwords came up when I entered them… It makes using salted hashes all the more important.

16. kbrowder says:

Great article, good summary of basic hash security.  Salting was particularly interesting, many refrences forget about it.  At one point I read an article about using the hash of the passphrase (usually with a very fast and small hashing algorithm) as the Salt and then hashing the end resault (with a better hashing algorithm).  The author hypothesized this process would make it less feasible to brute force using the hash expecially since the algorithm can be called on the hash itself.  I’m not sure about the validity of the mathmatics behind this technique but it sounded interesting, whish I had the link.  The final product would look something like this:

result = hash(

…);

basically there would be many salts that are all dependent on each other, assuming the algorithm is safe it seems like it would be more secure since to the attacker the salt bits would be semingly random.  Interesting stuff.

17. whattheheck says:

> TechNet has a great article on "security myths"

http://www.microsoft.com/technet/technetmag/issues/2006/05/SecurityMyths/default.aspx

Good article, but they blew it by not recognizing that forced timed password changes weaken security.

After arguing that the policy of the forced password changes could be set to every 1,900,000,000,000 years without risking security, then they set it to 90 days.

Why? Clearly, the author just really likes annoying users, since he readily admits this serves no useful purpose.  That MUCH longer periods (much more than a lifetime) would still be secure.

Forcing users to pick new password every 90 days likely results in increasingly weaker (including shorter) passwords each time.

18. rape stories says:

Your article is prety nice. It’s a pity that i didn’t see it more later.

Another minor correction… you wrote: "the MD5 hash function always generates hash codes that are 32 bytes in length, the SHA1 hash function generates 20-byte hash codes".

Actually, MD5 results in a 16 byte (128 bit) hash code. The confusion probably came about because  that works out to 32 characters when represented in hexadecimal.

SHA1 is indeed 20 bytes, or 160 bits, or 40 chars in hex.

20. Bryan says: