Are Hash Codes Unique?

While you will hear people state that hash codes generate a unique value for a given input, the fact is that, while difficult to accomplish, it is technically feasible to find two different data inputs that hash to the same value. However, the true determining factors regarding the effectiveness of a hash algorithm lie in the length of the generated hash code and the complexity of the data being hashed.

Let's first talk about the hash algorithms themselves. Hashing algorithms generate a fixed-length hash code regardless of the length of the input. For example, the MD5 hash function always generates hash codes that are 32 bytes in length, the SHA1 hash function generates 20-byte hash codes, SHA256 generates 256-bit (32 byte) hash codes, and so on. Therefore, since there are a limited range of possible values for a given hash code and an unlimited range of values to hash, it stands to reason that the length of the hash code generated with a given hash algorithm is directly related to the difficulty of finding two inputs that will generate the same hash code.

This is easy to prove. If n is the number of possible hash codes, we only need n + 1 distinct input values to prove that there is an overlap. Granted that for most hash functions, n is rather large so we can safely assume that for any meaningful input values, it would be very difficult to find another meaningful input that would give the same hash.

That brings me to the second point—the input value itself. It is assured that if the input is anything more involved than random data, then even if you were to find two inputs that generated the same hash value, the two inputs would have no semantic relationship to one another. This is especially evident if you are hashing text, because there are many more ways to produce gibberish than there are ways to produce meaningful words. In other words, it's extremely difficult to create random words, even harder to form sentences, and very unlikely that those sentences will form a paragraph or a document and still have it make sense.

As an example, let's say you're generating a hash code for a top-secret document that details the route of nuclear warheads for disposal. The chances of someone randomly generating words until they matched the hash code generated for the original document and producing a document that makes any sense are so small that you could accurately say it would be impossible.

Therefore, the fact that each possible hash code doesn't uniquely match an input value only comes into play when you are dealing with random (nonstructured) data. An example of this would be spyware-removal applications that determine if a file is spyware by hashing each file in a specified folder or drive and then comparing that hash value to a list of known spyware-file hash codes. In this case, relying solely on the hash code would be a mistake as the files being hashed are of varying lengths with many files having no semantic meaning (as the application is not determining the meaning of the data; only the hash code values). As a result, it would be highly recommended that these applications either match on file name before hashing the file's contents (which dramatically reduces the possibility of "false positive" results) or use a hash code with a very long output value - such as SHA256.

This brings us to the concept of password hashing. Instead of storing their users' passwords, some applications will store only the hash code for the password. When the user attempts to log into the system, the application hashes the user's input password and compares that hash value to what has been stored for the valid password. This is a very good technique because it means that the users' passwords are not stored and therefore - theoretically cannot be hacked as hash codes cannot be reverse-engineered to their pre-hashed values. However, assuming that a hacker got his hands on the password-hash file, a brute-force method could be employed by the hacker to continually generate hash codes until a code is generated that matches a hash code on the password list. The value the hacker used to generate the matching hash code could then be used to allow the hacker unauthorized access.

This is a very real risk as passwords generally have a fixed length (such as 6-10 characters) . Therefore, the hacker doesn't need to generate as many hash codes as he would if attempting to regenerate the same hash code that was originally hashed for a much longer value. As a result, anyone using this technique to protect user passwords should go a step further by adding a salt, or static piece of data, to the input value before it is hashed. This would produce an almost fool-proof system as even if someone were to obtain a list of hashed passwords and were to produce a password that generated to a hash code in the list, this password would not work because when attempting to log in, the system would apply the salt and the resulting hash code would differ from that stored for the user. In other words, the hacker's input value would already be salted, but the system is expecting a non-salted value that it salts. As a result, the addition of the salt greatly increases the difficulty in cracking the passwords, because now the hacker would need to 1) steal the hash-code file, 2) know that a salt has been added, 3) know the value of the salt and 4) know exactly how the salt was added.

Therefore, while there isn't a one-to-one correlation between every possible hash code and every possible input value such that all combinations are guaranteed to be unique, hash codes are an extremely reliable means of protecting data integrity.

Related Reads: Keith Brown did a full article on securing the username token with WSE 2.0 that is a good read even if you're not doing Web Services as it talks about general security issues regarding the protection of user authentication information that is being sent over the wire.