Further Strengthening Hash Algorithms

There's been an interesting internal email thread going on this week about the use of a hash to uniquely identify some data. The two main sources of weakness in a hash algorithm are: 

  1. Collisions - the chance that two unique inputs will produce the same output.
  2. Aliasing - the chance that an input can get the hash algorithm to its start state

Problem number one is well known, and is generally less of a problem as the bit size of the output hash increases (ie, you're less likely to have a collision with a 160 bit hash than you are with a 128 bit hash). Incidentally, it's also possible to have a collision during the calculation of a hash value. This is known as a pseudo-collision. For instance, MD5 is said to have pseudo-collisions during some of its stages of calculation. MD4 was discovered to have collisions if either the first or last stages of calculation were left out, and eventually collisions were found for the entire algorithm. Cryptographic hash algorithms aim to make it as difficult as possible to create a collision, meaning that their goal is to prevent someone who knows a hash value from being able to produce some data that will create the same hash value. Some hashes, mostly checksums, are actually designed with the reverse goal in mind. For instance, error correcting codes have the goal of allowing me to see the code and the data, and if they don't match, modify the data so that they do.

Aliasing is a bit more of an interesting attack on hashing algorithms. Basically the goal of someone mounting an aliasing attack is to find some sequence of inputs that when fed to the hashing algorithm will reset the internal state back to the start state. With this information, the attacker can make any message they want have the same hash as the original. For instance, if it's found that a 1 kilobyte sequence of 0xFF bytes cause the internal state of a specific algorithm to be the same as the start state, the attacker could intercept a message, put whatever data they like at the beginning of it, follow it up with the kilobyte of 0xFF, and then put the original message at the end. Since the 0xFF bytes caused the hashing algorithm to in effect start over, the hash of the modified message will be the same as the hash of the original one. A successful aliasing attack creates a collision, since it will lead to many possible inputs with the same hash value.

Cryptographic hash algorithms are designed to make attacks against them as difficult as possible. The strength of using a cryptographic hash to verify that some data has not been modified increases dramatically when you combine the hash value with some known information about the data. For instance, even if I know an aliasing attack against an algorithm, using it will cause the resulting message to increase in size. If my hash is over a file on the file system, and I know the hash value as well as the size of that file, the aliasing attack will not work. This leaves coming up with a brute force collision as the only viable option. (Assuming of course that the attacker cannot modify my knowledge of the size of the file).

Attacks become more difficult still if the knowledge about the message is more than just a size. For instance, if I'm expecting to find an XML file, then whatever aliasing attack that is going to be mounted on me better consist of a sequence of bytes that happens to be valid XML -- not exactly something that will be easy to come up with. Any brute force collision attack had also better consist of valid XML. This gets even harder if I have a schema that the XML has to validate against. Now attackers must not only come up with an attack that consists of valid XML, it also has to match my specific schema, something that would be almost impossible to do.

Other examples of using known information to increase the difficulty of an attack would be insisting on a valid file format (for instance a valid Access database or valid JPEG image), or just that the message be of a known format. For instance, if you know a message is always going to be of the form “Meet me at xx:xx AM” or “Meet me at xx:xx PM”, it would be extraordinarily hard for an attacker to replace the time with a different one.

In summary, combining a cryptographic hash (composed of a large number of bits) with some known information about a message can greatly reduce your chances of being fooled by an attacker into accepting a modified message instead of the message originally intended for you.