OK, we want to sketch out an authentication system which is sufficiently secure against common attacks even if all the details of the system are known to the attacker. Let’s start with a simple system, take a look at what its vulnerabilities are, and see if we can mitigate them:

**System #1**

The client transmits the username and password “in the clear” to the server. The server consults a list of usernames and passwords, and grants or denies access accordingly.

There are two big problems with such a system. First, it’s susceptible to eavesdropping if the client and the server are not the same machine. If someone can listen in on the message sent from the client to the server with a network packet sniffer then the eavesdropper learns a valid username and password. Second, if the server is ever compromised and the password list is read, the attacker learns everyone’s username and password.

Can we mitigate these problems? Let’s look at the second vulnerability first. Does the server need to store the password? How could it NOT store the password? What would it compare against?

We can eliminate the need for the server to store the password if we have a **hash function**. A hash function is a function which takes a string as its argument and produces a number, usually a hundred–or-so-bit integer, as its result.

A good hash algorithm has the property that slightly different inputs produce very different outputs. A one-bit change in the input should cause on average 50% of the output bits to change. Because of this, it should be extremely difficult to deduce the an input that produces a given output, even given partial information about the input. (There are other hash algorithm properties that are important for cryptographic operations such as document signing, but that’s a topic for another day.)

Proving that a given algorithm actually has this property can be quite tricky, but we have some industry-standard hash algorithms which have withstood rigorous testing and deep analysis by professional cryptographers and are widely believed to be “one way functions” — it’s easy to go from string to number, and very hard to go backwards.

**System #2**

The client sends the username and password to the server. The server has a list of usernames and the hashes of passwords, but not the passwords themselves. (When the user first created the password, the system hashed the password and then discarded it, saving only the hash.) The server hashes the client-supplied password and compares the hashes.

This is better; now the server is not storing the passwords, just the hashes. If an attacker compromises the server, they can’t easily go from the hash back to the password. (It also has the nice property that every entry in the password table is now the same size. Longer passwords do not have longer hashes.)

But there are two new problems with this system. First, any two users who have the same password have the same hash. If one of those users is evil and compromises the server, they immediately learn who has the same password as they do.

Second, this system is susceptible to dictionary attacks. An attacker can hash every word in a large dictionary, compromise the server, and then compare every password hash to every word in the dictionary. Since dictionary words are likely passwords, the attacker will probably be able to figure out at least a few passwords.

And of course we still haven’t mitigated the fact that eavesdroppers could be listening in on the conversation between the client and the server.

Next time, we’ll add a little salt to the mix in an attempt to mitigate the dictionary attack and same-password vulnerabilities. Then we’ll see if we can use some of the hash technology to mitigate the eavesdropping attack.

You state that System #2 introduces two new problems. But, it seems to me that these ‘new’ problems are also present in System #1. At least, I see no reason to assume System #1 is impervious to dictionary attacks, and if a user compromises the server, he can match his password to other users in the password file just as easily (if not moreso) than if the system used a hash.

I suppose adding a hashing function could add a new point of attack (probing it for mathematical weaknesses, etc.) but other than that, what other security problems could it cause?

Anyways, I’m looking forward to part 3.

Sure, those problems are there, but if the passwords are in the clear, they’re trivial. You don’t even think "I can find out if someone else has my password" if you can already just read everyone’s password! You don’t think "I can compare everyone’s password against a word list" if you already have everyone’s password!

Note that by "dictionary attack" I mean getting ahold of the password file and "offline" attacking it with a dictionary. I don’t mean sitting there online trying dictionary word after dictionary word against the server — all shared-secret password systems are vulnerable to that attack. No amount of encrypting the back end list is going to mitigate against that attack. Rather, that attack needs to be mitigated by detecting it, and doing something like slowing down the server response time after the tenth bad password, for instance.

Hashing adds mathematical weaknesses, as you note. It also makes dictionary attacks cheaper. And there is also the problem of password equivalence — in some systems, as we’ll see, if you have the hash you don’t NEED the password, so why bother doing a dictionary attack?

I’ve never understood how a mathematical algorithm can be one-way. If all the details of the system are known, that includes the steps to create the hash, right?. How can you not take a hash, reverse the order of the calculations, and arrive at the original password (or a set of passwords that all hash to the same value)?

> How can you not take a hash, reverse the order of the calculations, and arrive at the original password (or a set of passwords that all hash to the same value)?

The operation has to be either fundamentally algorithmically hard to reverse, or information losing.

For instance, take two hundred digit prime numbers, multiply them together. That’s a one-way algorithm. It is computationally infeasible to go the other way — start from the composite and factor it into the original inputs. That’s the basis of the RSA cryptosystem.

Or, consider checksumming algorithms. Say, take a string, add up all the ASCII values of its characters. That’s information losing! There are billions upon billions of strings that have a given sum.

Coming up with unreversable operations is easy. Coming up with unreversable operations which have nice properties like "collisions are rare" and "small changes in input produce large changes in output" is rather harder.

When I first learned about Kerberos, I thought it was an overly complex, rather inefficient protocol; I didn’t understand why when NTLM and hashed basic auth. already existed.

Fast forward a year, with a great deal more knowledge of crypto and authentication, and I discovered v5 is a breathtakingly awesome feat. Properly implemented (which seems to be the killer) it solves all MITM and cracking problems as much as currently feasible.

A salt just forces cracking to concentate on one password at a time – it does nothing to increase the security of a single entity, but it increases the time to attack an entire system by multiplying time for one by the number of accounts attacked, compared to negligible extra time for unsalted passwords.

Indeed, Kerberos is a thing of shiny beauty, though of course, like all security systems, it trades some convenience for greater security — you’ve got to have a mutually trusted third party, you’ve got to ensure that everyone’s clocks are synchronized, etc.

Maybe I’m being dense here (and just say so if I am) but if you know all the details of the hash – meaning you know the prime numbers being multiplied – then how is that operation not reversible?

Similar for information loss, that just means you end up with multiple solutions instead of one (just like when you integrate 2x dx you get x^2+C with infinitely many C’s). If you know the hash algorithm is "sum of the ASCII values mod 256" and the hash value is 37, I can easily reconstruct "AAAb" as a password, one of many that hash to 37.

This has to be covered in a FAQ somewhere… ðŸ˜‰

"take two hundred digit prime numbers, multiply them together. That’s a one-way algorithm. It is computationally infeasible to go the other way "

Why do the numbers have to be prime? Wouldn’t this be true of any two hundred digit numbers?

> if you know all the details of the hash – meaning you know the prime numbers being multiplied – then how is that operation not reversible?

What if you DON’T know them? What if all you know is that the algorithm is "generate two hundred-digit prime numbers and multiply them", and the output of the algorithm?

> Similar for information loss, that just means you end up with multiple solutions instead of one

Right. For the "sum the ascii values", one in every 256 strings hashes to a given hash. This would be a bad password hash because even if you didn’t know the hash, you’d only have to try a few hundred possibilities to find a collision. If you knew the hash, you could easily construct a collision because the algorithm is so simple.

Cryptographic hash algorithms are specifically designed so that (a) collisions happen roughly every 2^128 strings, not every 2^8 strings, and (b) there is no algorithm known to be faster than "try every possible string" to look for collisions. If you’re going to have to try 2^128 strings, it’s going to take some time no matter how fast your box is.

A machine that could hash a trillion — 2^40 — strings a second would take 2^88 seconds. That’s 2.5 million times the age of our little planet. Get a billion such machines working together and you’re down to only ten billion years of processor time.

Now, maybe you think that there’s a weakness in the SHA or MD5 hash algorithms, and that you can design an algorithm which takes a hash and rapidly generates strings which have that hash. If so, there’s a chair waiting for you at the math department of your choice!

The MD5 faq is here:

http://www.faqs.org/rfcs/rfc1321

and the SHA1 faq is here:

http://www.faqs.org/rfcs/rfc3174.html

> Why do the numbers have to be prime? Wouldn’t this be true of any two hundred digit numbers?

Yes and no. If the question is "find the two numbers that I had in mind", then yes, this is an even harder problem because it is information losing. If the product is 36, did I start with 1 x 36, 2 x 18, 3 x 12, … ?

But if the question is just "find a factorization" then primeness is important. Factoring 2^250 x 3^200 is trivial. Finding a factorization is hard if all the factors are very large. A number which is the sum of two large prime numbers is very hard to factor.

In your last comment Eric, I think you mean, "A number which is the _product_ of two large prime numbers is very hard to factor."

Yep. Clearly I am typing faster than I am thinking…

For anyone who wants a little more mathematical detail: for very small numbers (up to 10 digits or so) using trial division to see if it has any prime factors it just about as efficient as any other method. For bigger numbers, however, its just not efficient.

There are some very good (ie, fast) algorithms that can determine if a number is prime. These algorithms are usually based on OTHER interesting properties of prime numbers (NOT by finding their factors). For example, a program called PRIMO can certify a 4769-digit prime in approximately 2000 hours of computation on a 1 GHz processor:

http://mathworld.wolfram.com/PrimalityTest.html

Once you actually determine that a (huge) number is composite, that’s the easy part. Actually finding the prime factors of a huge number is exponentially more difficult. An optimized technique called eliptic curve factorization is one way to do it. The current "champion" for this technique, however, is someone that found a 58 digit prime factor of a 141 digit number

(http://www.loria.fr/~zimmerma/records/ecmnet.html). That’s about 1/40th the size of the number than can be prime tested, and I guarantee you that this took many more than 2000 computation hours.

The bottom line is that if you randomly generate 2 100-digit primes (remember, its very easy to test a large number to see if it is prime) and then multiply them together, you end up with a 200 digit number that is simple impossible to factor using today’s technology. Ok, get a couple of years of supercomputer time, and you might get close, but hopefully by that time you’ve changed your password ðŸ™‚

MD5 is insecure. Why?

If you manage to get a collision at one stage, then the collisions propogate going forward along subsequent rounds.

Recent studies have shown how to do this for practical examples, I’m sure you can Teoma to find them.

For now, make sure you use both MD5 and SHA-1 as a temporary countermeasure. Longer term, let’s develop some new, better hashes (without MS input please, we don’t want patents on mathematics).

Teoma them? Don’t you mean A9 them? ðŸ™‚

Fabulous Adventures in Coding assays this week a, well, fabulous adventure, in simple cryptography. I know enough to get myself in deep trouble with this subject, but Eric has put together three short and knowledgeable posts that begin easy and…

"A number which is the sum of two large prime numbers is very hard to factor"

I decree that 2 is one factor of that sum.

The blog is very useful.

A recent question I got about the .NET CLR’s hashing algorithm for strings is apropos of our discussion

>> A good hash algorithm has the property that similar inputs produce different outputs.

If that’s the case, how can you be sure that the hash of the password submitted by the user attempting to log in will match the hash of that password that was stored in the database?

By "similar" I mean "similar but not identical". That is, the hash for "banana123" should be completely unrelated to the hash for "bananb123". Clearly the hashes of two identical inputs must be identical!

Sorry for the confusion.