What I've been working on lately that has kept me from doing nearly anything else can be found at:
MS-OFFCRYPTO is very detailed documentation of exactly how we do cryptography for binary and OOXML documents. Overall, it covers:
- Encryption and obfuscation
- Password to modify
- Digital signing
Before someone else figures out some of this and makes fun of us on Slashdot, let me be the first to detail what's really going on. Hang on – some of it is (in the immortal words of Warren Zevon) not that pretty at all. On a happier note, in the even more immortal words of Monty Python, "it got better" – what we're shipping now is quite good (for encrypting OOXML documents), and we have plans to make it even better.
Let's start with the worst of it – XOR. You may note that I consistently refused to ever say "XOR encryption", preferring the more accurate "XOR obfuscation". Not only is it the worst way to protect a document, but it was horrible to try and explain. We did all sorts of silly things to make this hard to figure out, it did nearly nothing to actually protect the data, but it sure was no fun to try and document in a normative style. I believe this obfuscation dates back to around 1994. Here's some pseudo-code to show you the sheer horror of it all – this is from one of the two password verifier approaches:
RETURNS 16-bit unsigned integer
DECLARE Verifier AS 16-bit unsigned integer
DECLARE PasswordArray AS array of 8-bit unsigned integers
SET Verifier TO 0x0000 SET PasswordArray TO (empty array of bytes)
SET PasswordArray TO Password.Length
APPEND Password TO PasswordArray
FOR EACH PasswordByte IN PasswordArray IN REVERSE ORDER
IF (Verifier BITWISE AND 0x4000) is 0x0000
SET Intermediate1 TO 0
SET Intermediate1 TO 1
SET Intermediate2 TO Verifier MULTIPLED BY 2
SET most significant bit of Intermediate2 TO 0
SET Intermediate3 TO Intermediate1 BITWISE OR Intermediate2
SET Verifier TO Intermediate3 BITWISE XOR PasswordByte
RETURN Verifier BITWISE XOR 0xCE4B
We'd have been much better off just taking a CRC16 of the password. I wanted to see just how bad this was, and wrote up a quick app to try the first 2^40 alpha passwords, and I started seeing cycles in the collisions. Values would go from under-represented to over-represented very quickly. Further inspection shows that for reasonably short passwords, you can immediately tell the number of characters from this value. Seems that for a 1 character password, the first 7 bits vary, 2 characters vary 8 bits, and so on.
Oddly enough, this is so bad that it actually has a benefit. There's so many collisions that while you may find a password that will work, there's no assurance that you found the password used to obfuscate the file, so you're not as likely to be able to get into other things by brute-forcing the 16-bit verifier. Somewhere around 8 billion passwords only generate about 16,000 verifiers, so there's literally hundreds of thousands possible passwords that could have created any given verifier.
What we have here is that someone who is actually a very good general dev (he's now a well thought of dev manager) who tried to roll his own crypto, and implement a simple hashing function. Moral of the story is DO NOT DO THIS.
If you look deeply into the obfuscation array initialization, you'll see another fairly ghastly mistake – the part of the array that coincides with the number of characters in the password actually varies, but the remainder of it is initialized based on values hard-coded into the binary along with values that end up in the document. This makes it possible to write a tool to directly extract quite a bit of information, and then there's the obvious disaster of what happens when you XOR chosen plain-text.
Our next attempt at encryption first showed up in Office 97, and featured RC4. As those of you who are familiar at all with encryption, RC4 is really hard to do correctly, and this is an example of most of the mistakes you can make with RC4. The number one rule of stream ciphers is to NEVER re-use a key stream, since the crypt text is the result of the cipher stream XOR'd with the plain text. If you reuse a key stream, you can XOR them to get the XOR of the two input plain texts, and that's often very easy to sort out. We did this more than once. Next rule of RC4 is that you have to have an integrity check, or you're subject to bit-flipping attacks – there's no integrity check. Finally, you should toss out the first 1k or so of the cipher stream, but we didn't know to do that.
Next, take into account that encryption was considered a munition at the time, and we were limited to 40-bit initial keys. These days, you can work your way through a 40-bit key space in minutes using only one computer. No need to bother with password cracking, just go directly to the key and attack it. The moral of this story is that agile encryption is a serious requirement, because while it may have taken some time to brute force 2^40 keys on a 286, a modern system will make short work of it – and in fact, it would be possible to store all the keys in a single file – it would only consume about 18 GB (much less if we did it as a tree of some kind). A second and more interesting flaw that our tester uncovered was that they thought they were doing an iterated hash (though only 16 iterations), but what they were doing in reality was concatenating the first 40 bits of the MD5 hash of the password, a 16 byte salt, and then repeating this concatenation 16 times. The old encryption library we used made this an understandable error, and it's harder to do with CryptoAPI or CNG. Moral of this story is that crypto is unlike any other code, and you should always get an expert to review what you're doing.
The RC4 encryption was then "strengthened" to use CryptoAPI, and could be configured to go to 128-bit keys, though unfortunately, the old 40-bit stuff was still default – this all happened in Office XP. Sadly, most of the implementation flaws remained. I found one place where there was triple key stream re-use (though only for 8 bytes) in the same spot. The unfortunate attempt at an iterated hash dropped back to a non-iterated hash of the password for reasons I don't understand. Some of the applications, notably PowerPoint, don't suffer as much from key stream re-use as others, and if you chose to use 128-bit RC4 and used a good password, a presentation could be relatively well protected.
As of Office 2007, we do warn you that the encryption we do on the binary documents is weak. Most of the time, it's so weak that it will only act as a mild deterrent. In some cases, we missed encrypting things entirely (which is actually called out in a KB article some time ago). My advice is that if you must encrypt a binary document, use a 3rd party tool to do it. These flaws are called out in the Security Considerations section of MS-OFFCRYPTO.
When we get to OOXML document encryption, the picture gets a lot better. I personally fixed some of the problems. We moved to AES, which is a block cipher, went to an iterated hash that far exceeds RFC2898 (50,000 cycles), and as I stated in a previous post, don't forget your password, or you may never see your data again. The only problems we had here was that we didn't support CBC (cipher block chaining) mode, and we didn't have an integrity check, though a 1 bit flip would result in 16 bytes of junk, so it isn't as high priority a problem as with RC4. We'll address both of these problems in the future, as well as some other improvements I'll talk about once we ship it.
If you take a good look at the ODF specification (as of 1.1) with regards to crypto, you see some of the same sorts of issues. They do some interesting things:
- The number of times to iterate on the password hash is relatively low (1000), and is a fixed value.
- The encryption algorithm must be Blowfish. Blowfish may well be a nice algorithm, but it hasn't been seriously validated, isn't FIPS compliant, isn't Suite-B compliant, and simply can't be used by some customers that require FIPS, Suite-B, or the analogues in their countries. Bruce Schneier himself has this to say about it – "At this point, though, I'm amazed it's still being used. If people ask, I recommend Twofish instead." It would be better to require an algorithm that has been validated, and better yet to allow it to be agile.
- The integrity check and the password verifier are the same thing. There is no way to know whether the user has the wrong password, or whether a bit got flipped. This has the potential for data loss, though I'd suppose one could build a special tool to just try the decryption.
The really good news is that we have some people who are seriously good with cryptography on the Office TWC team now – our tester is really sharp, we've got a PM who previously worked in the crypto group in Windows and is on the Microsoft-wide crypto board, and we have some devs who know this stuff as well. I'm happy to say that when you encrypt an OOXML document that it will be very hard to brute force the password and retrieve the information – and it will keep getting better.