Randomly-generated passwords still have to be legal strings

If you need to generate a password for programmatic use, then you don't have to worry about generating characters that are difficult or impossible to type on a keyboard. Go ahead and mix Cyrillic with Vietnamese and throw in some Linear B while you're at it. There is no keyboard that can type all of these characters, but it doesn't matter because nobody will be typing it.

However, you should make sure that your password is a legal string.

We generate our password from a cryptographically secure random number generator. Basically, we take 256 random bits and treat them as sixteen 16-bit values. (If one of the 16-bit values is zero, then we ask for 16 more bits.)
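As a minimal Python sketch of the scheme described above (the actual systems are not Python; the function name is mine, for illustration):

```python
import secrets

def generate_password_16bit():
    """Naive scheme: 256 random bits treated as sixteen 16-bit values.
    A zero value is discarded and replaced by 16 fresh bits."""
    units = []
    while len(units) < 16:
        value = secrets.randbits(16)  # cryptographically secure source
        if value == 0:
            continue  # zero would terminate the string early; ask for 16 more bits
        units.append(value)
    return units
```

Note that nothing here checks whether the sixteen values form a well-formed UTF-16 sequence, which is exactly the bug discussed below.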

We found that sometimes, with no predictable pattern, we had interoperability problems between systems: the password produced by one system was not recognized by the other.

After much investigation, the problem was traced back to the fact that taking a bunch of non-null 16-bit values and declaring them to be a Unicode (UTF-16LE) string does not always result in a valid Unicode string.

UTF-16 has the concept of surrogate pairs, which encode characters outside the BMP as a pair of 16-bit values. The first entry in the pair is a high surrogate in the range 0xD800-0xDBFF, and the second is a low surrogate in the range 0xDC00-0xDFFF. Together, they encode a character in a supplementary plane.

If your randomly-generated string contains a value in the range 0xD800-0xDFFF, then unless you are very lucky, it will not be part of a valid surrogate pair. The string is therefore not well-formed, and various parts of the system might reject it with ERROR_INVALID_PARAMETER, or they might "fix" the problem by changing the illegal values to U+FFFD, the Unicode Replacement Character, which is used for unknown or unrepresentable characters. For example, if the protocol specifies that the password is transmitted in UTF-8, then the presence of an unpaired surrogate causes the conversion from UTF-16 to UTF-8 to fail, and consequently, the password fails to replicate to the other machine.
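You can demonstrate the failure mode in a few lines of Python, since Python strings may contain lone surrogates but refuse to encode them as UTF-8:

```python
# An unpaired high surrogate is not a valid Unicode scalar value,
# so the conversion to UTF-8 fails.
bad = "pass" + chr(0xD800) + "word"  # lone high surrogate in the middle

try:
    encoded = bad.encode("utf-8")
except UnicodeEncodeError:
    encoded = None  # conversion failed, as it would during replication
```

This is the same failure the replication step hits: the string is representable as a sequence of 16-bit units, but not as well-formed UTF-8.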

If you want to generate a random password, make sure your algorithm produces legal character sequences. A simple solution is to generate the desired amount of entropy, then hex-encode it. Yes, it isn't very space-efficient, but it gets the job done. (Assuming you don't have to meet password complexity rules.)
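A sketch of the hex-encoding approach in Python (the `secrets` module does exactly this out of the box):

```python
import secrets

# 256 bits of entropy, hex-encoded: 64 ASCII characters,
# always a well-formed string in any encoding.
password = secrets.token_hex(32)  # 32 bytes = 256 bits
```

Every output character is a lowercase hex digit, so there is no way to produce an unpaired surrogate, a noncharacter, or anything else a receiving system might choke on.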

Comments (13)
  1. Andreas says:

    Another option if you need to meet certain password complexity rules is to pick characters randomly from a well-defined alphabet where all characters are accepted on all systems (e.g. ASCII). The amount of entropy can then be controlled by the length of the password.
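[A quick Python sketch of the approach Andreas describes; the length of 22 is illustrative:]

```python
import secrets
import string

# Pick each character independently from a fixed ASCII alphabet.
# Entropy = length * log2(len(alphabet)); 62 symbols give ~5.95 bits each,
# so 22 characters provide roughly 131 bits.
alphabet = string.ascii_letters + string.digits
password = "".join(secrets.choice(alphabet) for _ in range(22))
```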

  2. The MAZZTer says:

    Base 64 is about halfway between hexadecimal and raw byte data in terms of efficiency, and it’s valid ASCII. It’s pretty much made for exactly this type of use case.

    Of course it doesn’t matter too much how you encode it as long as it’s encoded. Less efficient encoding schemes just become longer strings, but it’s the same amount of password complexity (more so if an attacker doesn’t know what encoding you’re using).
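[A minimal Python sketch of the Base64 approach, for the record:]

```python
import base64
import secrets

# 24 random bytes -> 32 Base64 characters (4 output chars per 3 input bytes),
# all printable ASCII, so the result is always a well-formed string.
raw = secrets.token_bytes(24)  # 192 bits of entropy
password = base64.b64encode(raw).decode("ascii")
```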

    1. Brian_EE says:

      The first rule of cryptography is that you assume the attacker knows your algorithm.

    2. Josh B says:

      Base64/Base85/yEnc means that unless you get phenomenally unlucky, you’re essentially guaranteed to pass high complexity tests… and fail overly restrictive ones that deny special characters or have other stupid unnecessary limitations. There is no perfect solution to password interop, just pick what works and have a fallback.

  3. French Guy says:

    You’d also need to exclude all the invalid code points: 0xFDD0-0xFDEF, 0xFFFE and 0xFFFF.

    1. henke37 says:

      Probably should exclude the private use area too.

      1. mark72 says:

        Why?

        1. Kevin says:

          On general principle. The receiving system might do something stupid with it.

          There’s actually a lot of characters you need to exclude because the receiving system might be stupid. For example, the receiving system might incorrectly assume it’s safe to compose or decompose all diacritical marks, which is particularly likely since the Unicode Consortium encourages this behavior (NFC/NFD). The receiving system might also have problems with control characters, some unusual kinds of Unicode whitespace, characters which cannot be losslessly capitalized and lowercased (or vice-versa), and so on. The receiving system might even have problems with any non-ASCII character, if it tries to do some kind of home-grown hashing algorithm and stuffs characters into chars (instead of wchar_t or whatever the Windows equivalent is called).

          It’s probably safer to just use Base64 or some other ASCII encoding. That way, you can mostly just ignore these issues.

  4. jnm236 says:

    The higher-level issue that I see is that the password is being generated at the byte level but being parsed at the code-point level. If you’re supposed to be producing code points, generate random code points and not random bytes.
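[A Python sketch of generating at the code-point level; as the comments above note, a production version would also exclude noncharacters and other problem code points, so this is not a vetted allowlist:]

```python
import secrets

def random_code_point():
    """Pick a random Unicode scalar value, rejecting values that can
    never appear in a well-formed string: NUL (which terminates C
    strings) and the surrogate range 0xD800-0xDFFF."""
    while True:
        cp = secrets.randbelow(0x110000)  # code points run 0..0x10FFFF
        if cp == 0 or 0xD800 <= cp <= 0xDFFF:
            continue  # reject NUL and surrogate code points
        return cp

password = "".join(chr(random_code_point()) for _ in range(16))
```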

  5. ga says:

    Or Base64-encode it, or Base85-encode it. I personally use a manually curated list of characters so I can avoid similar looking ones like I, l, 1.

  6. Nathan Fritz says:

    Base64 encoding works as well, and is more space efficient than hex.

    1. Joshua says:

      Thus spake the master programmer.

Comments are closed.
