When generating a random password, the result must still be a valid string

A customer had a problem with auto-generated random passwords. Their password generator built a string by choosing each character randomly as a code unit in the range U+0001 to U+FFFF. (They avoided U+0000 because that is the string terminator.) They didn't mind that the resulting passwords were untypeable, because the passwords were going to be entered programmatically.
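
The customer's actual code isn't shown, but a hypothetical generator along these lines illustrates the approach:

```cpp
// Hypothetical sketch of the customer's approach: each UTF-16 code unit is
// chosen independently and uniformly from U+0001..U+FFFF. Nothing prevents
// the result from containing unpaired surrogates.
#include <random>
#include <string>

std::wstring GenerateRandomPassword(size_t length)
{
    std::random_device rd;
    std::mt19937 gen(rd()); // a real generator would use a cryptographic RNG
    std::uniform_int_distribution<int> dist(0x0001, 0xFFFF);

    std::wstring password;
    for (size_t i = 0; i < length; ++i) {
        password.push_back(static_cast<wchar_t>(dist(gen)));
    }
    return password;
}
```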

Things seemed to work great. One computer created the account with the untypeable password, and another computer was able to log in with that password.

But occasionally, they would find that the first computer would create the account, but the second computer couldn't sign in with the password. If they re-ran the password generator, then everything worked again. If they went back to the original password, it stopped working.

They were occasionally generating haunted passwords.

If you take a bunch of randomly-generated code units, the result may not be a legal Unicode string. This is true for UTF-16LE (which is the default encoding used by Windows) as well as UTF-8.

What is going on is that occasionally, the randomly-generated code units form an invalid Unicode string, say, a high surrogate not followed by a low surrogate. When the account is created locally, the UTF-16LE string is passed directly to the underlying service, which creates the account with the specified password as-is.
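
Concretely, a code unit in the range 0xD800 to 0xDBFF (a high surrogate) is valid only when immediately followed by one in the range 0xDC00 to 0xDFFF (a low surrogate), and a low surrogate may not appear on its own. A small check along these lines (a sketch, not anybody's production code) is enough to detect the problem:

```cpp
// Sketch: report whether a UTF-16 string is well formed, i.e. every high
// surrogate (0xD800..0xDBFF) is immediately followed by a low surrogate
// (0xDC00..0xDFFF), and no low surrogate stands alone.
#include <string>

bool IsWellFormedUtf16(const std::wstring& s)
{
    for (size_t i = 0; i < s.size(); ++i) {
        wchar_t c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF) {            // high surrogate
            if (i + 1 >= s.size()) return false;     // nothing follows it
            wchar_t next = s[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;
            ++i;                                     // consume the low surrogate
        } else if (c >= 0xDC00 && c <= 0xDFFF) {     // lone low surrogate
            return false;
        }
    }
    return true;
}
```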

The string is then transmitted to the other computer, and the other computer tries to sign in with that password. However, the network protocol for the service specifies that the password is encoded as UTF-8 before being hashed or encrypted or whatever it is that network protocols do to protect passwords.

The problem is that an invalid UTF-16LE string cannot be converted to Unicode code points, and therefore cannot be re-encoded as UTF-8 for transmission on the wire. At best, you get U+FFFD REPLACEMENT CHARACTER, which says "Um, there was something here, but it wasn't a valid Unicode code point, so I have no way of expressing it."

The password goes out over the wire, and the server receives the UTF-8 string and transcodes it back to UTF-16LE, and the strings don't match because invalid strings do not round trip from UTF-16LE to UTF-8 and back.
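
You can observe the round-trip failure with the ordinary Windows conversion functions (a sketch; the actual transcoding happens inside the service and its network stack):

```cpp
// Sketch: convert UTF-16 to UTF-8 and back with the default (lenient)
// conversion, then compare. An unpaired surrogate becomes U+FFFD on the way
// out, so the round-tripped string no longer matches the original.
#include <windows.h>
#include <string>

bool SurvivesUtf8RoundTrip(const std::wstring& original)
{
    int utf8Len = WideCharToMultiByte(CP_UTF8, 0, original.data(),
        (int)original.size(), nullptr, 0, nullptr, nullptr);
    std::string utf8(utf8Len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, original.data(), (int)original.size(),
        utf8.data(), utf8Len, nullptr, nullptr);

    int wideLen = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
        (int)utf8.size(), nullptr, 0);
    std::wstring roundTripped(wideLen, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(),
        roundTripped.data(), wideLen);

    return roundTripped == original; // false for, e.g., a lone L"\xD800"
}
```

Passing WC_ERR_INVALID_CHARS instead of the lenient default makes WideCharToMultiByte fail on an unpaired surrogate rather than substitute U+FFFD, which at least turns a silent mismatch into a detectable error.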

The solution to the problem is to stop generating garbage strings that aren't even legal. They can generate the same amount of random data (preserving entropy), but convert it to Unicode via an encoding like base64 which is guaranteed to produce a legal string.
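
A minimal sketch of that approach, assuming a Windows C++ setting (again, the customer's actual code isn't shown): draw 32 bytes (256 bits) from the system cryptographic RNG and base64-encode them, producing a password that is always a valid, pure-ASCII string.

```cpp
// Sketch: random bytes from the system cryptographic RNG, base64-encoded so
// the password is always a valid string. Link with bcrypt.lib and crypt32.lib.
// Error handling elided for brevity.
#include <windows.h>
#include <bcrypt.h>
#include <wincrypt.h>
#include <string>
#include <vector>

std::wstring GenerateRandomPassword()
{
    std::vector<BYTE> bytes(32); // 256 bits of entropy
    BCryptGenRandom(nullptr, bytes.data(), (ULONG)bytes.size(),
                    BCRYPT_USE_SYSTEM_PREFERRED_RNG);

    DWORD chars = 0; // first call: ask for the required buffer size
    CryptBinaryToStringW(bytes.data(), (DWORD)bytes.size(),
        CRYPT_STRING_BASE64 | CRYPT_STRING_NOCRLF, nullptr, &chars);
    std::wstring password(chars, L'\0');
    CryptBinaryToStringW(bytes.data(), (DWORD)bytes.size(),
        CRYPT_STRING_BASE64 | CRYPT_STRING_NOCRLF, password.data(), &chars);
    password.resize(chars); // second call returns the length without the terminating null
    return password;
}
```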

Comments (20)
  1. florian says:

    Another (vaguely related) tricky thing is that substitution of invalid UTF-8 sequences with U+FFFD REPLACEMENT CHARACTER during conversion from UTF-8 to UTF-16 expands their length from 1 to 3 bytes when converting back to UTF-8.

    This may undermine optimistic buffer size assumptions and result in clipped strings if relying on the cached original UTF-8 length when converting the string back to UTF-8.
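
    For instance, a small sketch of the effect (using the lenient default conversions):

    ```cpp
    // Sketch: one invalid UTF-8 byte (a lone trail byte, 0x80) becomes the
    // single code point U+FFFD, which re-encodes as three UTF-8 bytes
    // (EF BF BD), so the "same" string grows from 1 byte to 3.
    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        const char bad[] = "\x80";                     // 1 invalid UTF-8 byte
        wchar_t wide[4];
        int n = MultiByteToWideChar(CP_UTF8, 0, bad, 1, wide, 4);  // U+FFFD
        char back[8];
        int m = WideCharToMultiByte(CP_UTF8, 0, wide, n, back, 8,
                                    nullptr, nullptr);
        printf("%d byte in, %d bytes out\n", 1, m);    // "1 byte in, 3 bytes out"
        return 0;
    }
    ```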

  2. Gee Law says:

    I yelled out the surrogate issue as soon as I read past the second sentence.

    1. florian says:

      Well, using lone or swapped surrogates is the only way to generate invalid UTF-16 sequences.

      For UTF-8, on the other hand, there are invalid lead bytes (those initiating actually allowed 2- to 4-byte sequences but resulting in code points beyond U+10FFFF, and those initiating completely invalid 5- to 8-byte sequences), wrong numbers of trail bytes following certain lead bytes, trail bytes with value ranges not allowed after certain lead bytes, or lone trail bytes. Not only may such sequences result in invalid code points in the narrower sense, but also in overlong forms, or code points from the surrogate range (which are reserved for UTF-16).

      That’s why programming with UTF-16 seems simple and light-weight compared to programming with UTF-8. Just my personal opinion.

    2. Good for you. You waited till the second paragraph. As soon as I read the title, I thought “Password complexity policy”.

      1. Quick D. McGraw says:

        It took me only up to the 4th word in the title to guess it was garbage-in/garbage-out… which was confirmed by the second-to-last word in the title.

  3. Joshua says:

    Another solution is to do what Cygwin finally had to break down and do–use WTF-8 as an interchange format. Cygwin’s reason was because Windows will permit filenames to contain invalid UTF-16, and since Unix programs have to deal with invalid UTF-8 in filenames anyway, handling of such filenames in Windows worked.

    1. Brian says:

      I got a kick out of “WTF-8”. Seems like a Freudian slip; I guess it’s the encoding mechanism that you use when you don’t really care if it works.

      1. Joshua says:

        I think it’s intentional acronym collision.

      2. Chrissielein says:

        Well, actually there is a WTF-8 encoding. https://simonsapin.github.io/wtf-8/

    2. Cesar says:

      This is also done by the Rust language. It uses UTF-8 for strings, but for talking with the operating system (mostly filesystem paths), it has a separate string type. On most operating systems, it’s an array of bytes (which might or might not be valid UTF-8); on Windows, it uses the WTF-8 encoding. In both cases, if the string is valid UTF-16 (on Windows) or UTF-8 (everywhere else), these “OS strings” will be valid UTF-8 (WTF-8 maps valid UTF-16 to valid UTF-8, and uses something else for invalid UTF-16).

  4. Jon says:

    Another solution is to cut back on the complexity and increase the length.

    You could scale all the way back to letters, numbers and normal punctuation, make the password 4 times the length, and it will still be unbreakable in reasonable time.

    1. GL says:

      You seem to have repeated the last sentence of the blog post…

  5. Nico says:

    This is such a terrible approach to the problem — it’s just begging for problems at every level. I can only assume that anyone claiming that they require the full unicode character space in order for their passwords to be secure really has no clue what they’re talking about and should be kept away from anything related to security.

  6. Keith Patrick says:

    “Create a random password generator” used to be an interview question my company would ask almost everyone. It was hugely surprising to me how challenging the question was to a large percentage of people, and it didn’t even get as deep as this post. Just the simple concept of generating random numbers (even a single seed wasn’t a deal breaker!) and creating alphanumerics from that weeded out a TON of candidates. But there were some somewhat out-of-box, less-than-ideal solutions, too: hashing a username and/or base64-encoding it, or returning part of a GUID without the dashes. (My favorite solution is “Use ASP.Net’s automatic generator”, since it demonstrates some deeper knowledge of the underlying library we were using.)

  7. Remy Lebeau says:

    Isn’t the real problem that the password generator was producing *code units* instead of *code points*? Had it been producing the latter instead of the former, then it could have encoded the generated values in whatever UTF it needed, and ended up with a valid string that round trips correctly when re-encoded between UTFs.

    1. I wouldn’t be surprised if a defective string didn’t normalize properly, though. E.g., starting with a combining character.

      1. florian says:

        Not sure; all I know is that no unassigned code points should be passed to normalization routines, but I’m not sure about broken combining character sequences.

    2. florian says:

      Only “scalar” code points (U+0000 to U+D7FF and U+E000 to U+10FFFF) will round-trip properly; the 2048 surrogate code points still have to be excluded, as they’re reserved exclusively for use with UTF-16.

      1. DWalker07 says:

        Right, but if you generate UTF-16LE “code points” instead of “code units”, the result should survive a round trip through UTF-8.

        1. But will it survive a round trip through Unicode normalization?

Comments are closed.
