UTF8 Security and Whidbey Changes

Unicode is always in the process of evolving, and some changes have been made to UTF8 in the last few versions.

The UTF-8 algorithm is fairly simple, but a few clarifications are important for security reasons. The primary one is the requirement that non-shortest form UTF-8 must not be permitted. The more recent requirement is that encodings of individual or out-of-order surrogate halves must not be permitted.

Both of these requirements have their basis in security. The general understanding with any encoding is that if a character can be encoded in more than one form, it becomes more likely that malicious or malformed data can disrupt software or expose security problems in it.

The illegal non-shortest forms of UTF-8 would allow the same character to be encoded in several different ways, which is why they are prohibited. For example, if a character such as / could be encoded as a 2-, 3-, or 4-byte sequence, then software trying to disallow such characters might miss them, depending on how and when it does the validation. Additionally, if you took “shawn” and allowed another user to be named “shawn” with multi-byte encodings of each character, that name could slip past a blacklist or whitelist of users, and the user could be erroneously allowed or denied access.
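As a rough illustration of the point above, here is a small Python sketch (Python is used only for illustration; it is not the .Net implementation). It lists the 2-, 3-, and 4-byte non-shortest forms of "/" and shows that Python's strict decoder rejects all of them. The helper names decodes_strictly and naive_decode_2byte are made up for this example; the second one is a deliberately careless decoder that skips the shortest-form check, to show how a "/" can slip past a filter that only looks for the byte 0x2F.

```python
# Overlong (non-shortest) encodings of "/" (U+002F).
# The only legal UTF-8 encoding of "/" is the single byte 0x2F.
OVERLONG_SLASH = [
    b"\xc0\xaf",          # 2-byte form
    b"\xe0\x80\xaf",      # 3-byte form
    b"\xf0\x80\x80\xaf",  # 4-byte form
]


def decodes_strictly(data: bytes) -> bool:
    """Return True if data is accepted by Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def naive_decode_2byte(data: bytes) -> str:
    """Hypothetical careless decoder: combines the payload bits of a
    2-byte sequence without checking that the result needed 2 bytes."""
    lead, trail = data
    return chr(((lead & 0x1F) << 6) | (trail & 0x3F))


for seq in OVERLONG_SLASH:
    # A byte-level filter looking for 0x2F finds nothing in these sequences.
    print(seq, "contains 0x2F:", b"/" in seq,
          "| strict decoder accepts:", decodes_strictly(seq))

# The careless decoder happily turns the overlong form back into "/".
print(repr(naive_decode_2byte(b"\xc0\xaf")))
```

A validator that checks bytes before decoding, paired with a decoder that is as lenient as naive_decode_2byte, is exactly the combination the non-shortest form rule is meant to rule out.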

Windows or .Net applications that process text internally as native UTF-16 Unicode are less likely to experience this problem than some other platforms. Some operating systems and applications do their processing entirely in UTF-8 bytes, which makes this a bigger problem for them. Nevertheless, even well-behaved .Net or Windows apps may be dealing with clients, other applications, or libraries that are susceptible to these issues. [13 July 2007: Note that on Vista and other OSes with MS07-040, .Net 2.0 changes illegal code points to U+FFFD instead of dropping them, to comply with Unicode 5's security.]

Broken surrogates are illegal in Unicode, so they introduce ambiguity when they appear in Unicode data. Again, they could be used to create strings that appear identical but are not, particularly when applications ignore the bad data. Broken surrogates could be signs of bad data transmission or storage errors. They could also indicate internal bugs in applications or intentional efforts to find security problems. For example, a string could be parsed or truncated without regard for surrogate pairs, which could split a pair and leave a broken surrogate behind.
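To make "broken surrogate" concrete, here is a minimal Python sketch that scans a sequence of UTF-16 code units for unpaired or out-of-order surrogate halves. The function name find_broken_surrogates is invented for this example, and the input is modeled as a plain list of integers rather than any particular string type.

```python
def find_broken_surrogates(units):
    """Return the indexes of unpaired surrogate code units
    in a sequence of UTF-16 code units (integers)."""
    broken = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High (lead) surrogate: must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both halves
                continue
            broken.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low (trail) surrogate with no high surrogate before it.
            broken.append(i)
        i += 1
    return broken


# U+1D11E (musical symbol G clef) is the pair D834 DD1E in UTF-16.
text = [0x0041, 0xD834, 0xDD1E, 0x0042]

print(find_broken_surrogates(text))              # well-formed input
print(find_broken_surrogates(text[:2]))          # truncated in the middle of the pair
print(find_broken_surrogates([0xDD1E, 0xD834]))  # halves in the wrong order
```

The second call shows the truncation case: cutting the data after two code units splits the pair and leaves a lone high surrogate at index 1.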

.Net v2.0 (Whidbey) has been updated so that these illegal broken surrogates will be ignored or cause exceptions, depending on how the UTF8Encoding was created.
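In .Net the choice between the two behaviors is made when the UTF8Encoding is constructed (the throwOnInvalidBytes constructor argument). The Python sketch below is only an analogy to that choice, not the .Net API: Python selects the behavior per call through the errors argument. The helper encode_strict is a name made up for this example.

```python
# A string containing a high surrogate with no low surrogate after it.
lone_high = "a\ud834b"


def encode_strict(s):
    """Encode as UTF-8, returning None if the string holds a broken surrogate.
    Roughly analogous to an encoding object configured to throw."""
    try:
        return s.encode("utf-8")
    except UnicodeEncodeError:
        return None


# Strict mode refuses to encode the broken surrogate.
strict_result = encode_strict(lone_high)

# "ignore" mode silently drops it, roughly analogous to the non-throwing case.
ignored_result = lone_high.encode("utf-8", errors="ignore")

print(strict_result)
print(ignored_result)
```

Throwing is usually the safer default for data that crosses a trust boundary, because silently dropping the bad code unit changes the string without telling the caller.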

It's strongly recommended that applications use the UTF8 APIs exposed by UTF8Encoding or MultiByteToWideChar() instead of “rolling their own.” That lets applications rely on a conforming UTF8 implementation and reduces the chance of nonconforming behavior between applications or as Unicode versions change.

It is worth noting that some applications have been taking random 16-bit data, such as the result of an encryption operation, and encoding it as UTF8. Random 16-bit data isn’t necessarily legal Unicode, even if it is stored in a .Net char[]. Therefore encoding or decoding it with any Encoding or code page could run into problems with illegal data sequences or edge cases. If you need to convert binary or encrypted data to Unicode, choose a technique that generates valid Unicode :)
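One technique that always produces valid Unicode is Base64, which maps arbitrary bytes onto a small set of ASCII characters. The post does not name a specific technique, so treat this as one possible choice rather than the recommended one. The Python sketch below first shows arbitrary bytes failing to decode as UTF-16, then round-trips random bytes through Base64 text.

```python
import base64
import os

# Arbitrary bytes reinterpreted as UTF-16 can contain a lone surrogate:
# 0xD800 (a high surrogate) followed by "A" instead of a low surrogate.
bad_utf16 = b"\x00\xd8\x41\x00"
try:
    bad_utf16.decode("utf-16-le")
    reinterpret_ok = True
except UnicodeDecodeError:
    reinterpret_ok = False

# Stand-in for encrypted output: 32 random bytes.
secret = os.urandom(32)

# Base64 yields plain ASCII text, which is valid in any Unicode encoding.
as_text = base64.b64encode(secret).decode("ascii")

# The text survives a UTF-8 round trip and decodes back to the same bytes.
round_tripped = base64.b64decode(as_text.encode("utf-8").decode("utf-8"))

print(reinterpret_ok)
print(as_text)
print(round_tripped == secret)
```

The cost is size: Base64 text is about a third larger than the bytes it represents, which is usually an acceptable trade for data that has to travel as a string.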