UTF-16, UTF-8 & UTF-32 update to conform with Unicode 5.0's security concerns.

My post "Change to Unicode Encoding for Unicode 5.0 conformance" now applies to .Net 2.0 with MS07-040 applied.  The update comes with a list of known issues; see KB 931212 for that list.  KB 940521 describes this behavior in particular.  Unicode 5.0 specifies the change because of security concerns: it reduces the chance that a string containing invalid sequences can be used to spoof a similar-looking string.

As mentioned in the KB:

Before this change, invalid characters in the middle of text strings would only be silently removed. For example, the string "Ad\xD800min\xDC00istrator" would change to "Administrator" as the Unicode characters U+D800 and U+DC00 are invalid. This could cause a security problem for some programs. After you install the security update MS07-040, this string would now become "Ad\xFFFDmin\xFFFDistrator", and decode to "Ad�min�istrator" where the � is the Unicode replacement character.
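As a rough illustration of the difference, here is a small sketch in Python rather than .Net.  It is only an analogy: Python's "ignore" and "replace" error handlers stand in for the old and new decoder behavior, and the variable names are made up for this example.

```python
import struct

# UTF-16 code units for "Ad" + U+D800 + "min" + U+DC00 + "istrator".
# The two surrogates are unpaired, so the sequence is not valid UTF-16.
code_units = (
    [ord(c) for c in "Ad"] + [0xD800]
    + [ord(c) for c in "min"] + [0xDC00]
    + [ord(c) for c in "istrator"]
)
raw = struct.pack("<%dH" % len(code_units), *code_units)

# Analogue of the old behavior: invalid code units silently vanish.
silently_dropped = raw.decode("utf-16-le", errors="ignore")

# Analogue of the new behavior: each invalid code unit becomes U+FFFD.
replaced = raw.decode("utf-16-le", errors="replace")

print(silently_dropped)                     # Administrator
print(replaced == "Ad\ufffdmin\ufffdistrator")  # True
```

With the lenient handler the result compares equal to "Administrator", which is exactly the spoofing risk; with the replacing handler it does not.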

We first introduced this behavior in Vista, and since then I've received several reports of issues with it.  In nearly all of those cases, flawed assumptions contributed to the problem.  Some examples:

  • Programs converting byte[] arrays to Unicode strings (see Avoid treating binary data as a string) and then having problems when the data didn't round trip.  Note that the data didn't round trip before this change either; data was lost silently.  After the change the loss is more obvious because the U+FFFDs are present (which is the point of the security aspect of the Unicode Consortium's change).
  • Doing something like that and then hashing the resulting value.  After the update the hash doesn't match.  Note that even before the update a very large number of distinct inputs produced the same hash, so this was not nearly as secure as the application had hoped.
  • Some applications had bugs in their handling of Unicode, such as accidentally decoding extra byte(s) instead of pairs, which produced illegal UTF-16 or UTF-8.  The illegal sequences used to be ignored, so the app worked despite the bug; after the update the bug is no longer hidden.
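The first two cases can be sketched in a few lines.  Again this is Python used as an analogy, not the .Net API, and the byte values and names are arbitrary choices for illustration.  It shows binary data failing to round trip through a string, two different inputs collapsing to the same string (and therefore the same hash), and Base64 as one way to carry binary data as text without loss.

```python
import base64
import hashlib

# Two different binary blobs, neither of which is valid UTF-8.
blob_a = bytes([0x41, 0xFF, 0x42, 0xFE, 0x43])
blob_b = bytes([0x41, 0x80, 0x42, 0x81, 0x43])

# Treating the bytes as text: every invalid byte becomes U+FFFD.
text_a = blob_a.decode("utf-8", errors="replace")
text_b = blob_b.decode("utf-8", errors="replace")

# The original bytes cannot be recovered from the string.
round_tripped = text_a.encode("utf-8")
print(round_tripped == blob_a)  # False

# Different inputs collapse to the same string, so hashes of it collide.
hash_a = hashlib.sha256(text_a.encode("utf-8")).hexdigest()
hash_b = hashlib.sha256(text_b.encode("utf-8")).hexdigest()
print(hash_a == hash_b)  # True

# Base64 keeps binary data intact when it has to travel as text.
encoded = base64.b64encode(blob_a).decode("ascii")
print(base64.b64decode(encoded) == blob_a)  # True
```

Hashing the original bytes directly, rather than a string decoded from them, avoids the collision entirely.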

Note that before the update .Net 2.0 on Vista and .Net 2.0 RTM had different Unicode decoding behavior.  With the update applied they have the same behavior.

Hope this is helpful,

Shawn