Change to Unicode Encoding for Unicode 5.0 conformance

The behavior of UTF8Encoding, UnicodeEncoding, and UTF32Encoding has changed in Windows Vista to conform better to the Unicode 5.0 requirements for Unicode encodings. [23 July 2007: This change has now also been made to .Net 2.0 with the MS07-040 update applied.  See the list of known issues for MS07-040 described in KB 931212.  KB 940521 describes this behavior in particular.]


In .Net Framework V2.0 RTM we chose to respect the Unicode 4.1 standard, which disallowed passing illegal UTF code points, by dropping any bad data that was encountered.  We considered that this behavior would have the minimal impact on existing applications.


Since the .Net Framework 2.0 was released, the latest Unicode 5.0 specification has become stricter.  There was a concern that just ignoring invalid bytes could let hostile data through, because invalid characters would be dropped and so an invalid string could become valid.  The new requirement for Unicode 5.0 is that bad bytes cannot be dropped, so we are now replacing them with U+FFFD, the Unicode Replacement Character, in Windows Vista and future versions of the .Net Framework, including the .Net Framework 2.0 on Vista and .Net 2.0 with the MS07-040 update applied.


The new default behavior is equivalent to setting the replacement fallbacks to "\xFFFD" instead of the empty string.  If applications prefer the old behavior, they can create their UTF-8 encoding with an EncoderReplacementFallback("") and a DecoderReplacementFallback(""), causing the fallbacks to drop the bad data.
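As a rough sketch of the two behaviors, the snippet below decodes the same bytes once with a U+FFFD replacement fallback and once with empty-string fallbacks.  It assumes the Encoding.GetEncoding(name, encoderFallback, decoderFallback) overload as the way to attach the fallbacks; the class and method names (FallbackDemo, DecodeReplacing, DecodeDropping) are made up for illustration.

```csharp
using System;
using System.Text;

static class FallbackDemo
{
    // Builds a UTF-8 encoding whose encoder and decoder fallbacks
    // both substitute the given replacement string for bad data.
    static Encoding Utf8With(string replacement)
    {
        return Encoding.GetEncoding(
            "utf-8",
            new EncoderReplacementFallback(replacement),
            new DecoderReplacementFallback(replacement));
    }

    // New-style behavior: bad bytes become U+FFFD.
    public static string DecodeReplacing(byte[] data)
    {
        return Utf8With("\uFFFD").GetString(data);
    }

    // Old-style behavior: bad bytes are silently dropped.
    public static string DecodeDropping(byte[] data)
    {
        return Utf8With("").GetString(data);
    }

    static void Main()
    {
        // 'A', an invalid byte (0xFF never appears in UTF-8), 'B'
        byte[] data = { 0x41, 0xFF, 0x42 };

        Console.WriteLine(DecodeReplacing(data) == "A\uFFFDB"); // True
        Console.WriteLine(DecodeDropping(data));                // AB
    }
}
```

Only opt into the dropping variant if you really depend on the old output, since it brings back the problem the change was meant to address.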


Because of the +- and other oddities of UTF-7, it’s generally considered insecure anyway for similar reasons, and UTF-8 is generally preferred.


FWIW:  My recommendation is that applications shouldn’t make trust decisions on encoded data.  This goes for the other code page encodings as well as Unicode.  Encoding and decoding data can cause it to change its form.  (See “Best Fit in WideCharToMultiByte and System.Text.Encoding Should be Avoided” for one example.)  If your application needs to make sure that an input string doesn’t include C:\windows, it should do the validation after decoding the data.  I’ll probably blog more about this later.
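Here is a small, hypothetical illustration of why the check belongs after decoding.  The byte sequence, the forbidden path check, and the names (ValidateAfterDecode, RawBytesContain, IsAllowed) are all invented for the example; it uses empty-string fallbacks to mimic the old dropping behavior.

```csharp
using System;
using System.Text;

static class ValidateAfterDecode
{
    const string Forbidden = @"C:\windows";

    // Naive scan of the still-encoded bytes for a byte pattern.
    public static bool RawBytesContain(byte[] data, byte[] pattern)
    {
        for (int i = 0; i + pattern.Length <= data.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && data[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return true;
        }
        return false;
    }

    // Decodes UTF-8 with fallbacks that drop bad bytes (old behavior).
    public static string DecodeDropping(byte[] data)
    {
        return Encoding.GetEncoding(
            "utf-8",
            new EncoderReplacementFallback(""),
            new DecoderReplacementFallback("")).GetString(data);
    }

    // The validation, applied to the string the application will actually use.
    public static bool IsAllowed(string decoded)
    {
        return decoded.IndexOf(Forbidden, StringComparison.OrdinalIgnoreCase) < 0;
    }

    static void Main()
    {
        // "C:\win" + invalid byte 0xFF + "dows"
        byte[] hostile = { 0x43, 0x3A, 0x5C, 0x77, 0x69, 0x6E, 0xFF,
                           0x64, 0x6F, 0x77, 0x73 };

        // A check on the encoded bytes sees nothing wrong...
        Console.WriteLine(RawBytesContain(hostile, Encoding.ASCII.GetBytes(Forbidden))); // False

        // ...but once the bad byte is dropped, the forbidden path appears.
        string decoded = DecodeDropping(hostile);
        Console.WriteLine(decoded);            // C:\windows
        Console.WriteLine(IsAllowed(decoded)); // False
    }
}
```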


’til then,

