Evil encoding configuration

This post is the final one for the XmlWriter and StreamWriter encoding series (see parts one and two first).

If you ran the original example, you'll have noticed that the last combination of encodings we tried produced this output:

Stream Encoding: (no stream)
  XML Encoding:  Norwegian (IA5)

Unhandled Exception: System.Text.EncoderFallbackException: Unable to translate Unicode character \u0023 at index 55 to specified code page.
   at System.Text.EncoderExceptionFallbackBuffer.Fallback(Char charUnknown, Int32 index)
   at System.Xml.CharEntityEncoderFallbackBuffer.Fallback(Char charUnknown, Int32 index)
   at System.Text.EncoderFallbackBuffer.InternalFallback(Char ch, Char*& chars)
   at System.Text.SBCSCodePageEncoding.GetBytes(Char* chars, Int32 charCount, Byte* bytes, Int32 byteCount, EncoderNLS encoder)
   at System.Text.EncoderNLS.Convert(Char* chars, Int32 charCount, Byte* bytes, Int32 byteCount, Boolean flush, Int32& charsUsed, Int32& bytesUsed, Boolean& completed)
   at System.Text.EncoderNLS.Convert(Char[] chars, Int32 charIndex, Int32 charCount, Byte[] bytes, Int32 byteIndex, Int32 byteCount, Boolean flush, Int32& charsUsed, Int32& bytesUsed, Boolean& complet
   at System.Xml.XmlEncodedRawTextWriter.EncodeChars(Int32 startOffset, Int32 endOffset, Boolean writeAllToStream)
   at System.Xml.XmlEncodedRawTextWriter.FlushBuffer()
   at System.Xml.XmlEncodedRawTextWriter.Flush()
   at System.Xml.XmlWellFormedWriter.Flush()
   at Cs.Cs.WriteXml(XmlWriter xmlWriter)
   at Cs.Cs.WriteEncodedXml(Encoding streamEncoding, Encoding xmlEncoding, Stream stream)

This is very much an edge case, and it takes a couple of minutes to figure out what's going on.

Let's start from the code that produces this problem:

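  // Ask for the encoding by name, and make both directions throw
  // instead of silently substituting characters.
  Encoding muhaha = Encoding.GetEncoding(
    "x-IA5-Norwegian",
    new EncoderExceptionFallback(),
    new DecoderExceptionFallback());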

This encoding is built up as follows. First, it names the x-IA5-Norwegian encoding. Then it specifies that if a character cannot be mapped to this encoding, an exception should be thrown. Typically you can configure an encoding to fall back to a best-fit character or a generic '?' character, but depending on your system this may be the wrong thing to do: for example, you can no longer reliably round-trip data. So, to be safe, we're setting the encoding to fail whenever a character has no mapping.
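To make the trade-off between the two fallback styles concrete, here is a minimal sketch. It uses us-ascii rather than x-IA5-Norwegian so that it runs without any extra code page setup; the class and method names (FallbackDemo, EncodeStrict, LossyRoundTrip) are made up for illustration and are not part of the original sample.

```csharp
using System.Text;

static class FallbackDemo
{
    // Strict: any character without a mapping raises EncoderFallbackException.
    public static byte[] EncodeStrict(string text)
    {
        Encoding strict = Encoding.GetEncoding(
            "us-ascii",
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());
        return strict.GetBytes(text);
    }

    // Lossy: any character without a mapping is silently replaced by '?',
    // so the original text cannot be recovered afterwards.
    public static string LossyRoundTrip(string text)
    {
        Encoding lossy = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback("?"),
            new DecoderReplacementFallback("?"));
        return lossy.GetString(lossy.GetBytes(text));
    }
}
```

Passing "caf\u00E9" to LossyRoundTrip gives back "caf?", while EncodeStrict throws for the same input. The second behavior is the one our evil encoding asks for.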

Now, XML has a pretty nifty way of dealing with characters that cannot be directly encoded: character references. This allows you to write &#x9; instead of a tab character, for example. So even if the encoding doesn't support a character, XML can still represent it using this escape hatch.
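You can watch the escape hatch at work with an encoding that is easier to come by. The sketch below writes a single element through an XmlWriter configured for ASCII; the class name CharRefDemo, the method WriteAsAscii, and the element name "t" are placeholders chosen for this example.

```csharp
using System.IO;
using System.Text;
using System.Xml;

static class CharRefDemo
{
    // Writes <t>text</t> through an ASCII-encoded XmlWriter and returns the raw output.
    public static string WriteAsAscii(string text)
    {
        var settings = new XmlWriterSettings
        {
            Encoding = Encoding.ASCII,
            OmitXmlDeclaration = true
        };
        using (var stream = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                writer.WriteElementString("t", text);
            }
            // The writer has been disposed (and so flushed) by this point.
            return Encoding.ASCII.GetString(stream.ToArray());
        }
    }
}
```

Calling WriteAsAscii("caf\u00E9") should produce the element with the accented letter written as the character reference &#xE9; rather than throwing or degrading to '?'.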

And there's the rub. x-IA5-Norwegian is one of the extremely rare encodings that doesn't have the '#' character in its repertoire! So when the XML writer sees a character that's not in the encoding (I used '#' itself for extra irony points), it tries to write a character reference instead, and then the encoder fails again on the '#' inside the reference. At that point, the writer gives up and allows the exception to bubble out unhandled, which in our simple program just prints the exception to the console.
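If you'd rather find this out before writing any XML, one option is to check up front whether an encoding can represent the reference syntax itself. This is only a sketch of that idea: ReferenceSyntaxCheck and CanWriteCharacterReferences are hypothetical names, and the character list assumes hexadecimal references of the form shown in the stack trace above.

```csharp
using System.Text;

static class ReferenceSyntaxCheck
{
    // Characters that a hexadecimal reference such as &#x23; is built from.
    private const string ReferenceChars = "&#x;0123456789ABCDEF";

    // Returns true if the encoding can encode every character needed to
    // spell out a character reference, with no substitution allowed.
    public static bool CanWriteCharacterReferences(Encoding encoding)
    {
        // Clone first: the built-in encoding instances are read-only.
        Encoding strict = (Encoding)encoding.Clone();
        strict.EncoderFallback = new EncoderExceptionFallback();
        try
        {
            strict.GetBytes(ReferenceChars);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }
}
```

For Encoding.ASCII this returns true. Going by the behavior described in this post, passing Encoding.GetEncoding("x-IA5-Norwegian") is the interesting case to try; note that on newer .NET runtimes that encoding may only be available after registering the code pages encoding provider.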


Comments (3)
  1. bruce says:

    I think your post may have broken the msdn blogs rss feed – ROME breaks with an invalid xml character exception. Try parsing http://blogs.msdn.com/MainFeed.aspx and you’ll see what I mean..

    Apologies if I’m wrong.

  2. @bruce, I gave this a quick try and IE was happy with it, although a warning did pop up in a later line.

    Now it’s gone from the feed page, two other feed readers have been happy with the .RSS (can’t say about the .aspx page though).

    Let me know if this is still a problem and I’ll look into this…

  3. GreenNet says:

    I don’t have any problems with it…
