XmlWriter, encodings and BOM

Today I want to talk about XmlWriter and the generation of a Byte Order Mark (BOM).

XmlWriter provides an API that generates, unsurprisingly, XML. This XML will typically end up as a managed string of characters or possibly a sequence of bytes. Of course, text transformed into bytes implies an encoding, as previously discussed.
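To make the two targets concrete, here is a minimal sketch that writes the same tiny document once into a managed string and once into a byte array. The class and method names (XmlTargets, ToManagedString, ToBytes) are made up for illustration; only the XmlWriter, XmlWriterSettings, StringBuilder and MemoryStream calls are real framework APIs.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlTargets {
  // Writes <hello>#1</hello> into a managed string via a StringBuilder.
  public static string ToManagedString() {
    StringBuilder builder = new StringBuilder();
    using (XmlWriter writer = XmlWriter.Create(builder)) {
      writer.WriteElementString("hello", "#1");
    }
    return builder.ToString();
  }

  // Writes the same XML as bytes; here the chosen encoding decides the layout.
  public static byte[] ToBytes(Encoding encoding) {
    XmlWriterSettings settings = new XmlWriterSettings();
    settings.Encoding = encoding;
    using (MemoryStream stream = new MemoryStream()) {
      using (XmlWriter writer = XmlWriter.Create(stream, settings)) {
        writer.WriteElementString("hello", "#1");
      }
      return stream.ToArray();
    }
  }

  public static void Main() {
    Console.WriteLine(ToManagedString());
    Console.WriteLine("UTF-8 bytes:  " + ToBytes(Encoding.UTF8).Length);
    Console.WriteLine("UTF-16 bytes: " + ToBytes(Encoding.Unicode).Length);
  }
}
```

The string version never has to commit to a byte layout; the byte version cannot avoid it, which is why the same document comes out at different lengths.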

Now XML has its own ways of determining the encoding of a document: a parser can peek at the first bytes that make up the opening <?xml declaration or, more explicitly, read the encoding attribute of that declaration.
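As a rough illustration of the "peeking" part, the sketch below guesses an encoding family from the first four bytes of a document that starts with <?xml and has no BOM. It is loosely modeled on the autodetection appendix of the XML 1.0 specification and covers only three cases; XmlSniffer, StartsWith and GuessFamily are hypothetical names, not framework APIs, and a real parser does considerably more.

```csharp
using System;
using System.Text;

public static class XmlSniffer {
  // Returns true when data begins with the given byte pattern.
  private static bool StartsWith(byte[] data, params byte[] prefix) {
    if (data.Length < prefix.Length) return false;
    for (int i = 0; i < prefix.Length; i++) {
      if (data[i] != prefix[i]) return false;
    }
    return true;
  }

  // Guesses the encoding family from how the characters "<?" or "<?xm" are laid out.
  public static string GuessFamily(byte[] head) {
    if (StartsWith(head, 0x3C, 0x00, 0x3F, 0x00)) return "UTF-16, little-endian";
    if (StartsWith(head, 0x00, 0x3C, 0x00, 0x3F)) return "UTF-16, big-endian";
    if (StartsWith(head, 0x3C, 0x3F, 0x78, 0x6D)) return "ASCII-compatible (read the declaration)";
    return "unknown";
  }

  public static void Main() {
    string declaration = "<?xml version=\"1.0\"?>";
    // Encoding.GetBytes does not emit a BOM, so these start directly with '<'.
    Console.WriteLine(GuessFamily(Encoding.ASCII.GetBytes(declaration)));
    // prints: ASCII-compatible (read the declaration)
    Console.WriteLine(GuessFamily(Encoding.Unicode.GetBytes(declaration)));
    // prints: UTF-16, little-endian
    Console.WriteLine(GuessFamily(Encoding.BigEndianUnicode.GetBytes(declaration)));
    // prints: UTF-16, big-endian
  }
}
```

In the ASCII-compatible case the first bytes only narrow things down; the encoding attribute in the declaration is what settles the exact encoding.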

Unicode is used for all sorts of purposes, not just XML encoding, and so it has its own mechanism, the Byte Order Mark, to distinguish between little-endian and big-endian encodings, which determine which byte comes first in UTF-16 and UTF-32. A BOM is also allowed in UTF-8, for that matter, although there it acts only as a signature, since UTF-8 has no byte-order ambiguity.
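To see what these marks look like in practice, here is a small sketch that prints the preamble (the BOM, if any) that a few of the framework's encodings report through Encoding.GetPreamble. BomDemo and PreambleHex are illustrative names of my own; the encodings and GetPreamble are the real framework members.

```csharp
using System;
using System.Text;

public static class BomDemo {
  // Formats an encoding's preamble as hex, e.g. "EF-BB-BF"; empty when there is none.
  public static string PreambleHex(Encoding encoding) {
    return BitConverter.ToString(encoding.GetPreamble());
  }

  public static void Main() {
    Console.WriteLine("UTF-8:            " + PreambleHex(Encoding.UTF8));              // EF-BB-BF
    Console.WriteLine("UTF-8, no BOM:    " + PreambleHex(new UTF8Encoding(false)));    // (empty)
    Console.WriteLine("UTF-16 little:    " + PreambleHex(Encoding.Unicode));           // FF-FE
    Console.WriteLine("UTF-16 big:       " + PreambleHex(Encoding.BigEndianUnicode));  // FE-FF
    Console.WriteLine("UTF-32 little:    " + PreambleHex(Encoding.UTF32));             // FF-FE-00-00
  }
}
```

The two UTF-16 marks are the same character, U+FEFF, written in the two byte orders, which is exactly what lets a reader tell them apart.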

How do these mechanisms interact when using the .NET Framework classes? Let's write some code!

First, we'll write a short helper method to display the contents of a byte array.

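// These snippets assume: using System; using System.IO; using System.Text; using System.Xml;
private static void ShowBuffer(string linePrefix, byte[] bytes, long length) {
  // Print the first 'length' bytes as hex, 16 bytes per line.
  int bytesOnLine = 0;
  for (long i = 0; i < length; i++) {
    if (bytesOnLine == 0) {
      Console.Write(linePrefix);
    }

    Console.Write("{0:X2} ", bytes[i]);
    bytesOnLine++;
    if (bytesOnLine >= 16) {
      Console.WriteLine();
      bytesOnLine = 0;
    }
  }
}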

Next, let's write a method to write out some short XML.

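private static void WriteXml(XmlWriter xmlWriter) {
  // A single element with text content: <hello>#1</hello>
  xmlWriter.WriteStartElement("hello");
  xmlWriter.WriteString("#1");
  xmlWriter.WriteEndElement();
  // Flush so the output reaches the underlying stream before we measure it.
  xmlWriter.Flush();
}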

We'll try different combinations of layering an XmlWriter with some encoding over a StreamWriter with a different encoding (or directly over a stream) to see what happens. These two methods will help us out.

private static long WriteEncodedXml(
    Encoding streamEncoding,
    Encoding xmlEncoding,
    Stream stream) {
  XmlWriterSettings settings = new XmlWriterSettings();
  settings.Encoding = xmlEncoding;
  settings.Indent = false;

  if (streamEncoding != null) {
    // Layer the XmlWriter over a StreamWriter that has its own encoding.
    using (StreamWriter writer = new StreamWriter(stream, streamEncoding))
    using (XmlWriter xmlWriter = XmlWriter.Create(writer, settings)) {
      WriteXml(xmlWriter);
      // Read the length before the using block ends: disposing the
      // StreamWriter closes the stream.
      return stream.Length;
    }
  } else {
    // A null streamEncoding means the XmlWriter writes directly to the stream.
    using (XmlWriter xmlWriter = XmlWriter.Create(stream, settings)) {
      WriteXml(xmlWriter);
      return stream.Length;
    }
  }
}

private static void ShowXmlEncoding(
    Encoding streamEncoding,
    Encoding xmlEncoding) {
  Console.WriteLine("Stream Encoding: " +
    ((streamEncoding == null) ?
    "(no stream)" : streamEncoding.EncodingName));
  Console.WriteLine("  XML Encoding:  " + xmlEncoding.EncodingName);

  MemoryStream stream = new MemoryStream();
  long length = WriteEncodedXml(streamEncoding, xmlEncoding, stream);
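  // GetBuffer returns the whole internal buffer, which may be larger than
  // what was written, so the real length is passed along separately.
  byte[] bytes = stream.GetBuffer();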
  ShowBuffer("  ", bytes, length);
  Console.WriteLine();
}

Finally, here is the method to drive it all.

public static void Main(string[] args) {
  // First encoding is for stream writer, second is XML writer.
  ShowXmlEncoding(null, Encoding.UTF8);
  ShowXmlEncoding(null,
    new UTF8Encoding(/* encoderShouldEmitUTF8Identifier */ false));
  ShowXmlEncoding(null, Encoding.Unicode);
  ShowXmlEncoding(null, Encoding.BigEndianUnicode);

  ShowXmlEncoding(Encoding.ASCII, Encoding.Unicode);

  // Muhaha.
  Encoding muhaha = Encoding.GetEncoding(
    "x-IA5-Norwegian",
    new EncoderExceptionFallback(),
    new DecoderExceptionFallback());
  ShowXmlEncoding(null, muhaha);
}

You can run this now and see what comes up. Tomorrow, a short analysis of some interesting results.

Enjoy!