Kirk Evans Blog

.NET From a Markup Perspective

Postel again? Writing Well-Formed XML with XmlWriter

Recently, I was forced to reconsider my stance on Postel’s law again. I was provided an “XML” feed that was not well-formed, and I did not have the luxury of simply rejecting it, despite the multiple well-formedness violations it contained.

Side note:  it is a pet peeve of mine when people make quote signs in the air.

Specifically, the “XML” contained several characters that XML 1.0 explicitly forbids, like x008 and x016. Instead of throwing an exception, as is done in the Customized XML Writer Creation topic on MSDN, I had to process the “XML” and write portions of it to a new stream, ensuring that the output is well-formed. The solution? A custom XmlWriter that overrides WriteString. The override uses a regular expression to find any characters that are not in the Char production of XML 1.0. Each character outside that production is escaped with a character entity, and the text around it is written unchanged.

using System;
using System.IO;
using System.Xml;
using System.Text;
using System.Text.RegularExpressions;

namespace XmlAdvice
{
    /// <summary>
    /// An XmlTextWriter that strives to ensure the generated markup is well-formed.
    /// </summary>
    public class XmlWellFormedTextWriter : System.Xml.XmlTextWriter
    {
        // Matches characters outside the XML 1.0 Char production.
        // .NET's \x escape takes exactly two hex digits, so anything above
        // 0xFF must be written with \u. Surrogates are deliberately left out:
        // valid pairs encode [#x10000-#x10FFFF], and WriteCharEntity throws
        // an ArgumentException when handed a surrogate.
        private static readonly Regex _regex =
            new Regex(@"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]");

        /// <summary>
        /// Creates an instance of the XmlWellFormedTextWriter
        /// </summary>
        /// <param name="stream">The stream to write to.</param>
        /// <param name="encoding">The encoding for the stream (typically UTF8).</param>
        public XmlWellFormedTextWriter(Stream stream, Encoding encoding)
            : base(stream, encoding)
        {
        }

        /// <summary>
        /// Replaces any occurrence of characters not within the production
        /// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        /// with a character entity.
        /// </summary>
        /// <param name="text">The text to write to the output.</param>
        public override void WriteString(string text)
        {
            // Regex.Match rejects null, so hand null or empty text straight to the base class.
            if (string.IsNullOrEmpty(text))
            {
                base.WriteString(text);
                return;
            }

            int idx = 0;
            for (Match m = _regex.Match(text); m.Success; m = m.NextMatch())
            {
                // Write the clean run before the offending character, then escape it.
                if (m.Index > idx)
                {
                    base.WriteString(text.Substring(idx, m.Index - idx));
                }
                WriteCharEntity(text[m.Index]);
                idx = m.Index + 1;
            }

            // Write whatever follows the last match (the whole string if nothing matched).
            if (idx < text.Length)
            {
                base.WriteString(text.Substring(idx));
            }
        }
    }
}

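Update: Dare Obasanjo corrected me. This example does not actually help create a well-formed XML document; it only postpones the problem until the XML document is parsed again. A character reference such as &#x8; must itself refer to a character in the Char production, so a conforming parser rejects the escaped form just as it would reject the raw character.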
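To make the update concrete, here is a minimal sketch of the failure and of one possible workaround. It is not from the original post or from Dare's correction: the names XmlCharSanitizer, Sanitize, and IsLegalXmlChar are hypothetical, and replacing forbidden characters with U+FFFD (rather than dropping them or rejecting the feed) is purely an assumption made for illustration.

```csharp
using System;
using System.Text;
using System.Xml;

// Hypothetical helper, not part of the original post.
public static class XmlCharSanitizer
{
    // True for single UTF-16 code units that are legal XML 1.0 characters.
    // Characters in [#x10000-#x10FFFF] arrive as surrogate pairs and are
    // handled separately in Sanitize.
    static bool IsLegalXmlChar(char c)
    {
        return c == '\t' || c == '\n' || c == '\r'
            || (c >= '\u0020' && c <= '\uD7FF')
            || (c >= '\uE000' && c <= '\uFFFD');
    }

    // Assumption: forbidden characters are replaced with U+FFFD.
    public static string Sanitize(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (i + 1 < text.Length && char.IsSurrogatePair(c, text[i + 1]))
            {
                // Keep a valid surrogate pair intact and skip its second half.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            else if (IsLegalXmlChar(c))
            {
                sb.Append(c);
            }
            else
            {
                // Control characters, lone surrogates, U+FFFE and U+FFFF end up here.
                sb.Append('\uFFFD');
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // The escaped form is still rejected when the document is read back.
        try
        {
            new XmlDocument().LoadXml("<root>a&#x8;b</root>");
            Console.WriteLine("parsed");
        }
        catch (XmlException ex)
        {
            Console.WriteLine("rejected: " + ex.Message);
        }

        // Text sanitized before writing survives a round trip through the parser.
        var doc = new XmlDocument();
        doc.LoadXml("<root>" + Sanitize("a\u0008b") + "</root>");
        Console.WriteLine(doc.DocumentElement.InnerText.Length);
    }
}
```

Whether substituting U+FFFD is acceptable depends on what the consumer of the feed expects. The sketch only shows that the repair has to remove or replace the forbidden character, because escaping it does not make it legal.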