Working around invalid characters in XML


Recently I’ve received a question about invalid/illegal/restricted characters in XML, and I thought I’d share some thoughts.


To begin with, the definition of characters in the XML 1.0 spec can be found at http://www.w3.org/TR/2006/REC-xml-20060816/#charsets. The definition is pretty straightforward, although what’s left out is as interesting as what’s being defined.
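
As a rough illustration of that definition, here is a minimal sketch of the Char production written as a plain range check. The helper names isXmlChar and findInvalidXmlChar are made up for this post and are not part of MSXML or any other library; the ranges are taken from the XML 1.0 spec linked above, and the surrogate handling assumes the text is held as UTF-16 code units, as JScript strings are.

```javascript
// Hypothetical helper: true if codePoint matches the XML 1.0 Char production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
function isXmlChar(codePoint) {
  return codePoint === 0x9 || codePoint === 0xA || codePoint === 0xD ||
    (codePoint >= 0x20 && codePoint <= 0xD7FF) ||
    (codePoint >= 0xE000 && codePoint <= 0xFFFD) ||
    (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
}

// Hypothetical helper: index of the first UTF-16 code unit that is not
// allowed in XML 1.0 text, or -1 if every character is allowed.
function findInvalidXmlChar(text) {
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    // A well-formed surrogate pair encodes a code point in
    // [#x10000-#x10FFFF], which is always allowed, so skip both halves.
    if (code >= 0xD800 && code <= 0xDBFF && i + 1 < text.length) {
      var low = text.charCodeAt(i + 1);
      if (low >= 0xDC00 && low <= 0xDFFF) {
        i++;
        continue;
      }
    }
    // Everything else, including a lone surrogate, is checked directly.
    if (!isXmlChar(code)) {
      return i;
    }
  }
  return -1;
}
```

With these definitions, findInvalidXmlChar("<howdy>\x01</howdy>") returns 7, the position of the control character, while findInvalidXmlChar("<howdy></howdy>") returns -1.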


For example, a NULL character (\0) is not a valid XML character. Neither are most of the other control characters in the lower end of the ASCII range; tab, line feed and carriage return are the only exceptions. You can try the following JScript for fun.


// Parses the given text as XML and reports any parse error.
function echoXmlTextErrors(xmlText) {
  WScript.Echo("Parsing: " + xmlText);
  var doc = createDocumentFromText(xmlText);
  echoDocumentErrors(doc);
}

// Loads the text synchronously into an MSXML 6.0 DOM document.
function createDocumentFromText(xmlText) {
  var result = new ActiveXObject("Msxml2.DOMDocument.6.0");
  result.async = false;
  result.loadXML(xmlText);
  return result;
}

// An errorCode of 0 means the document parsed cleanly.
function echoDocumentErrors(document) {
  if (document.parseError.errorCode == 0) {
    WScript.Echo("no errors found while parsing");
  } else {
    WScript.Echo("error: " + document.parseError.reason);
  }
}

echoXmlTextErrors("<howdy></howdy>");
// "\1" is the control character U+0001, which XML 1.0 does not allow.
echoXmlTextErrors("<howdy>\1</howdy>");

// Output:
// Parsing: <howdy></howdy>
// no errors found while parsing
// Parsing: <howdy>?</howdy>
// error: An invalid character was found in text content.


So, what can you do if you need to record data that includes characters in that range? Well, there are a number of options available, but a commonly used one is the hexBinary representation, as defined in http://www.w3.org/TR/xmlschema-2/#hexBinary. At that point, you’ll be dealing with an octet representation rather than characters, which lets the XML content play nicely with the standards and with different XML stacks.
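
To make the hexBinary idea a bit more concrete, here is a minimal sketch of an encoder. Both byteToHex and toHexBinary are hypothetical helpers written for this post, not MSXML APIs. The sketch assumes you want the UTF-8 bytes of the string, which is only one possible choice of octets; whoever reads the value back needs to agree on the same encoding.

```javascript
var HEX_DIGITS = "0123456789ABCDEF";

// Hypothetical helper: formats one octet (0-255) as two hex digits.
function byteToHex(b) {
  return HEX_DIGITS.charAt(b >> 4) + HEX_DIGITS.charAt(b & 15);
}

// Hypothetical helper: encodes a string as the hexBinary form of its
// UTF-8 bytes (assumption: UTF-8 is the agreed octet representation).
function toHexBinary(text) {
  var out = "";
  for (var i = 0; i < text.length; i++) {
    var cp = text.charCodeAt(i);
    // Combine a well-formed surrogate pair into a single code point.
    if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.length) {
      var low = text.charCodeAt(i + 1);
      if (low >= 0xDC00 && low <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        i++;
      }
    }
    if (cp < 0x80) {
      // One byte: control characters such as U+0001 land here.
      out += byteToHex(cp);
    } else if (cp < 0x800) {
      out += byteToHex(0xC0 | (cp >> 6)) +
             byteToHex(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      // Note: a lone surrogate also takes this branch and is encoded as-is.
      out += byteToHex(0xE0 | (cp >> 12)) +
             byteToHex(0x80 | ((cp >> 6) & 0x3F)) +
             byteToHex(0x80 | (cp & 0x3F));
    } else {
      out += byteToHex(0xF0 | (cp >> 18)) +
             byteToHex(0x80 | ((cp >> 12) & 0x3F)) +
             byteToHex(0x80 | ((cp >> 6) & 0x3F)) +
             byteToHex(0x80 | (cp & 0x3F));
    }
  }
  return out;
}
```

For instance, toHexBinary("a\x01b") returns "610162", which contains only characters that any XML parser will accept as element content.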


Ideally you wouldn’t allow such control characters to be inserted in places where only text is expected, but if you’re not in a position to do that, a representation that more accurately reflects the fact that “hey, this might not actually be stuff you can render as text” will probably minimize headaches down the line.
