Text Encoding and WCF

For the most part, setting the encoding (UTF-8, UTF-16, etc.) on a WCF service is pretty simple:  set textEncoding on basicHttpBinding/wsHttpBinding, or writeEncoding on a textMessageEncoding element in a customBinding (a minimal sketch follows the list below).  This works in almost all cases, but there’s a surprising number of ways of indicating the encoding of a message:

  • the content type of the message
  • an XML declaration at the start of the message
  • a BOM (byte order mark) that is the first 2 or 3 bytes (depending on encoding) of the message
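
For reference, here is what the simple case looks like in code.  This is a minimal sketch using the standard binding APIs (BasicHttpBinding.TextEncoding and TextMessageEncodingBindingElement.WriteEncoding); nothing here is specific to this post:

    // the simple case: set the text encoding directly on the binding
    var basic = new BasicHttpBinding
    {
        TextEncoding = Encoding.Unicode  // UTF-16, little endian
    };

    // the customBinding equivalent, using the stock text message encoder
    var custom = new CustomBinding(
        new TextMessageEncodingBindingElement
        {
            WriteEncoding = Encoding.UTF8
        },
        new HttpTransportBindingElement());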

Next, an encoding of UTF-16 doesn’t even specify a single encoding:  it could be either big endian or little endian.

We can glean a few rules from RFC 3023 - XML Media Types:

  1. The encoding in the content type takes precedence over any other encoding indication.  If the content type says UTF-8 but the XML declaration says UTF-16, the message should actually be UTF-8.  For example, a router could receive a UTF-16 message and reencode it as UTF-8 to save bandwidth.  The router would change the content type but would leave the text of the message, including the XML declaration, unchanged.
  2. When the encoding is specified as UTF-16, there must be a BOM (see the sketch after this list).
  3. When the encoding is specified as UTF-16LE or UTF-16BE, there must not be a BOM.
  4. An XML declaration is optional in the presence of other indications of encoding.
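
To make rules 2 and 3 concrete, here is a small sketch showing which UnicodeEncoding variants emit a BOM.  The BOM bytes themselves are standard:  FF FE for UTF-16LE and FE FF for UTF-16BE:

    // UnicodeEncoding(bigEndian, byteOrderMark): the second flag controls the BOM
    var utf16 = new UnicodeEncoding(false, true);    // "utf-16" needs a BOM (rule 2)
    var utf16le = new UnicodeEncoding(false, false); // "utf-16le", no BOM (rule 3)
    var utf16be = new UnicodeEncoding(true, false);  // "utf-16be", no BOM (rule 3)

    Console.WriteLine(BitConverter.ToString(utf16.GetPreamble())); // FF-FE
    Console.WriteLine(utf16le.GetPreamble().Length);               // 0
    Console.WriteLine(utf16be.GetPreamble().Length);               // 0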

The TextMessageEncoder that ships with WCF is a little too strict in a couple of ways:

  1. When the encoding is specified as UTF-16, it requires an XML declaration.
  2. The content type and the encoding specified in the XML declaration must be consistent.

This can lead to interoperability problems.  Other web service clients can produce messages that violate those WCF restrictions, and the result is a 400 Bad Request with no additional information.  To work around this we will need a custom message encoder.

First we can start with the CustomTextMessageEncoder sample.  That sample allows you to specify any .NET supported encoding for an endpoint instead of just UTF-8 and UTF-16, but it doesn’t do anything about the restrictions above.  Additionally, it requires you to pick a single encoding when deploying your endpoint.

The new text message encoder I’ll show will accept UTF-8 and UTF-16 encoded messages and work around those restrictions.  I’ll focus on the message encoder since that is where the action happens, but there is additional boilerplate necessary:

  • a MessageEncoderFactory (called by the framework to create instances of the MessageEncoder)
  • a MessageEncodingBindingElement (used to plug the encoder into the channel stack and to let the framework create instances of the MessageEncoderFactory)
  • a BindingElementExtensionElement (optional; allows the encoder to be used in config)

The CustomTextMessageEncoder sample has excellent examples of each of these; a sketch of the factory follows.
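
As an illustration, here is a minimal factory sketch modeled on that sample.  The class names (ImprovedTextMessageEncoder, ImprovedTextMessageEncoderFactory) are my own placeholders, not types from the sample:

    class ImprovedTextMessageEncoderFactory : MessageEncoderFactory
    {
        private readonly MessageEncoder encoder;
        private readonly MessageVersion messageVersion;

        public ImprovedTextMessageEncoderFactory(MessageVersion messageVersion)
        {
            this.messageVersion = messageVersion;
            // the encoder developed in the rest of this post
            this.encoder = new ImprovedTextMessageEncoder(messageVersion);
        }

        public override MessageEncoder Encoder
        {
            get { return this.encoder; }
        }

        public override MessageVersion MessageVersion
        {
            get { return this.messageVersion; }
        }
    }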

Let’s start with ReadMessage:

    public override Message ReadMessage(ArraySegment<byte> buffer, 
                        BufferManager bufferManager, string contentType)
    {
        this.contentType = contentType;

        byte[] msgContents = new byte[buffer.Count];
        Array.Copy(buffer.Array, 
                    buffer.Offset, msgContents, 0, msgContents.Length);
        bufferManager.ReturnBuffer(buffer.Array);
        // most interoperable to include the xml declaration
        this.writerSettings.OmitXmlDeclaration = false;
        // save the encoding for when we write the response
        this.writerSettings.Encoding = GetEncoding(contentType, msgContents);

        Encoding xmlDeclEncoding = GetXmlDeclEncoding(
                                    writerSettings.Encoding, msgContents);

        // xml declaration encoding doesn't match, need to reencode
        if (xmlDeclEncoding != null && 
            xmlDeclEncoding.WebName != this.writerSettings.Encoding.WebName)
        {
            msgContents = Encoding.Convert(
                            this.writerSettings.Encoding, 
                            xmlDeclEncoding, msgContents);
        }

        MemoryStream stream = new MemoryStream(msgContents);
        XmlReader reader = XmlReader.Create(stream);
        return Message.CreateMessage(reader, maxSizeOfHeaders, MessageVersion);
    }

This does a couple of things.  It uses some helper functions to determine the actual encoding of the message and stores that in the XmlWriterSettings that will be used for the response.  Unfortunately, no implementation of XmlReader that ships with .NET will ignore the encoding specified in the XML declaration, so if the content type and XML declaration disagree, the message needs to be reencoded to match the XML declaration.  Finally, XmlDictionaryReader is the source of the restrictions above, so we use XmlReader.Create to avoid them.
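
To see why the reencoding step is needed, here is a small standalone sketch of the failure it avoids.  The bytes are UTF-8 but the declaration claims utf-16, and XmlReader throws as soon as it reads past the declaration:

    // bytes are UTF-8, but the xml declaration claims utf-16
    byte[] bytes = Encoding.UTF8.GetBytes(
        "<?xml version=\"1.0\" encoding=\"utf-16\"?><root/>");

    using (XmlReader reader = XmlReader.Create(new MemoryStream(bytes)))
    {
        try
        {
            reader.Read();  // throws: the reader can't switch to the declared encoding
        }
        catch (XmlException ex)
        {
            Console.WriteLine(ex.Message);
        }
    }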

Now let’s look at the encoding helper functions:

    Encoding GetXmlDeclEncoding(Encoding contentTypeEncoding, byte[] message)
    {
        // get a possible xml declaration from the start of the message;
        // don't read past the end of short messages
        int count = Math.Min(message.Length, 100);
        string contents = contentTypeEncoding.GetString(message, 0, count);

        // capture the encoding from the xml declaration
        string pattern = 
            "<\\?xml\\sversion=\"1.0\"\\sencoding=\"(?<encoding>[\\w-]+)\"";

        Match m = Regex.Match(contents, pattern, RegexOptions.ExplicitCapture);

        if (m.Groups["encoding"].Success)
        {
            return Encoding.GetEncoding(m.Groups["encoding"].Value);
        }

        return null;
    }

GetXmlDeclEncoding tries to find an encoding pseudo-attribute in a possible XML declaration at the beginning of the message.  Since the message must be encoded as specified in the content type, we use that encoding to decode the byte array.  XmlReader provides ways to determine the values specified in the XML declaration, but as mentioned above it will throw when it discovers that the specified encoding doesn’t match the actual encoding, so a regular expression match is used instead.  We decode at most the first 100 bytes to minimize overhead.
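
For example (an illustrative test, not part of the encoder itself), feeding it a UTF-8 message returns the declared encoding, and a message with no declaration returns null:

    byte[] msg = Encoding.UTF8.GetBytes(
        "<?xml version=\"1.0\" encoding=\"utf-8\"?><s:Envelope/>");
    Encoding declared = GetXmlDeclEncoding(new UTF8Encoding(), msg);
    Console.WriteLine(declared.WebName);  // utf-8

    byte[] bare = Encoding.UTF8.GetBytes("<s:Envelope/>");
    Console.WriteLine(GetXmlDeclEncoding(new UTF8Encoding(), bare) == null);  // True

GetEncoding, shown next, applies the charset rules from above: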

    Encoding GetEncoding(string contentType, byte[] message)
    {
        Encoding encoding;
        string charSet = new ContentType(contentType).CharSet.ToLower();
        if (charSet == "utf-8")
        {
            encoding = new UTF8Encoding();
        }
        else if (charSet == "utf-16")
        {
            if (message[0] == 0xff && message[1] == 0xfe)
            {
                encoding = new UnicodeEncoding(false, true);
            }
            else if (message[0] == 0xfe && message[1] == 0xff)
            {
                encoding = new UnicodeEncoding(true, true);
            }
            else
            {
                throw new InvalidOperationException("No byte order mark " + 
                    "detected.  The byte order mark is required when " +
                    "charset=utf-16.");
            }
        }
        else if (charSet == "utf-16le")
        {
            encoding = new UnicodeEncoding(false, false);
        }
        else if (charSet == "utf-16be")
        {
            encoding = new UnicodeEncoding(true, false);
        }
        else
        {
            // this could be replaced by a call to Encoding.GetEncoding
            throw new InvalidOperationException("Unrecognized charset: " 
                        + charSet);
        }

        return encoding;
    }

GetEncoding is a fairly straightforward implementation of the rules above.  For undifferentiated UTF-16 it checks the BOM and creates the matching encoding, configured to write a BOM.  For UTF-16LE/BE it creates the proper encoding and suppresses the BOM on write (on read it doesn’t check whether a BOM is present, making it a bit looser than the RFC requires).  This only recognizes UTF-8 and UTF-16, but it could be modified to call Encoding.GetEncoding to accept any supported encoding.
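
If you want that looser behavior, the final else branch could call Encoding.GetEncoding instead of throwing.  A sketch (GetFallbackEncoding is a hypothetical helper name):

    // a looser variant of the final branch: let .NET resolve any IANA charset name
    Encoding GetFallbackEncoding(string charSet)
    {
        // throws ArgumentException if the charset is unrecognized
        return Encoding.GetEncoding(charSet);
    }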

Finally, let’s look at writing a message:

    public override ArraySegment<byte> WriteMessage(Message message, 
                    int maxMessageSize, BufferManager bufferManager, 
                    int messageOffset)
    {
        MemoryStream stream = new MemoryStream();
        XmlWriter writer = XmlWriter.Create(stream, this.writerSettings);
        message.WriteMessage(writer);
        writer.Close();

        byte[] messageBytes = stream.GetBuffer();
        int messageLength = (int)stream.Position;
        stream.Close();

        int totalLength = messageLength + messageOffset;
        byte[] totalBytes = bufferManager.TakeBuffer(totalLength);
        Array.Copy(messageBytes, 0, totalBytes, messageOffset, messageLength);

        ArraySegment<byte> byteArray = 
            new ArraySegment<byte>(totalBytes, messageOffset, messageLength);
        return byteArray;
    }

    public override void WriteMessage(Message message, Stream stream)
    {
        XmlWriter writer = XmlWriter.Create(stream, this.writerSettings);
        message.WriteMessage(writer);
        writer.Close();
    }

These are actually identical to the WriteMessage implementations in the CustomTextMessageEncoder sample.  They simply use the writerSettings populated by ReadMessage to write out the message.

Using this encoder requires a CustomBinding.  It is possible to set it up in code:

    var messageEncoding = new ImprovedTextMessageBindingElement
    {
        // any message version is supported
        MessageVersion = MessageVersion.Soap12
    };

    var transport = new HttpTransportBindingElement();

    var binding = new CustomBinding(messageEncoding, transport);

Or in config (this assumes the BindingElementExtensionElement has been registered under bindingElementExtensions so the improvedTextMessageEncoding element is recognized):

    <customBinding>
        <binding>
            <improvedTextMessageEncoding />
            <httpTransport />
        </binding>
    </customBinding>
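
To close the loop, here is a minimal hosting sketch using the code-based binding built above.  MyService and IMyService are hypothetical placeholders for your own service and contract:

    // hypothetical service types; substitute your own contract and implementation
    var host = new ServiceHost(typeof(MyService),
                    new Uri("http://localhost:8000/improved"));
    host.AddServiceEndpoint(typeof(IMyService), binding, "");
    host.Open();
    // the endpoint now tolerates the content type / xml declaration
    // mismatches described above instead of returning 400 Bad Request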