How to remove invalid characters from incoming XML messages using a custom pipeline component

 

BizTalk Server is usually used as a message bus between different platforms and applications. In most scenarios, BizTalk stores and processes messages in XML format. Internally, it calls the .NET XML library to do the job, so it follows the W3C XML standard when validating messages and checking encoding.

However, in some scenarios, messages from other platforms or applications do not strictly follow the W3C XML standard and may include invalid characters. These characters can cause BizTalk message processing to fail at various stages, for example in XML Receive pipelines or during message mapping. Below is a sample error message you may find in the BizTalk event log:

“Unable to read the stream produced by the pipeline. Details: '', hexadecimal value 0x0F, is an invalid character.”
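
To see where the "Details" part of this message comes from, the same parser check can be triggered outside BizTalk. The following is a minimal sketch, assuming the pipeline fails on the same System.Xml character check; the class name InvalidCharDemo and the method TryParse are hypothetical names used only for illustration.

```csharp
using System;
using System.Xml;

public static class InvalidCharDemo
{
    // Returns the parser's error message, or null if the text parses cleanly.
    public static string TryParse(string xml)
    {
        try
        {
            var doc = new XmlDocument();
            doc.LoadXml(xml);
            return null;
        }
        catch (XmlException ex)
        {
            // For a control character such as 0x0F, the parser reports
            // an invalid-character error here.
            return ex.Message;
        }
    }
}
```

Calling InvalidCharDemo.TryParse("<a>\u000F</a>") should return a non-null error message, while a clean document such as "<a>ok</a>" returns null.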

 

Note: According to the W3C XML 1.0 recommendation, a legal XML character is any character in the following range:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

For more detail, please refer to:

https://www.w3.org/TR/REC-xml/#charsets
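
The Char production above can be written directly as a range check on a Unicode code point. This is a small sketch for illustration only; XmlCharRange and IsValidXmlCodePoint are hypothetical names, not part of BizTalk or the .NET library.

```csharp
public static class XmlCharRange
{
    // Mirrors the W3C Char production for a single Unicode code point.
    public static bool IsValidXmlCodePoint(int cp)
    {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```

Note that .NET strings are UTF-16, so a code point above 0xFFFF is stored as a surrogate pair of two char values. When walking a string, char.ConvertToUtf32(text, index) can be used to obtain the code point for such a pair before calling the check.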

 

Generally speaking, the proper way to solve this problem is to fix the source system so that it stops sending invalid characters in XML messages. Unfortunately, this is not always possible, because the source system may not be under our control and it may be hard to persuade its owner to spend money on the fix. In that case, one workaround is to write a custom pipeline component that removes the invalid characters before BizTalk processes the message.

Below is a sample implementation of the Execute method of a pipeline component. The sample goes through the following stages:

· Read the original message stream into text

· Use a regular expression to remove invalid characters

· Add the Unicode byte order mark (BOM)

· Assign the new stream to the message

Note: This sample uses Unicode (UTF-16) encoding. You can use UTF-8 instead, or even implement design-time support to switch between the supported encodings.
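
As a sketch of how the encoding could be made switchable, the BOM and the encoded text can both be taken from a System.Text.Encoding instance instead of being hard-coded. XmlEncodingHelper and ToStreamWithBom are hypothetical names introduced here for illustration; they are not part of the BizTalk API.

```csharp
using System.IO;
using System.Text;

public static class XmlEncodingHelper
{
    // Encodes the text with the given encoding, prefixed by that
    // encoding's byte order mark (if the encoding defines one).
    public static MemoryStream ToStreamWithBom(string text, Encoding encoding)
    {
        var ms = new MemoryStream();
        byte[] preamble = encoding.GetPreamble();
        ms.Write(preamble, 0, preamble.Length);
        byte[] body = encoding.GetBytes(text);
        ms.Write(body, 0, body.Length);
        ms.Position = 0; // rewind so the next reader starts at the BOM
        return ms;
    }
}
```

Under these assumptions, passing Encoding.Unicode produces the FF FE mark followed by UTF-16 text, and passing new UTF8Encoding(true) produces the EF BB BF mark followed by UTF-8 text.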

 

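        public IBaseMessage Execute(IPipelineContext pc, IBaseMessage inmsg)
        {
            IBaseMessagePart bodyPart = inmsg.BodyPart;
            if (bodyPart != null)
            {
                string oldXMLText;
                // Read the original message stream into text.
                // StreamReader assumes UTF-8 unless the stream starts with a BOM.
                using (StreamReader sr = new StreamReader(bodyPart.GetOriginalDataStream()))
                {
                    oldXMLText = sr.ReadToEnd();
                }

                // Remove characters outside the W3C Char range. .NET strings are UTF-16,
                // so #x10000-#x10FFFF appear as surrogate pairs: keep well-formed pairs
                // and strip unpaired surrogates together with the other invalid characters.
                string re = @"[\u0000-\u0008\u000B\u000C\u000E-\u001F\uFFFE\uFFFF]"
                          + @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])"
                          + @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
                string newXMLText = System.Text.RegularExpressions.Regex.Replace(oldXMLText, re, "");

                // Write the UTF-16 little-endian byte order mark (FF FE), then the text.
                byte[] bom = System.Text.Encoding.Unicode.GetPreamble();
                byte[] textb = System.Text.Encoding.Unicode.GetBytes(newXMLText);
                Stream newStr = new MemoryStream(bom.Length + textb.Length);
                newStr.Write(bom, 0, bom.Length);
                newStr.Write(textb, 0, textb.Length);
                newStr.Position = 0;

                // Assign the new stream to the message and let the pipeline dispose it.
                bodyPart.Data = newStr;
                pc.ResourceTracker.AddResource(newStr);
            }
            return inmsg;
        }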

Best Regards,

Bryan Yang