WCF bug with WS-AtomicTransaction expiration value

The team at Microsoft that handles WCF and WF also handles Transactions. This includes System.Transactions, DTC, and the WS-AtomicTransaction implementation available in WCF. Not too long ago I got a chance to diagnose a pretty strange customer issue. The customer reported getting the following exception when using WS-AT under load:

The incoming transaction cannot be deserialized. The transaction header in the message was either malformed or in an unrecognized format. The client and the service must be configured to use the same protocol and protocol version. The following exception occurred: The WS-AtomicTransaction header was invalid or corrupt.

The exception message above is usually received on the client side. On the server side, the exception looks like this:

 Microsoft.Transactions.Wsat.Messaging.InvalidCoordinationContextException: The coordination context was not understood ---> 
      System.Xml.XmlException: The value '3837244160' cannot be parsed as the type 'Int32'. ---> 
      System.OverflowException: Value was either too large or too small for an Int32.
   at System.Number.ParseInt32(String s, NumberStyles style, NumberFormatInfo info)
   at System.Xml.XmlConverter.ToInt32(String value)
   --- End of inner exception stack trace ---
   at System.Xml.XmlConverter.ToInt32(String value)
   at System.Xml.XmlConverter.ToInt32(Byte[] buffer, Int32 offset, Int32 count)
   at System.Xml.XmlBaseReader.ReadContentAsInt()
   at System.Xml.XmlDictionaryReader.ReadElementContentAsInt()
   at Microsoft.Transactions.Wsat.Messaging.CoordinationContext.ReadFrom(CoordinationContext that, 
      XmlDictionaryReader reader, XmlDictionaryString localName, XmlDictionaryString ns, ProtocolVersion 
      protocolVersion)
   --- End of inner exception stack trace ---
   at Microsoft.Transactions.Wsat.Messaging.CoordinationContext.ReadFrom(CoordinationContext that, 
      XmlDictionaryReader reader, XmlDictionaryString localName, XmlDictionaryString ns, ProtocolVersion 
      protocolVersion)
   at Microsoft.Transactions.Wsat.Messaging.CoordinationContext.ReadFrom(XmlDictionaryReader reader, 
      XmlDictionaryString localName, XmlDictionaryString ns, ProtocolVersion protocolVersion)
   at System.ServiceModel.Transactions.WsatTransactionHeader.GetCoordinationContext
      (XmlDictionaryReader reader, ProtocolVersion protocolVersion)

The first thing to notice from the exception above is that the value indicated, 3837244160, is indeed greater than Int32.MaxValue. What is not obvious is where this value comes from. With WCF diagnostic tracing enabled on both the client and server side, we can see the actual messages, which may point us to the source of this exception.
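As an aside, message traces like the one shown below come from the standard System.ServiceModel trace source written through an XmlWriterTraceListener. A configuration sketch along these lines enables them; the App_tracelog.svclog file name matches what the repro code later in this post reads, though the exact switch values the customer used may have differed:

<system.diagnostics>
  <sources>
    <!-- Standard WCF trace source; the Information level includes the
         MessageSent trace records shown below. -->
    <source name="System.ServiceModel" switchValue="Information">
      <listeners>
        <add name="xml"
             type="System.Diagnostics.XmlWriterTraceListener"
             initializeData="App_tracelog.svclog" />
      </listeners>
    </source>
  </sources>
  <trace autoflush="true" />
</system.diagnostics>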

The test case here is a client starting a transaction, doing some database work in SQL Server, and then flowing the transaction to a WCF service that does some other work. The exception only happens when the machines are under stress. That makes capturing a useful diagnostic trace difficult because of the large size of the log: the processes can quickly become IO-bound waiting to write trace data to disk, especially in a performance environment. However, the customer was able to find the message from the client side that produced the exception. Here is a sample of what that message looks like:

<E2ETraceEvent xmlns="https://schemas.microsoft.com/2004/06/E2ETraceEvent">
  <System xmlns="https://schemas.microsoft.com/2004/06/windows/eventlog/system">
    <EventID>262164</EventID>
    <Type>3</Type>
    <SubType Name="Information">0</SubType>
    <Level>8</Level>
    <TimeCreated SystemTime="2010-08-05T19:02:35.3968885Z" />
    <Source Name="System.ServiceModel" />
    <Correlation ActivityID="{00000000-0000-0000-0000-000000000000}" />
    <Execution ProcessName="client" ProcessID="6284" ThreadID="5" />
    <Channel />
    <Computer>DMETZGAR1</Computer>
  </System>
  <ApplicationData>
    <TraceData>
      <DataItem>
        <TraceRecord xmlns="https://schemas.microsoft.com/2004/10/E2ETraceEvent/TraceRecord" Severity="Information">
          <TraceIdentifier>https://msdn.microsoft.com/en-US/library/System.ServiceModel.Channels.MessageSent.aspx</TraceIdentifier>
          <Description>Sent a message over a channel.</Description>
          <AppDomain>client.exe</AppDomain>
          <Source>System.ServiceModel.Channels.HttpOutput+WebRequestHttpOutput/13971778</Source>
          <ExtendedData xmlns="https://schemas.microsoft.com/2006/08/ServiceModel/MessageTraceRecord">
            <MessageProperties>
              <Encoder>application/soap+xml; charset=utf-8</Encoder>
              <AllowOutputBatching>False</AllowOutputBatching>
              <Via>https://localhost:8010/ServiceModelSamples/service</Via>
            </MessageProperties>
            <MessageHeaders>
              <Action d4p1:mustUnderstand="1" xmlns:d4p1="https://www.w3.org/2003/05/soap-envelope" xmlns="https://www.w3.org/2005/08/addressing">https://Microsoft.ServiceModel.Samples/ICalculator/Add</Action>
              <MessageID xmlns="https://www.w3.org/2005/08/addressing">urn:uuid:f6918cea-737f-49a1-91a9-c920ae53fbb6</MessageID>
              <ActivityId CorrelationId="556d49a0-4d33-47c9-b8a3-85c285902c1f" xmlns="https://schemas.microsoft.com/2004/09/ServiceModel/Diagnostics">59ee1a2d-155d-40c8-85d2-86e3da801b21</ActivityId>
              <ReplyTo xmlns="https://www.w3.org/2005/08/addressing">
                <Address>https://www.w3.org/2005/08/addressing/anonymous</Address>
              </ReplyTo>
              <CoordinationContext d4p1:mustUnderstand="1" xmlns:mstx="https://schemas.microsoft.com/ws/2006/02/transactions" xmlns:d4p1="https://www.w3.org/2003/05/soap-envelope" xmlns="https://schemas.xmlsoap.org/ws/2004/10/wscoor">
                <wscoor:Identifier xmlns:wscoor="https://schemas.xmlsoap.org/ws/2004/10/wscoor">urn:uuid:d4b5a924-775a-49ed-9923-8643d46bf62b</wscoor:Identifier>
                <Expires>3837244160</Expires>
                <CoordinationType>https://schemas.xmlsoap.org/ws/2004/10/wsat</CoordinationType>
                <RegistrationService>
                  <Address xmlns="https://schemas.xmlsoap.org/ws/2004/08/addressing">https://dmetzgar1/WsatService/Registration/Coordinator/Disabled/</Address>
                  <ReferenceParameters xmlns="https://schemas.xmlsoap.org/ws/2004/08/addressing">
                    <mstx:RegisterInfo>
                      <mstx:LocalTransactionId>d4b5a924-775a-49ed-9923-8643d46bf62b</mstx:LocalTransactionId>
                    </mstx:RegisterInfo>
                  </ReferenceParameters>
                </RegistrationService>
                <mstx:IsolationLevel>0</mstx:IsolationLevel>
                <mstx:LocalTransactionId>d4b5a924-775a-49ed-9923-8643d46bf62b</mstx:LocalTransactionId>
                <PropagationToken xmlns="https://schemas.microsoft.com/ws/2006/02/tx/oletx"></PropagationToken>
              </CoordinationContext>
              <To d4p1:mustUnderstand="1" xmlns:d4p1="https://www.w3.org/2003/05/soap-envelope" xmlns="https://www.w3.org/2005/08/addressing">https://localhost:8010/ServiceModelSamples/service</To>
            </MessageHeaders>
          </ExtendedData>
        </TraceRecord>
      </DataItem>
    </TraceData>
  </ApplicationData>
</E2ETraceEvent>

Here we can see that the Expires property is the one with the invalid value. This is a little strange because the expiration value isn't really something that is supposed to change. First, let's try to understand why the outgoing message can have an Expires value greater than Int32.MaxValue when the service is unable to read it. According to the WS-Coordination schema, Expires is an unsigned integer, which can hold the value highlighted above. This means that WCF is reading the Expires field as a signed integer. Reading the WCF code confirms that the WS-AT implementation is the culprit. As an aside, the OleTx code correctly reads the value as an unsigned integer, so this issue only affects WS-AT.
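To make the failure mode concrete, here is a minimal sketch using the public System.Xml.XmlConvert class (the internal XmlConverter in the stack trace isn't public, so XmlConvert stands in for it). The same string parses fine as an unsigned 32-bit integer but overflows as a signed one:

using System;
using System.Xml;

class ExpiresOverflowDemo
{
    static void Main()
    {
        const string expires = "3837244160";

        // The WS-Coordination schema types Expires as an unsigned int,
        // so this parse succeeds (UInt32.MaxValue is 4294967295).
        Console.WriteLine(XmlConvert.ToUInt32(expires));

        // Parsing the same text as a signed int, which is effectively
        // what the WS-AT reader does, throws.
        try
        {
            XmlConvert.ToInt32(expires);
        }
        catch (OverflowException e)
        {
            // "Value was either too large or too small for an Int32."
            Console.WriteLine(e.Message);
        }
    }
}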

So that's one bug found in the WCF code. But why does the Expires property get this random value in the first place? Even with the WCF code fixed, this would still be a problem: if the value can randomly be very high, it can also randomly be very low, causing transactions to time out when they're not supposed to. This is where many of these customer issues get tricky. It is expensive and time-consuming to reproduce the customer's environment inside Microsoft. Other options are to send someone to the customer's site or to capture traces. A memory dump had already been captured by the customer and did not reveal anything interesting. There is another kind of trace, called TTT or iDNA, that captures all the instructions and memory of a process for a specified period of time, essentially allowing you to step forward and backward through the process's execution. This slows the process down considerably, though, and most often will not work with performance/stress bugs like this one. Also, there are many processes involved: the client, DTC, SQL Server, and the service. Any one of them could have initiated the problem.

Luckily, a reproduction of the bug was easy to achieve. I modified the TransactionFlow example shipped in the WCF/WF samples to perform some SQL Server work on the client side before making calls to the calculator service. Here is the modified TransactionFlow sample code:

using System;
using System.IO;
using System.ServiceModel;
using System.Text.RegularExpressions;
using System.Threading;
using System.Transactions;

namespace Microsoft.ServiceModel.Samples
{
    // The service contract is defined in generatedClient.cs, generated from the service by the svcutil tool.
    // Client implementation code.
    class Client
    {
        const int NumThreads = 2;
        static Regex re = new Regex(@"<Expires>(\d+)</Expires>", RegexOptions.Compiled);

        static void Main()
        {
            Console.WriteLine("Press enter to start");
            Console.ReadLine();

            for (int j = 0; j < 1000; j++)
            {
                Console.WriteLine("Running iteration {0}", j);

                // Run NumThreads transactions simultaneously and wait for all of them to finish.
                WaitHandle[] events = new WaitHandle[NumThreads];
                for (int i = 0; i < NumThreads; i++)
                {
                    ThreadPool.QueueUserWorkItem(new WaitCallback(RunTx), events[i] = new ManualResetEvent(false));
                }
                WaitHandle.WaitAll(events);

                // Scan the diagnostic trace log for Expires values other than the expected 300000.
                using (StreamReader sr = new StreamReader(@"App_tracelog.svclog"))
                {
                    string s = sr.ReadToEnd();
                    MatchCollection matches = re.Matches(s);
                    foreach (Match match in matches)
                    {
                        uint u = uint.Parse(match.Groups[1].Value);
                        if (u != 300000)
                        {
                            Console.WriteLine("HEY!!!! Got something");
                            Console.WriteLine(u);
                            break;
                        }
                    }
                }

                File.Delete(@"App_tracelog.svclog");
            }

            Console.WriteLine("Press <ENTER> to terminate client.");
            Console.ReadLine();
        }

        static void RunTx(object obj)
        {
            ManualResetEvent mre = obj as ManualResetEvent;

            // Create a client using either wsat or oletx endpoint configurations
            CalculatorClient client = new CalculatorClient("WSAtomicTransaction_endpoint");

            // Start a transaction scope
            using (TransactionScope tx =
                        new TransactionScope(TransactionScopeOption.RequiresNew))
            {
                // Do some database work so SQL Server enlists in the transaction.
                DustinTestModel.DustinTestEntities context = new DustinTestModel.DustinTestEntities();
                context.ExampleTables.AddObject(new DustinTestModel.ExampleTable() { description = "Hello World" });
                context.SaveChanges();

                Console.WriteLine("Starting transaction");

                // Call the Add service operation
                // - generatedClient will flow the required active transaction
                double value1 = 100.00D;
                double value2 = 15.99D;
                double result = client.Add(value1, value2);
                Console.WriteLine(" Add({0},{1}) = {2}", value1, value2, result);

                // Call the Subtract service operation
                // - generatedClient will flow the allowed active transaction
                value1 = 145.00D;
                value2 = 76.54D;
                result = client.Subtract(value1, value2);
                Console.WriteLine(" Subtract({0},{1}) = {2}", value1, value2, result);

                // Start a transaction scope that suppresses the current transaction
                using (TransactionScope txSuppress =
                            new TransactionScope(TransactionScopeOption.Suppress))
                {
                    // Call the Subtract service operation
                    // - the active transaction is suppressed from the generatedClient
                    //   and no transaction will flow
                    value1 = 21.05D;
                    value2 = 42.16D;
                    result = client.Subtract(value1, value2);
                    Console.WriteLine(" Subtract({0},{1}) = {2}", value1, value2, result);

                    // Complete the suppressed scope
                    txSuppress.Complete();
                }

                // Call the Multiply service operation
                // - generatedClient will not flow the active transaction
                value1 = 9.00D;
                value2 = 81.25D;
                result = client.Multiply(value1, value2);
                Console.WriteLine(" Multiply({0},{1}) = {2}", value1, value2, result);

                // Call the Divide service operation
                // - generatedClient will not flow the active transaction
                value1 = 22.00D;
                value2 = 7.00D;
                result = client.Divide(value1, value2);
                Console.WriteLine(" Divide({0},{1}) = {2}", value1, value2, result);

                // Complete the transaction scope
                Console.WriteLine(" Completing transaction");
                tx.Complete();
            }

            Console.WriteLine("Transaction committed");

            // Closing the client gracefully closes the connection and cleans up resources
            client.Close();

            // Signal Main that this transaction has finished.
            mre.Set();
        }
    }
}

In the code above, notice that I am using a ManualResetEvent to coordinate the execution of two simultaneous invocations of RunTx, repeated in a loop of 1000 iterations. I found that just pushing 2000 RunTx executions into the thread pool did not reliably reproduce the error; getting the transactions to execute simultaneously was what mattered most. I also have diagnostic tracing enabled, and the code scans the log file for Expires values other than the expected 300000. The database operation on the client side is some simple Entity Framework code that adds a row to a table.

This code reproduces the error quite effectively. However, when a debugger is attached to the process to catch the exception, the error very rarely occurs: the debugger slows the process down enough that it no longer pushes the system. One way around this was to run two client processes simultaneously, one with the debugger attached and one without. Using this technique I was able to capture enough information to point the transactions team to the real problem.

Without the modifications to the TransactionFlow sample, the Expires value is set to 59904ms. With the modifications that call SQL Server, the Expires value changes to 300000ms. This means the timeout value comes from a different place when SQL Server is included in the transaction. Under load, this 300000ms value sometimes gets randomized. The transactions team isolated the cause to an uninitialized variable in the native transactions code in Windows, a problem that occurs only under stress.

The fix has already been made for future versions of Windows but is not included in Windows 7 or earlier. Likewise, the fix to WCF did not make it into the .NET 4.0 framework release but should be included in a future version. However, there are a couple of workarounds available should you run into this problem. One is to use OleTx instead of WS-AT, as mentioned before. The other is to force the transaction to promote to a distributed transaction before the call to SQL Server on the client side. That can be done with the following code:

byte[] propagationToken = TransactionInterop.GetTransmitterPropagationToken(Transaction.Current);

When this call is made, the expiration value should be 59904ms or whatever is set as the transaction timeout for the client process.
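For context, here is a minimal sketch of where that call fits relative to the database work, assuming the same RequiresNew scope used in the repro above. The returned token itself isn't needed; requesting it is what forces the promotion:

using System;
using System.Transactions;

class PromoteEarlyWorkaround
{
    static void Main()
    {
        using (TransactionScope tx = new TransactionScope(TransactionScopeOption.RequiresNew))
        {
            // Force promotion to a distributed (DTC) transaction before
            // SQL Server enlists. The token is discarded; asking for it
            // is the side effect we want.
            TransactionInterop.GetTransmitterPropagationToken(Transaction.Current);

            // ... SQL Server work and WCF calls that flow the transaction go here ...

            tx.Complete();
        }
    }
}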