First async call from WCF client always takes more than 1 second

This is one of the first issues I got to investigate on the WCF/WF performance team. Customers reported that (in .Net 3.5 and earlier), the first asynchronous call from a WCF client was taking over 1 second each time, even with synchronous calls happening before it. The problem can be easily reproduced by modifying the Asynchronous sample in the WCF/WF samples (found under WCF\Basic\Contract\Service\Asynchronous). First, let's replace the client.cs code with the following:

using System;

using System.Diagnostics;

using System.Threading;

namespace Microsoft.ServiceModel.Samples

{

    class Client

    {

        static Stopwatch stopWatch = new Stopwatch();

        static AutoResetEvent waitEvent = new AutoResetEvent(false);

        static void Main()

        {

            Console.WriteLine("Press enter to start test");

            Console.ReadLine();

            SyncTest();

            AsyncTest();

            AsyncTest();

           SyncTest();

        }

        static void SyncTest()

        {

            Console.WriteLine();

            Console.WriteLine("Sync test");

            Console.WriteLine("------------------------------------");

            Console.WriteLine();

         CalculatorClient client = new CalculatorClient();

            client.Open();

            MakeSyncCall(client);

            MakeSyncCall(client);

            client.Close();

        }

        static void AsyncTest()

        {

            Console.WriteLine();

            Console.WriteLine("Async test");

            Console.WriteLine("------------------------------------");

            Console.WriteLine();

            CalculatorClient client = new CalculatorClient();

            client.Open();

   MakeAsyncCall(client);

            MakeAsyncCall(client);

            client.Close();

        }

        static void MakeSyncCall(CalculatorClient client)

        {

            stopWatch.Start();

            double result = client.Add(1, 2);

     Console.WriteLine("Add Result: {0}", result);

            StopWatchStop(false);

        }

        static void MakeAsyncCall(CalculatorClient client)

        {

            stopWatch.Start();

            IAsyncResult arAdd = client.BeginAdd(1, 2, AddCallback, client);

            Console.WriteLine("BeginAdd");

            waitEvent.WaitOne();

        }

        static void StopWatchStop(bool set)

        {

            stopWatch.Stop();

            Console.WriteLine("\t\tTime(ms): {0}", stopWatch.ElapsedMilliseconds.ToString().PadLeft(4));

            stopWatch.Reset();

            if (set)

            {

                waitEvent.Set();

            }

        }

        static void AddCallback(IAsyncResult ar)

        {

            double result = ((CalculatorClient)ar.AsyncState).EndAdd(ar);

            Console.WriteLine("Add Result: {0}", result);

            StopWatchStop(true);

        }

        static void SubtractCallback(object sender, SubtractCompletedEventArgs e)

        {

            Console.WriteLine("Subtract Result: {0}", e.Result);

        }

    }

}

The test is pretty basic. I have created two methods, SyncTest and AsyncTest, that call the Add method on the calculator service to add the numbers 1 and 2. The first SyncTest call is used to show that there is some setup time involved with establishing connections, etc. This is an expected latency. There are two synchronous calls per test to illustrate that there is some latency with the initial call and subsequent calls are much faster. With the setup time out of the way with the SyncTest, the AsyncTest should not have to take that much penalty. To be clear, 1 second of latency for the first call is a long time. The AsyncTest and SyncTest tests are run again to show that the 1 second delay doesn't happen again. Here is the output from the test:

 Press enter to start test


Sync test
------------------------------------

Add Result: 3
                Time(ms):   36
Add Result: 3
                Time(ms):    1

Async test
------------------------------------

BeginAdd
Add Result: 3
                Time(ms): 1005
BeginAdd
Add Result: 3
                Time(ms):    1

Async test
------------------------------------

BeginAdd
Add Result: 3
                Time(ms):    4
BeginAdd
Add Result: 3
                Time(ms):    1

Sync test
------------------------------------

Add Result: 3
                Time(ms):    0
Add Result: 3
                Time(ms):    0

 Notice that the first async call takes 1005ms. The question is why? CPU usage is low during this time period. Is there some contention in the WCF code causing this problem?

The tool that probably shows what's going on in the most effective way is new to Visual Studio 2010 and included with the Ultimate edition. You can get to it by going to the Analyze menu in Visual Studio and selecting Launch Performance Wizard. The profiling method we want is Concurrency. Note that "Collect resource contention data" is unchecked, this is intentional.

Start the service.exe outside of VS and then profile the client.exe. When the profiling is finished, select the Thread view. You should see something similar to this:

At the top, we can see the Main thread (6436). There is a very clear section of this thread that is colored in red and lasts for 1 second. That is the time period where the main thread is waiting for the response from that first async call. Notice that during this time there is no CPU activity (colored in green). In fact, there's very little activity at all. The interesting thing to take notice of is the thread at the very bottom (4824). This thread is just starting right before the 1 second wait finishes on the main thread. That thread warrants a closer look. I'll move the two threads closer together and remove other threads we're not worried about. Also, let's just focus on what's going on for that 1 second.

In the image above, we see the activity at the start of the async call on the main thread. We also see a brand new thread (4996) that does nothing. Plus, there is another new thread (4788) that immediately goes to sleep. You can see from the highlighted portion that the sleep is for 500ms. If you look at the lifetime of this thread you will see that it consistently sleeps in 500ms increments. What is this thread for? From some discussions with the guys behind the CLR thread pool I found out that this thread is used to increase the number of threads in the pool. Every 500ms it checks to see if the number of threads in the thread pool is enough to handle the work.

It gets more interesting if we zoom in to the end of the 1 second wait:

Notice how the 4788 thread, at the end of its 500ms sleep, has suddenly decided to create a new completion port thread? If you examine the green part on thread 4824, you will see some calls through WCF to perform the asynchronous work. So the 1 second wait is not due to WCF code, but due to the creation of the IOCP thread. So the question is why are the threads waiting? There was a thread created at the beginning of the 1 second wait. Why is that thread not being used to handle the async call? To understand this, we can look into the creation of that thread.

The first section I'm interested in is the highlighted synchronization section above. It has the following stack trace:

 kernel32.dll!CreateThreadStub
mscorwks.dll!Thread::CreateNewOSThread
mscorwks.dll!Thread::CreateNewThread
mscorwks.dll!ThreadpoolMgr::CreateUnimpersonatedThread
mscorwks.dll!ThreadpoolMgr::CreateCompletionPortThread
mscorwks.dll!ThreadpoolMgr::PostQueuedCompletionStatus
mscorwks.dll!ThreadPoolNative::CorPostQueuedCompletionStatus
system.servicemodel.dll!System.ServiceModel.Channels.IOThreadScheduler+CriticalHelper.ScheduleCallback
system.servicemodel.dll!System.ServiceModel.Channels.IOThreadTimer+TimerManager.Set
system.servicemodel.dll!System.ServiceModel.Channels.IOThreadTimer.Set
system.servicemodel.dll!System.ServiceModel.Dispatcher.DuplexChannelBinder+AsyncDuplexRequest..ctor
system.servicemodel.dll!System.ServiceModel.Dispatcher.DuplexChannelBinder.BeginRequest
system.servicemodel.dll!System.ServiceModel.Channels.ServiceChannel+SendAsyncResult.StartSend
system.servicemodel.dll!System.ServiceModel.Channels.ServiceChannel+SendAsyncResult.Begin
system.servicemodel.dll!System.ServiceModel.Channels.ServiceChannel.BeginCall
system.servicemodel.dll!System.ServiceModel.Channels.ServiceChannelProxy.InvokeBeginService
system.servicemodel.dll!System.ServiceModel.Channels.ServiceChannelProxy.Invoke
mscorlib.dll!System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke
mscorwks.dll!CallTargetWorker2
mscorwks.dll![mscorwks.dll]
mscorwks.dll!TransparentProxyStub_CrossContext
client.exe!Microsoft.ServiceModel.Samples.CalculatorClient.BeginAdd

 We can see something called the IOThreadTimer+TimerManager in this stack. This makes sense as the async call has a timeout associated with it and there has to be something that handles those timeouts. So let's assume that the thread pool has already created a new IOCP thread. That thread is now blocked for the timer manager and cannot do the work for the async call. Why then does it not realize that there are not enough threads in the pool at the first 500ms mark?

There were two problems causing this. When the async call was initially made, two threads should have been created: one for the timer manager and one for handling the async call. Instead, only one was created due to a bug in the CLR. At the first 500ms interval, the thread pool manager should have noticed that there were not enough threads and created a new one. This also failed due to the same bug.

If you now switch the target framework version of the client.exe to .Net 4, you will see different behavior entirely. Here is a shot of what happens there.

The delay for the first async call is now only 189ms for the call itself (it's a little slow because of the attached profiler). Plus, there's no 1 second waiting period prior. Notice that the 500ms sleeper thread is still there as thread id 4648. Except now, it starts first and creates four threads at the same time. Why four? Since I'm on a quad core machine and the default min IO thread count is 1 per processor and there is sufficient work to occupy 4 IO threads, those threads are created. If you set the min IO threads down to 2, it will only create 2 threads because it will first go to min IO threads and then reevaluate every 500ms to see if there are enough threads in the pool. If you set it higher, like 8 or 200, it will only create 6 threads since it only has enough work to occupy those 6. You can kind of infer the logic behind the thread creation here so I won't elaborate further.