Unable to start BizTalk RFID Process when it is bound to too many devices (Device Provider specific issue)

Recently I worked on a very interesting BizTalk RFID issue where the RFID Services appear to hang when trying to start a process that is bound to 25 devices. After a great deal of isolation and some invaluable assistance from Mark Simms (Thank you Mark!) we were able to identify that the issue was occurring as the custom provider was running out of worker threads in its .NET Thread Pool.

The symptoms observed were as under:

2 BizTalk RFID Processes

1 of them connected to 6 devices and the other to 15 devices.

The minute we try to bind one of the processes to 3 more devices (total of 24 devices), the performance seen in the RFID Manager deteriorates and we see all the devices going into Retrying State.

(The performance deterioration mentioned here does not refer to the performance of the RFID Manager by itself. But the behavior observed when the MMC tries to retrieve the latest state of various components - Providers, Devices, Processes etc. - which basically means that the MMC is waiting for the Device Manager, Provider Manager or the Process Manager to respond)

After being busy for a very long time (much longer than the default device connection timeout of 00:01:00), the MMC finally returns with the following error:

"This request operation sent to net.tcp://crpcltiis51e:7891/rfid/service/ProcessManager did not receive a reply within the configured timeout (00:01:00). The time allotted to this operation may have been a portion of a longer timeout. This may be because the service is still processing the operation or because the service was unable to send a reply message. Please consider increasing the operation timeout (by casting the channel/proxy to IContextChannel and setting the OperationTimeout property) and ensure that the service is able to connect to the client."

Troubleshooting

Now considering that there were too many variables involved (Multiple processes, Numerous devices, 3rd party device provider) I decided to start from a more basic test and then add each of the components gradually. Here were some of the questiosns/tests I had in mind to try and isolate the issue:

1. Does the issue happen with a single process connected to multiple devices?

Yes, this was indeed the case. A single process was also failing when trying to connect to 25 devices.

2. What is the threshold at which we were failing?

Created a test process, bound it to 5 devices to start with - the test was successful. Added 5 more devices in each progressive test and found that we are able to connect to 20 devices successfully. It worked fine with even 22 devices, but failed after binding the process to the 23rd device.

3. Could the issue be with the 23rd device that we just added?

Could be - but this was not the case over here. Reason being, we are able to connect to that device when it was the only one in the binding list. The failure is only when a process is bound to too many devices.

4. Is there a ceiling to the number of devices that this specific provider can manage?

Not this provider. The was confirmed by the fact that the customer had other implementations where the same provider was being used to manage/connect to more than 28 devices.

So far, here is what we know:

- There is no limit that BizTalk RFID service imposes, to the number of devices that an RFID process can connect to.

- There is no limit imposed by the Custom provider either.

- This has got to be an environmental issue.

5. What do the RFID Services log show?:

For devices where we were able to connect successfully the following combination of entries were found in the RFIDServices.log:

48| Info|031909 15:31:44|In UpdateDeviceConnectionsThreadPoolProc for device Device1|[DeviceManager] (The point where we start connecting to a device)

...

...

48| Info|031909 15:31:46|Setup connection on Device1 succeeded|[Device] (Confirms a successful connection attempt)

versus, for the device where the connection attempt failed, the following combination of entries were found:

35| Info|031909 15:32:00|In UpdateDeviceConnectionsThreadPoolProc for device Device23|[DeviceManager]

35| Info|031909 15:32:00|Opening the device Device|Device23..

...

35|Warning|031909 15:33:08|Dedicated thread SetupConnection35 timing out. Regaining control forcibly by firing an event and throwing an exception|[ProviderManager]

35|Warning|031909 15:33:08|Provider CustomProvider Provider misbehaved: ThreadOrphaned|[ProviderManager]

35| Error|031909 15:33:08|The device Device|Device23... could not be connected to because of exception [The method SetupConnection has timed out. If this is a provider/device operation, the timeout can occur if the provider/device could not complete the operation within a specified time period. If you encounter this frequently, consider increasing the value of delegateThreadTimeoutMilliseconds in ProviderManagerConfiguration under RfidServerRuntimeConfiguration.] [ at MS.Internal.Rfid.Service.Provider.LocalDomainDeviceMarshaller.RunDelegateMethod(String methodName, Type[] externalTypes, Object[] externalargs)

at MS.Internal.Rfid.Service.Provider.LocalDomainDeviceMarshaller.SetupConnection(AuthenticationInformation authenticationInfo)

at MS.Internal.Rfid.Service.Devices.RfidDevice.ActuallySetupConnection(Boolean fCheckAndHandleConflicts)].|[Device]

(Notice the time difference between the point where the connection attempt started and the point of the error - so it makes sense why a timeout was occurring.)

In fact on tracking the connection time to the various devices in the device binding list, here is what we saw:

clip_image002

This performance degradation observed by near linear growth in connection times seemed to indicate that we are running out of some resource over a period of time.

(Sure, at this point we could increase the device connection timeout as suggested in the initial error message - but this would be like applying a band-aid on a more serious problem underneath. Reason: why does it take more than 1 minute to connect to a device under one scenario i.e. too many devices in the binding list, whereas it takes almost no time to connect to the same devices in a different scenario i.e. fewer devices being bound to.)

The next logical step would be to look into the provider logs to find out what was happening with the corresponding connection attempt at the provider level. To be very honest, I tried to identify the specific connection attempt but was unsuccessful :( - hence I thought of reaching out to the provider writers i.e. 3rd party vendors, and see if they could find anything in the provider logs corresponding to the "Provider misbehaved" warning.

6. On speaking with the provider writers we learnt that they were using the .NET Thread Pool to manage connections to the various devices. This prompted us to review the following article:

Contention, poor performance, and deadlocks when you make Web service requests from ASP.NET applications

This confirmed that the resource we were running out of over a period of time was .NET Worker Threads. On increasing the following settings in machine.config we were able to resolve this issue:

Configuration Setting
Value

maxconnection
12 * #CPUs

maxIoThreads
100

maxWorkerThreads
100

minFreeThreads
88 * #CPUs

minLocalRequestFreeThreads
76 * #CPUs

Please Note: The above values are only recommended values - and are by no means the *exact* values that would be needed in order to resovle the issue. Being a performance tuning scenario, this is something that depends on the environment.