502 Bad Gateway errors when connecting to a Service Bus Relay endpoint during a load test with more than 100 concurrent users

 

Recently, I worked with a customer who was using a Windows Azure Service Bus Relay endpoint to connect their service endpoint with its clients. During a load test, the customer saw '502 - Bad Gateway' errors once the load passed 100 concurrent users. I expect other customers may face a similar issue when using Service Bus Relay with an older Service Bus client library, so I decided to blog about the issue and how we were finally able to mitigate it.

PROBLEM DEFINITION

The customer developed an on-premises WCF service that clients reach through a Service Bus Relay endpoint. During load testing with more than 100 concurrent users, they noticed that the client application was consistently throwing '502 - Bad Gateway' exceptions.
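To reproduce this kind of failure outside a full load-testing tool, a small harness that fires requests from many threads and tallies the HTTP status codes is often enough. The following is a minimal sketch, not the customer's actual test setup: `run_load_test` and `http_status` are hypothetical helper names, and threads only approximate real concurrent users.

```python
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def http_status(url, timeout=30):
    """Return the HTTP status code for a GET request, including error codes such as 502."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the status code is on the exception.
        return err.code


def run_load_test(send_request, concurrent_users, requests_per_user=1):
    """Call send_request() from concurrent_users threads and count each status code."""
    def one_user(_):
        # Each simulated user sends its requests one after another.
        return [send_request() for _ in range(requests_per_user)]

    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        per_user = list(pool.map(one_user, range(concurrent_users)))

    return Counter(code for codes in per_user for code in codes)
```

A call such as `run_load_test(lambda: http_status(relay_url), concurrent_users=150)`, where `relay_url` is a placeholder for your own relay endpoint, returns a counter in which a non-zero `502` entry would indicate the problem described here.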

INVESTIGATION DETAILS

To investigate this issue, we collected the following data while the issue was occurring:

    1. Service Bus tracking id for the failed request
    2. Simultaneous Netmon traces from both the server and the client side
    3. Memory hang dump from the server

> Service Bus tracking ID for the failed request: during the load test, we captured the following error, including its tracking ID, from a failed request:

  • <Error>
  •   <Code>502</Code>
  •   <Detail>Bad Gateway.TrackingId:69a242f6-633e-41b6-a623-afabdbb1d139_G25,TimeStamp:xx/xx/20xx 6:25:44 AM</Detail>
  • </Error>
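Because the tracking ID is what support needs to look up a request in the backend, it helps to pull it out of every failed response during a load test. The sketch below assumes the error body always has the `<Error><Code>…</Code><Detail>…</Detail></Error>` shape shown above; `parse_relay_error` is a hypothetical helper name, and the timestamp is kept as raw text rather than parsed.

```python
import re
import xml.etree.ElementTree as ET

# Matches the "TrackingId:...,TimeStamp:..." tail of the <Detail> text shown above.
TRACKING_PATTERN = re.compile(
    r"TrackingId:(?P<tracking_id>[^,]+),\s*TimeStamp:(?P<timestamp>.+)"
)


def parse_relay_error(body):
    """Extract the status code, tracking ID and timestamp from a relay error body."""
    root = ET.fromstring(body)
    detail = (root.findtext("Detail") or "").strip()
    match = TRACKING_PATTERN.search(detail)
    return {
        "code": int(root.findtext("Code")),
        "tracking_id": match["tracking_id"].strip() if match else None,
        "timestamp": match["timestamp"].strip() if match else None,
    }
```

Logging the returned `tracking_id` next to each failed request makes it easier to hand a precise list to the Service Bus support team.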

    Looking up this tracking ID in the Service Bus backend infrastructure gave us more details, as shown below:

  • <snip>
  • Message       = Failure processing web request. Request Uri: https://microsoftapprovalstest.servicebus.windows.net/<<servicebus relay endpoint>>
  • Exception: Bad Gateway ---> System.ServiceModel.EndpointNotFoundException: There are no active listeners registered for endpoint: sb://microsoftapprovalstest.servicebus.windows.net/<<servicebus relay endpoint>>.
  • <snip>

> The above exception on the Service Bus server side indicates that, for some unknown reason, the SB relay component was unable to communicate with the listener. In general, one of the following failures can cause this kind of issue:

o Network issue between the Service Bus Relay and the listener side. (We ruled this out by reviewing the network trace, where we could not find any connection resets. We also confirmed that the service was using the WebStream protocol for communication.)

o The Service Bus listener component is not responding. (Reviewing the memory dump, I could see around 380 threads, most of them waiting on some kind of Service Bus call; however, the WCF server itself was in a healthy state.)

> Based on the above details, I reviewed some historical data and noticed that this might be an issue in the part of the Service Bus client library that handles the WebStream protocol. I also noticed that the latest Service Bus client binaries include many improvements in handling concurrent connections. The new Web Sockets protocol may also help to negotiate better through certain proxy servers. The following are the two major new Relay features in the Service Bus client library shipped with SDK version 2.3.

Web Sockets - Relay over HTTP/HTTPS previously used a Microsoft proprietary protocol called WebStreams. Support for the open standard Web Sockets has now been added, so when using HTTP/HTTPS, Web Sockets will be tried before falling back to WebStreams. Only Relay supports the Web Sockets protocol; Messaging does not. This change should be transparent to customers. Requires Microsoft.ServiceBus.dll version 2.3, which ships with Azure SDK 2.3 and is also available from NuGet.

AMQP Control Channel - Once again, a transparent change for customers. For Relay listeners, a control channel is established with the service; until now this was a WCF channel. The control channel is what indicates to the service that a listener is connected, and what the service uses to tell the listener that clients are connecting to it. With this release, the control channel uses AMQP rather than WCF. Since AMQP is a lighter-weight connection, this is expected to provide better performance and COGS benefits. Requires Microsoft.ServiceBus.dll version 2.3, which ships with Azure SDK 2.3 and is also available from NuGet.

> Based on the above details, my customer upgraded to the latest Service Bus client library NuGet package from https://www.nuget.org/packages/WindowsAzure.ServiceBus , and confirmed that, after the upgrade, they were able to run a load test with more than 150 concurrent users without any issue.
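If you are unsure whether a project already references a new enough client library, one quick check is to read the package version from the project's NuGet `packages.config` file. The sketch below is only an illustration: it assumes the project uses a `packages.config` file, that the NuGet package version lines up with the 2.3 library version mentioned above, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: package version 2.3 or later carries the Relay improvements described above.
MIN_VERSION = (2, 3)
PACKAGE_ID = "WindowsAzure.ServiceBus"


def parse_version(text):
    """Turn a version string such as '2.3.0.0' into a tuple of integers."""
    numeric = text.split("-", 1)[0]  # ignore any pre-release suffix
    return tuple(int(part) for part in numeric.split("."))


def service_bus_version(packages_config_xml):
    """Return the referenced WindowsAzure.ServiceBus version, or None if it is absent."""
    root = ET.fromstring(packages_config_xml)
    for package in root.iter("package"):
        if package.get("id", "").lower() == PACKAGE_ID.lower():
            return parse_version(package.get("version"))
    return None


def has_relay_improvements(packages_config_xml):
    """True when the referenced package version is at least MIN_VERSION."""
    version = service_bus_version(packages_config_xml)
    return version is not None and version >= MIN_VERSION
```

Passing the text of a project's `packages.config` to `has_relay_improvements` gives a quick yes/no answer before you schedule another load test.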