Making Service Broker Application Highly Available With AlwaysOn


 

If you have a Service Broker (SSB) application connecting to SQL Server using an AlwaysOn availability group listener, in the event of an unexpected failover, some messages may be lost or stuck in the transmission queue on the old primary (new secondary) after the failover. This could be an automatic failover or manual failover when the SSB application is not stopped.

Incoming (SSB) connections are severed during a availability group failover and those connections use the listener IP address which directs connections to the new primary replica. New incoming SSB connections, established through the listener, will go to the new replica. This is the correct and intended behavior.

 

clip_image001

 

The availability group fails over and the new inbound SSB connections reset to the new primary replica:

 

clip_image002

 

However, outgoing SSB connections from the old primary replica are left open after a failover, still in session with the initiator, and remain so because the IP addresses are still valid. New outgoing SSB connections are established from the new replica to the initiator, but because the connections from the old primary are still open, incoming SSB messages are still sent to the old primary.

Since Service Broker components are not contained within the availability group, special handling is required to handle the Service Broker connections when the availability group moves to a different replica instance. If you feel the post failover handling for the Service Broker connection is not feasible to be built into your application, please submit a Connect item (http://connect.microsoft.com/sql) so that we are able to understand the community need for this functionality. As an alternative to modifying the code for your application, you could choose to use a SQL Server Failover Cluster Instance which will prevent the issue described above, due to the IP Address remaining the same on failover.

 

clip_image003

 

UPDATE: Enhanced Compatibility When Hosting Service Broker in Availability Groups

This behavior has been improved in the following versions of SQL Server:

SQL 2014 SP1/CU12 & SP2/CU5

SQL 2016 RTM/CU6 & SP1/CU3

SQL 2012 SP4

For more information on these improvements, see the following KnowledgeBase article:

FIX: Service Broker endpoint connections aren't closed after an availability group failover in SQL Server


Troubleshooting

If, after the failover, you run the following query on the old primary SSB TARGET and see the value for is_accepted = 0 then you know you have encountered the issue. You will also likely have two or more connections returned here. 



--look for is_accept = 0
select is_accept, * from sys.dm_broker_connections with(nolock)

 

Another side effect are messages stuck in the sys.transmission_queue on the old primary SSB TARGET.

If Profiler trace is run on the old primary SSB TARGET, you would see the following error:

 

The message has been dropped because the service broker in the target database is unavailable: ‘The database is in read-only mode.’


clip_image002[1]

 

SQL Server XEvent ‘broker_message_undeliverable’ can also be collected using SQL Server extended events and will also report that messages cannot be delivered. For more information on using SQL Server extended events, see:

Create an Extended Events Session Using the Wizard (Object Explorer)


Workaround

Host the primary replica on a SQL Failover Cluster Instance (SQL FCI). In the event of a failover, SQL Server restarts and the initiator will re-establish connectivity with the availability group listener and the TARGET will re-establish connections with the initiator. There will be no residual broker connections or in-flight transactions that cannot be resolved.

In this scenario, you still have a secondary replica hosted on a standalone instance of SQL Server which provides another means of high availability.

 

clip_image004

 

Maintenance

When failing manually over between availability group replicas for rolling upgrade purposes, stop the Service Broker Application before doing the manual availability group failover to prevent any in-flight transactions from getting stuck.


Comments (12)

  1. Dave Ballantyne says:

    Stopping and Starting the Endpoints as suggested in the workarounds on the connect item on failover ( using an alert ) work fine.

    1. Not always, only if at the moment of failower the queue (sys.transmission_queue) was empty

      1. Dave Ballantyne says:

        As we have now found out 🙂

  2. Jason Coombes says:

    Did this just get fixed in SQL 2014 SP1/CU12 & SP2/CU5?

    https://support.microsoft.com/en-us/help/4016361

    Seems to describe the issue above so I think so.

      1. Hurried, don’t working

        1. Jason Coombes says:

          That’s a shame. It does seem that this fix describes this exact problem though?

        2. Jason Coombes says:

          Looking at the Connect item you raised it seems you are using “Microsoft SQL Server 2016” and the fix I linked to relates to “SQL 2014 SP1/CU12 & SP2/CU5” so might explain why you’re not seeing the fix?

    1. Jason Coombes says:

      That fix just got updated to include SQL 2016 RTM/CU6 & SP1/CU3.

      1. Yes!!! Fixed – Microsoft SQL Server 2016 (SP1-CU3)

        1. chad says:

          I installed SQLSERVER2016 SP1-CU3 FOR Primary Node and Secondary Node, the issue is still the same for me

Skip to main content