Making Service Broker Application Highly Available With AlwaysOn

 

If you have a Service Broker (SSB) application connecting to SQL Server using an AlwaysOn availability group listener, in the event of an unexpected failover, some messages may be lost or stuck in the transmission queue on the old primary (new secondary) after the failover. This could be an automatic failover or manual failover when the SSB application is not stopped.

Incoming (SSB) connections are severed during a availability group failover and those connections use the listener IP address which directs connections to the new primary replica. New incoming SSB connections, established through the listener, will go to the new replica. This is the correct and intended behavior.

 

clip_image001

 

The availability group fails over and the new inbound SSB connections reset to the new primary replica:

 

clip_image002

 

However, outgoing SSB connections from the old primary replica are left open after a failover, still in session with the initiator, and remain so because the IP addresses are still valid. New outgoing SSB connections are established from the new replica to the initiator, but because the connections from the old primary are still open, incoming SSB messages are still sent to the old primary.

Since Service Broker components are not contained within the availability group, special handling is required to handle the Service Broker connections when the availability group moves to a different replica instance. If you feel the post failover handling for the Service Broker connection is not feasible to be built into your application, please submit a Connect item (https://connect.microsoft.com/sql) so that we are able to understand the community need for this functionality. As an alternative to modifying the code for your application, you could choose to use a SQL Server Failover Cluster Instance which will prevent the issue described above, due to the IP Address remaining the same on failover.

 

clip_image003

 

UPDATE: Enhanced Compatibility When Hosting Service Broker in Availability Groups

This behavior has been improved in the following versions of SQL Server:

SQL 2014 SP1/CU12 & SP2/CU5

SQL 2016 RTM/CU6 & SP1/CU3

SQL 2012 SP4

For more information on these improvements, see the following KnowledgeBase article:

FIX: Service Broker endpoint connections aren't closed after an availability group failover in SQL Server

Troubleshooting

If, after the failover, you run the following query on the old primary SSB TARGET and see the value for is_accepted = 0 then you know you have encountered the issue. You will also likely have two or more connections returned here. 

--look for is_accept = 0
select is_accept, * from sys.dm_broker_connections with(nolock)

 

Another side effect are messages stuck in the sys.transmission_queue on the old primary SSB TARGET.

If Profiler trace is run on the old primary SSB TARGET, you would see the following error:

 

The message has been dropped because the service broker in the target database is unavailable: ‘The database is in read-only mode.’

clip_image002[1]

 

SQL Server XEvent ‘broker_message_undeliverable’ can also be collected using SQL Server extended events and will also report that messages cannot be delivered. For more information on using SQL Server extended events, see:

Create an Extended Events Session Using the Wizard (Object Explorer)

Workaround

Host the primary replica on a SQL Failover Cluster Instance (SQL FCI). In the event of a failover, SQL Server restarts and the initiator will re-establish connectivity with the availability group listener and the TARGET will re-establish connections with the initiator. There will be no residual broker connections or in-flight transactions that cannot be resolved.

In this scenario, you still have a secondary replica hosted on a standalone instance of SQL Server which provides another means of high availability.

 

clip_image004

 

Maintenance

When failing manually over between availability group replicas for rolling upgrade purposes, stop the Service Broker Application before doing the manual availability group failover to prevent any in-flight transactions from getting stuck.