SQL Server 2012 AlwaysOn – Part 8 – Failover Mechanism with SAP Netweaver

[Edit] You can find the summary of the complete series here.

In part 8 of our series we want to look a bit behind the scenes at how the SAP reconnect mechanism works together with AlwaysOn. In part #7 we already referred to failover exercises. We need to distinguish two parts: first, the SAP part of detecting an issue with a connection, and second, the SQL Server part, which handles the failover of the primary replica to another node. Let's look at the SAP part first.

SAP Reconnect Mechanism

SAP work processes each open their own connections to SQL Server. Around 15 years ago, SAP implemented a reconnect mechanism for their work processes which was intended to be used with failover cluster solutions like WSFC. In principle, the mechanism works like this:

  • An SAP work process which issues a database request, or has a database request open, receives an error message.
  • The Database Dependent Interface of the SAP ABAP stack analyzes the error code against a series of predefined error codes which usually signal a breakdown of the database connection. If the error code matches one of those, the SAP process is set into reconnect status and the business transaction on the SAP side is rolled back. The break and reset of the business transaction also result in an ABAP short dump.
  • The particular work process now tries to reconnect to the database a number of times, pausing between the attempts. The number of attempts and the pause duration are configurable with the following parameters, which in current SAP Basis releases have these default values:
    • rsdb/reco_trials = 3
    • rsdb/reco_sleep_time = 5 - the unit is seconds; with the defaults a work process therefore retries for roughly 3 x 5 = 15 seconds in total
  • If the reconnect is successful, everything is fine again; the connection issue might simply have been caused by a network glitch.
  • If the work process couldn't reconnect, it stays up and running, but without a connection to the database. The request on the work process results in an ABAP short dump, and the work process remains in reconnect state. Additionally, all other work processes of the SAP ABAP instance are set into reconnect status as well.
  • All the work processes now check their connection state with a short ping against the database (executing a short query). If the connections are broken, they remain passive and do not attempt a reconnect until a request is assigned to a work process which requires issuing a query against the database.
  • At that point the particular work process attempts to reconnect to the database. If the attempt is successful, the work process moves out of the reconnect state.
  • If the attempt is not successful, the request on the work process is answered with a short dump again, and the loop of setting all the work processes into reconnect state starts over, with all processes again waiting for a request to be assigned before attempting a reconnect.

The consequences of this reconnect implementation are:

  • Smaller glitches can be compensated for without an ABAP short dump and a forced reconnect, provided the connection break does not happen in the middle of the execution of a SQL statement and another work process has already run into the issue of having lost the connection.
  • In case of a WSFC or AlwaysOn failover, where all processes get disconnected, not ALL work processes will immediately reconnect. Given the logic described above and the time it takes to fail over, it is very likely that most of the work processes remain in reconnect state up to the point where a new SQL request needs to be issued on one of their connections.
  • In case of a failover, at least one ABAP short dump per instance is to be expected. That would be either an in-flight statement or the first query executed on one of the connections after the disconnect or connection loss.
  • All 'transactions'/'screens' or batch jobs on the SAP application side are rolled back, because the SAP NetWeaver logic assumes that transactions issued through a disconnected connection were rolled back on the DBMS side as an effect of the disconnect. It does not matter whether the reconnect targets the same server or another server that is part of a high availability configuration like SQL Server AlwaysOn. The reason for the disconnect also makes no difference in this rollback behavior on the SAP application side and its assumption that a disconnect leads to a rollback of the DBMS-side transaction that was initiated through the disconnected connection.

Documentation on the SAP reconnect mechanism can be found here:

https://help.sap.com/saphelp_nw70/helpdata/en/08/5744774ae611d1894f0000e829fbbd/content.htm

OSS note #98051 – Database Reconnect: Architecture and Function

OSS note #24806 – Database Reconnect: Technical Settings and Configuration

Failing Over

Now let's see how this looks in reality and attempt a failover. First let's look at a controlled failover. For this purpose we connect with SSMS to the instance running the primary replica. A right-click on the availability group offers the 'Failover…' item, which gets us to the Failover Wizard.

 

Over the next few steps, we go through the Failover Wizard. In the first step we need to choose the secondary replica we want to fail over to. In the wizard, asynchronous replicas are shown with a warning about the possibility of data loss. In our case we chose to fail over to the synchronous replica.

 

The next step asks us to connect to the secondary replica that is the target of the failover.

After connecting and advancing to the next step, a summary dialog is shown:

Pressing 'Next' executes the failover. At the end of the step, we will see this screen:
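
For reference, the same controlled failover can be issued with T-SQL instead of the wizard. This is a minimal sketch, using the availability group name 'ERPPRODE64' that appears in the Errorlog excerpts below; the statement has to be executed on the instance hosting the secondary replica you want to fail over to:

-- Planned manual failover, executed while connected to the instance that
-- hosts the failover target (a synchronized, synchronous-commit secondary):
ALTER AVAILABILITY GROUP [ERPPRODE64] FAILOVER;

-- For an asynchronous secondary only a forced failover is possible, which
-- corresponds to the data-loss warning shown in the wizard:
-- ALTER AVAILABILITY GROUP [ERPPRODE64] FORCE_FAILOVER_ALLOW_DATA_LOSS;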

SQL Server Errorlog

Looking into the Errorlog of the SQL Server instance we failed over to, we see a series of log entries (even without any trace flags enabled) which describes the following sequence:

  • The secondary replica moves from state 'SECONDARY_NORMAL' to 'RESOLVING_PENDING_FAILOVER'
  • The secondary replica moves from 'RESOLVING_PENDING_FAILOVER' to 'RESOLVING_NORMAL' due to a request from WSFC
  • The connection to the primary replica is terminated
  • The secondary replica moves from 'RESOLVING_NORMAL' to 'PRIMARY_PENDING'
  • The start of listening on the Availability Group listener is logged. This means the Windows Cluster network name resource of the listener has been brought online on the server running the new primary replica
  • The state of the former secondary changes from 'PRIMARY_PENDING' to 'PRIMARY_NORMAL'
  • The role of the database on the former secondary changes from 'RESOLVING' to 'PRIMARY'
  • In this state you'll find messages in the Errorlog stating that database recovery has been executed, including the number of transactions (transactions which performed data manipulation) that were rolled back and the time it took
  • Messages that the new primary replica has built up connections to the now secondary replica (the old primary) and to any other secondary replicas

There might be some other messages in between, for example stating that the Service Broker endpoint is disabled, or messages around the WSFC cluster triggering the failover. But in essence the sequence of steps above, with the different states between SECONDARY and PRIMARY, is what is shown.
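
If you prefer not to dig through the Errorlog, the replica roles and states mentioned above can also be queried directly from the AlwaysOn DMVs. A minimal sketch of such a query, run against any instance participating in the availability group (column names as documented for SQL Server 2012):

-- Show role and state of all replicas in the availability groups known to this instance
SELECT ag.name AS availability_group,
       ar.replica_server_name,
       ars.role_desc,                     -- PRIMARY / SECONDARY / RESOLVING
       ars.operational_state_desc,
       ars.connected_state_desc,
       ars.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar ON ar.replica_id = ars.replica_id
JOIN sys.availability_groups ag ON ag.group_id = ars.group_id;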

The SQL Server Errorlog of the new primary replica could look like this:

The state of the local availability replica in availability group 'ERPPRODE64' has changed from 'SECONDARY_NORMAL' to 'RESOLVING_PENDING_FAILOVER'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

AlwaysOn: The local replica of availability group 'ERPPRODE64' is preparing to transition to the primary role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.

The state of the local availability replica in availability group 'ERPPRODE64' has changed from 'RESOLVING_PENDING_FAILOVER' to 'RESOLVING_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

AlwaysOn Availability Groups connection with primary database terminated for secondary database 'E64' on the availability replica with Replica ID: {2b709837-222f-41b0-98fb-28f0ec8adb7c}. This is an informational message only. No user action is required.

The state of the local availability replica in availability group 'ERPPRODE64' has changed from 'RESOLVING_NORMAL' to 'PRIMARY_PENDING'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

Started listening on virtual network name 'ERPPRODE64vn'. No user action is required.

The availability group database "E64" is changing roles from "SECONDARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.

Error: 983, Severity: 14, State: 1.

Unable to access database 'E64' because its replica role is RESOLVING which does not allow connections. Try the operation again later.

Error: 983, Severity: 14, State: 1.

Unable to access database 'E64' because its replica role is RESOLVING which does not allow connections. Try the operation again later.

Nonqualified transactions are being rolled back in database E64 for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.

The state of the local availability replica in availability group 'ERPPRODE64' has changed from 'PRIMARY_PENDING' to 'PRIMARY_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

The availability group database "E64" is changing roles from "RESOLVING" to "PRIMARY" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.

The Service Broker endpoint is in disabled or stopped state.

0 transactions rolled back in database 'E64' (5:0). This is an informational message only. No user action is required.

Recovery is writing a checkpoint in database 'E64' (5). This is an informational message only. No user action is required.

A connection for availability group 'ERPPRODE64' from availability replica 'SAPDENALI5' with id [1497D576-0601-4557-82A1-746010A4ECE8] to 'SAPDENALI6' with id [2B709837-222F-41B0-98FB-28F0EC8ADB7C] has been successfully established. This is an informational message only. No user action is required.

A connection for availability group 'ERPPRODE64' from availability replica 'SAPDENALI5' with id [1497D576-0601-4557-82A1-746010A4ECE8] to 'SAPDENALI7' with id [ECEADB7E-E1C7-416D-8249-5000B021D770] has been successfully established. This is an informational message only. No user action is required.

AlwaysOn Availability Groups connection with secondary database established for primary database 'E64' on the availability replica with Replica ID: {2b709837-222f-41b0-98fb-28f0ec8adb7c}. This is an informational message only. No user action is required.

AlwaysOn Availability Groups connection with secondary database established for primary database 'E64' on the availability replica with Replica ID: {2b709837-222f-41b0-98fb-28f0ec8adb7c}. This is an informational message only. No user action is required.

AlwaysOn Availability Groups connection with secondary database established for primary database 'E64' on the availability replica with Replica ID: {2b709837-222f-41b0-98fb-28f0ec8adb7c}. This is an informational message only. No user action is required.

AlwaysOn Availability Groups connection with secondary database established for primary database 'E64' on the availability replica with Replica ID: {eceadb7e-e1c7-416d-8249-5000b021d770}. This is an informational message only. No user action is required.

AlwaysOn Availability Groups connection with secondary database established for primary database 'E64' on the availability replica with Replica ID: {eceadb7e-e1c7-416d-8249-5000b021d770}. This is an informational message only. No user action is required.

What you will also find a lot in this situation are messages like the ones already shown above:

Error: 983, Severity: 14, State: 1.

Unable to access database 'E64' because its replica role is RESOLVING which does not allow connections. Try the operation again later.

This is the result of SAP work processes going through their reconnect logic as described above, while the former secondary is already listening on the virtual name but the database itself is not yet open. On larger SAP systems a few hundred such messages can easily be written into the Errorlog within a few seconds before the database finally opens up.
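
If you want to see how many of these Error 983 entries were written during a failover, you can search the current Errorlog from T-SQL. A minimal sketch, assuming the (undocumented but commonly used) xp_readerrorlog procedure and the message text shown above:

-- Search the current SQL Server Errorlog (file 0, log type 1 = SQL Server log)
-- for the messages written while the database role was still RESOLVING
EXEC xp_readerrorlog 0, 1, N'replica role is RESOLVING';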

On the former primary, the sequence of events as documented in the SQL Server Errorlog stacks up like this:

  • Message that a transition is going to be made on behalf of WSFC
  • Change of state from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL'
  • Termination of connection to former secondary replicas
  • Message that the state changed from 'PRIMARY' to 'RESOLVING'
  • Message that the instance stopped listening on the listener name and, with that, on the IP address associated with the listener
  • State change from 'RESOLVING_NORMAL' to 'SECONDARY_NORMAL'
  • Startup of the database(s) in the Availability Group again
  • State Change from 'RESOLVING' to 'SECONDARY'
  • Some recovery messages on the databases within the AG
  • Messages about the fact that communication to the instance running the new primary replica has been established.
  • Depending on the time it takes to fail over and to recover the new primary replica, there might be messages documenting a timeout while connecting to the new primary. If the connection is established in the end, everything is fine (a quick way to verify this is sketched below).
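
A quick way to verify from the former primary (now secondary) that it has rejoined the availability group and is synchronizing again is to look at the local DMVs. A minimal sketch, using column names as documented for SQL Server 2012:

-- Run on the former primary after the failover: the local replica should report
-- the SECONDARY role and the database should be SYNCHRONIZED again (synchronous commit)
SELECT DB_NAME(drs.database_id) AS database_name,
       ars.role_desc,
       ars.connected_state_desc,
       drs.synchronization_state_desc,
       drs.synchronization_health_desc
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.dm_hadr_availability_replica_states ars
     ON ars.replica_id = drs.replica_id AND ars.group_id = drs.group_id
WHERE drs.is_local = 1;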

 

The SQL Server Errorlog of the former primary replica, now secondary replica, could look like this:

AlwaysOn: The local replica of availability group 'ERPPRODE64' is preparing to transition to the resolving role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.

The state of the local availability replica in availability group 'ERPPRODE64' has changed from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

AlwaysOn Availability Groups connection with secondary database terminated for primary database 'E64' on the availability replica with Replica ID: {eceadb7e-e1c7-416d-8249-5000b021d770}. This is an informational message only. No user action is required.

AlwaysOn Availability Groups connection with secondary database terminated for primary database 'E64' on the availability replica with Replica ID: {1497d576-0601-4557-82a1-746010a4ece8}. This is an informational message only. No user action is required.

The availability group database "E64" is changing roles from "PRIMARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.

Stopped listening on virtual network name 'ERPPRODE64vn'. No user action is required.

Nonqualified transactions are being rolled back in database E64 for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.

Remote harden of transaction 'GhostCleanupTask' (ID 0x000000000d748930 0000:05a2815a) started at Jul 10 2012 9:36PM in database 'E64' at LSN (4889:54648:2) failed.

The state of the local availability replica in availability group 'ERPPRODE64' has changed from 'RESOLVING_NORMAL' to 'SECONDARY_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

A connection for availability group 'ERPPRODE64' from availability replica 'SAPDENALI6' with id [2B709837-222F-41B0-98FB-28F0EC8ADB7C] to 'SAPDENALI5' with id [1497D576-0601-4557-82A1-746010A4ECE8] has been successfully established. This is an informational message only. No user action is required.

The availability group database "E64" is changing roles from "RESOLVING" to "SECONDARY" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.

Nonqualified transactions are being rolled back in database E64 for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.

Nonqualified transactions are being rolled back in database E64 for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.

SAP Developer Traces

On the SAP side, as mentioned, we find at least one SAP process issuing an ABAP short dump. Looking into the developer trace of an SAP work process, the failover is recorded and commented on as well.

SAP developer traces can be found under <Drive:>\usr\sap\<SID>\DV...<inst. ID>\work on the local server running the SAP application instance, e.g. C:\usr\sap\E64\DV10\work. Every work process has a file called dev_<wp#>, where the enumeration aligns with the list of work processes shown in transaction SM50 for the particular instance.

As comments in those developer traces we will see:

  • [HY000] [Microsoft][SQL Server Native Client 11.0]Unspecified error occurred on SQL Server. Connection may have been terminated by the server.

  • (10054) [08S01] [Microsoft][SQL Server Native Client 11.0]TCP Provider: An existing connection was forcibly closed by the remote host.

  • (10054) [08S01] [Microsoft][SQL Server Native Client 11.0]Communication link failure

  • Rollback Work (99)

  • LOG BYM=> severe DB error 10054 ; work process in reconnect status

  • Reconnect state is entered by connection

  • Several entries like: C ERROR: -1 in function SQLConnectWithRetry (SQLConnectWithRetry) [line 2065]

    • [42000] [Microsoft][SQL Server Native Client 11.0][SQL Server]Unable to access database 'E64' because its replica role is RESOLVING which does not allow connections. Try the operation again later.
    • ERROR: -1 in function SQLConnectWithRetry (SQLConnectWithRetry) [line 2065]

    These repeat up to the point where the reconnect is successful. Depending on when a new statement is supposed to be executed, only a small number of those messages, or none at all, might appear. If the reconnect attempts happen while the failover is still in flight, more of those messages will be seen.

  • Using dynamic link library 'C:\usr\sap\E64\SYS\exe\uc\NTAMD64\dbmssslib.dll'

  • Network connection used from SAPDL5852 to ERPPRODE64vn;MultiSubnetFailover=yes using tcp:ERPPRODE64vn;MultiSubnetFailover=yes

  • Driver: sqlncli11.dll Driver release: 11.00.2100

  • GetDbRelease: 11.00.2100

  • GetDbRelease: Got DB release numbers (11,0,2100)

  • Connection 0 opened (DBSL handle 0)

  • successfully reconnected to connection 0

  • LOG BYY=> work process left reconnect

The latter part of the sequence should show up twice since two connections are getting re-established. In order to keep the sequence readable, we cut quite a few messages out. So don't be surprised to find some more messages between the ones shown here. But the order of events is that a process detects a connection loss, then tries to reconnect, and eventually switches into reconnect status. You also might see a message looking like this:

  • ERROR: -1 in function DoPing (SQLExecDirect failed) [line 10284]

    (10054) [08S01] [Microsoft][SQL Server Native Client 11.0]TCP Provider: An existing connection was forcibly closed by the remote host.

This would be the reaction of a work process which was not directly affected by the connection break, i.e. a process whose connections were idle at the point of the break, but which got signaled that it is in 'reconnect' state and now checks the status of its connection.

You also might find messages indicating that ABAP short dumps did happen. These would come from processes which had in-flight requests to the database or had not yet been set into 'reconnect' state.

These were the more or less deep essentials of an AlwaysOn failover on the SQL Server side and the SAP side. In the next part, we will talk about how to monitor an AlwaysOn configuration. Compared to database mirroring, we introduced larger changes in the monitoring space for AlwaysOn which need some deeper explanation.