What did we learn using Database Mirroring over the last two years in our SAP ERP system. Second revision
Well, back from vacation and on travel again. In preparation for the SQL PASS conference where I had to do several presentations, I sat down with our Microsoft IT folks running our SAP landscape and discussed on what we learned the last two years running our SAP ERP system with synchronous Database Mirroring with failover. Here the essence of what I got told and what I presented at the conference as well.
Automatic failover worked great with the SAP ABAP stack. We had around half a dozen occurrences of unexpected failovers driven by hardware issues. Only a part of those were actual server issues. Some of those also were issues of the storage subsystem. Especially the problems in the storage subsystem would not have been compensated by Microsoft Cluster Services. Means we encountered situations where Database Mirroring really saved us from longer down times due and eventual data loss. Observing the behavior of the SAP work processes while failing over, we however got a bit surprised by the way this is done. The logic of the database independent part of SAP which needs to trigger reconnect attempts to the database works as follows:
· If the database connection is lost on a work process which currently runs a transaction or a request against the database, the transaction is rolled back on the ABAP side and the work process actively and immediately tries to reconnect. This is the phase where the SQL Server client library will re-direct new attempts to the Mirror server. Besides actively trying to reconnect, the work process also will set all other work processes in a reconnect state
· Work processes which have not been actively running a User request at the time the connection to the Principal server got lost, are not actively attempting a reconnect immediately. The reconnect is attempted as soon as the next request gets assigned to such a process. Either the process then has been set to the ‘reconnect status’ by another work process or the work process being the first one of the instance to recognize the vanished principle will attempt a reconnect based on the failure message when issuing the first database statement and then proceed according to the description above.
The reconnect mechanism of the SAP ABAP stack is documented well in OSS note #98051. The result of this reconnect mechanism might be confusing when you look into the two SQL Server 2005 DMVs sys.dm_exec_sessions or sys.dm_exec_connections. Immediately after the failover the number of connections will be very low and will then over the next minutes and hours increase. Could even be that some work processes will only reconnect after days.
On a second item, our IT stuff got a bit surprised about one of the edge cases of Database Mirroring. Imagine the following situation:
· The Principal server fails and all the SAP instances failover as described above. The ETA to get the Principal up and running again is in days and the transaction log space on the Mirror which now is acting as Principal is slowly getting filled up
· Due to the threat of running out of transaction log space on the Mirror (which now acts as Principal) it is decided to switch off Database Mirroring
· The SAP processes connected with the Mirror already are fine and continue to work without any problem.
· However as soon as a work process restarts or as one tries to transport corrections and changes into the system, the associated SAP processes now will report database connection problems and the fact that they are not able to connect to the database
· The connection problem is a result out of the following facts:
o The restarting SAP process or R3trans.exe will get the server definition out of unchanged profiles pointing to the Principal server first and to the Mirror as a FailoverPartner. Hence the processes try to connect to the Principal (which is out of service) and rely on SQL Server SNAC client to do its job of contacting the Mirror server in order to establish a connection
o However on the SQL Server side, the former Mirror which at this point does runs the load and which at this point in time is not mirroring the database anymore (since Database Mirroring got switched off) does not remember that it ever was in Mirroring relationship and does refuse a connection for the role as ‘FailoverPartner’. Means the SQL Server SNAC client’s attempts to connect to the Mirror as Failover-Partner is rejected.
This issue only so far only could be resolved by changing all the SAP profiles containing the definition of the database server (default.pfl and transport profiles) as well as the user environment used for the service startup and then restart the associated SAP instances.
With the Cumulative Update build #4 of SQL Server 2005 SP2, SQL Server will allow ‘FailoverPartner’ connections to the former Mirror with Database Mirroring being off. The KBA will be published around the time the CU4 is released and will have the #936179. This solution works the same way for all later versions and releases of SQL Server. Please be aware that you need to enable trace flag 1449 on the principal and mirror in order to have the solution working. The mentioning of the trace flag is a bit buried in the KBA.
On a third item, our IT staff needed to adapt their way in handling situations where DBM is getting suspended a bit after one particular situation. Imagine the following scenario:
· You want to roll out some patches or want to perform short hardware maintenance. For this purpose mirroring gets suspended
· Means the Principal now works alone with the Witness still up and running. Principal and Witness are in sync and aware of the fact that Mirroring is suspended
· Now somebody switches off the Witness server or stops the Witness instance. This means out of the environment of Principal, Mirror and Witness only the Principal is still active, but mirroring with automatic failover was not switched off. At this moment the Principal will close the mirrored database. This behavior is intended since the Principal now needs to assume that either the Principal instance or server has a problem (since it doesn’t see the Witness anymore) or that the Witness has a problem. But the Principal does not have any chance to check which of the conditions is true. Despite the fact that Mirroring is suspended, the Principal will take the step and close its mirrored database in such a situation
The change in handling and avoiding such a situation is a very easy one. Before mirroring is getting suspended, we now switch of the automatic failover option, by disabling the witness. This can be done via the Database Mirroring Wizard in SQL server Management Studio or by executing this command:
ALTER DATABASE <db_name> SET WITNESS = ‘OFF’
After the Witness is disabled, Database Mirroring can be suspended with the Principal database remaining open and accessible. Getting to the state of synchronous mirroring with failover, the steps to be done are:
· Resuming mirroring again
· After the synchronization finished, the witness can be declared again in the DBM Wizard or by an alter database statement. The way to monitor whether the synchronization is finished the following query can be run on the Principal or on the Mirror:
SELECT DB_NAME(DATABASE_ID), MIRRORING_STATE_DESC, MIRRORING_ROLE_DESC from sys.database_mirroring
This information can also be retrieved via SQL Management Studio: Status Synchronizing switched to Synchronized as soon as the databases are in sync again.
That is it for this time. Next time around SQL Server Database Mirroring, I’ll show sample scripts to keep the scheduled tasks working independent of the server which currently is active.