Reviewed by: Mike Weiner, Murshed Zaman
A fundamental part of ensuring application resiliency to failures is being able to tell if the application database(s) are available at any given point in time. Synthetic monitoring is the method often used to implement an overall application health check, which includes a database availability check. A synthetic application transaction, if implemented properly, will test the functionality, availability, and performance of all components of the application stack. The topic of this post, however, is relatively narrow: we are focused on checking database availability specifically, leaving the detection of functional and performance issues out of scope.
Customers who are new to implementing synthetic monitoring may choose to check database availability simply by attempting to open a connection to the database, on the assumption that the database is available if the connection can be opened successfully. However, this is not a fully reliable method – there are many scenarios where a connection can be opened successfully, yet be unusable for the application workload, rendering the database effectively unavailable. For example, the SQL Server instance may be severely resource constrained, required database objects and/or permissions may be missing, etc.
An improvement over simply opening a connection is actually executing a query against the database. However, a common pitfall with this approach is that a read (SELECT) query is used. This may initially sound like a good idea – after all, we do not want to change the state of the database just because we are running a synthetic transaction to check database availability. However, a read query does not detect a large class of availability issues; specifically, it does not tell us whether the database is writeable. A database can be readable, but not writeable for many reasons, including being out of disk space, having incorrectly connected to a read-only replica, using a storage subsystem that went offline but appears to be online due to reads from cache, etc. In all those cases, a read query would succeed, yet the database would be at least partially unavailable.
Therefore, a robust synthetic transaction to check database availability must include both a read and a write. To ensure that the storage subsystem is available, the write must not be cached, and must be written through to storage. As a DBMS implementing ACID properties, SQL Server guarantees that any write transaction is durable, i.e. that the data is fully persisted (written through) to storage when the transaction is committed. There is, however, an important exception to this rule. Starting with SQL Server 2014 (and applicable to Azure SQL Database as well), there is an option to enable delayed transaction durability, either at the transaction level, or at the database level. Delayed durability can improve transaction throughput by not writing to the transaction log while committing every transaction. Transactions are written to log eventually, in batches. This option effectively trades off data durability for performance, and may be useful in contexts where a durability guarantee is not required, e.g. when processing transient data available elsewhere in case of a crash.
This means that in the context of database availability check, we need to ensure that the transaction actually completes a write in the storage subsystem, whether or not delayed durability is enabled. SQL Server provides exactly that functionality in the form of sys.sp_flush_log stored procedure.
As an example that puts it all together, below is sample code to implement a database availability check.
First, as a one-time operation, we create a helper table named AvailabilityCheck (constrained to have at most one row), and a stored procedure named spCheckDbAvailability.
CREATE TABLE dbo.AvailabilityCheck ( AvailabilityIndicator bit NOT NULL CONSTRAINT DF_AvailabilityCheck_AvailabilityIndicator DEFAULT (1), CONSTRAINT PK_AvailabilityCheck PRIMARY KEY (AvailabilityIndicator), CONSTRAINT CK_AvailabilityCheck_AvailabilityIndicator CHECK (AvailabilityIndicator = 1), ); GO CREATE PROCEDURE dbo.spCheckDbAvailability AS SET XACT_ABORT, NOCOUNT ON; BEGIN TRANSACTION; INSERT INTO dbo.AvailabilityCheck (AvailabilityIndicator) DEFAULT VALUES; EXEC sys.sp_flush_log; SELECT AvailabilityIndicator FROM dbo.AvailabilityCheck; ROLLBACK;
To check the availability of the database, the application executes the spCheckDbAvailability stored procedure. This starts a transaction, inserts a row into the AvailabilityCheck table, flushes the data to the transaction log to ensure that the write is persisted to disk even if delayed durability is enabled, explicitly reads the inserted row, and then rolls back the transaction, to avoid accumulating unnecessary synthetic transaction data in the database. The database is available if the stored procedure completes successfully, and returns a single row with the value 1 in the single column.
Note that an execution of sp_flush_log procedure is scoped to the entire database. Executing this stored procedure will flush log buffers for all sessions that are currently writing to the database and have uncommitted transactions, or are running with delayed durability enabled and have committed transactions not yet flushed to storage. The assumption here is that the availability check is executed relatively infrequently, e.g. every 30-60 seconds, therefore the potential performance impact from an occasional extra log flush is minimal.
As a test, we created a new database, and placed its data and log files on a removable USB drive (not a good idea for anything other than a test). For the initial test, we created the table and the stored procedure as they appear in the code above, but with the call to sp_flush_log commented out. Then we pulled out the USB drive, and executed the stored procedure. It completed successfully and returned 1, even though the storage subsystem was actually offline.
For the next test (after plugging the drive back in and making the database available), we altered the procedure to include the sp_flush_log call, pulled out the drive, and executed the procedure. As expected, it failed right away with the following errors:
Msg 9001, Level 21, State 4, Procedure sp_flush_log, Line 1 [Batch Start Line 26] The log for database 'DB1' is not available. Check the event log for related error messages. Resolve any errors and restart the database. Msg 9001, Level 21, State 5, Line 27 The log for database 'DB1' is not available. Check the event log for related error messages. Resolve any errors and restart the database. Msg 3314, Level 21, State 3, Line 27 During undoing of a logged operation in database 'DB1', an error occurred at log record ID (34:389:6). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database. Msg 3314, Level 21, State 5, Line 27 During undoing of a logged operation in database 'DB1', an error occurred at log record ID (34:389:5). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database. Msg 596, Level 21, State 1, Line 26 Cannot continue the execution because the session is in the kill state. Msg 0, Level 20, State 0, Line 26 A severe error occurred on the current command. The results, if any, should be discarded.
To summarize, we described several commonly used ways to implement database availability check from the application tier, and shown why some of these approaches are not fully reliable. We then described a more comprehensive check, and provided sample implementation code.