AlwaysOn - HADRON Learning Series - Automatic Page Repair Increases SQL Server High Availability (HA) Capabilities

WARNING: This series is based on pre-release software, so things could change, but I will attempt to provide you with the best information I can!

The Enterprise edition of SQL Server has provided page-level repair for quite some time now.

For those not familiar with page-level repair, allow me to provide a brief overview.

If a page is found to be damaged at runtime (error 823 or 824, for example a checksum failure), the page is marked suspect and added to the suspect page table. All attempts to access this page return an error, but the rest of the database remains usable. The DBA can then restore a series of backups with the page repair option. The page is recovered from the backups and returned to the database for use, online. This restore sequence can be a daunting task, but it is a nice way to avoid a full restore.
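To make the restore sequence more concrete, here is a minimal Python sketch that only builds the T-SQL statements for a manual online page restore as strings; it does not connect to a server. The function name, the database name, and the backup file names are hypothetical, and the ordering (page restore from the full backup, then the existing log backups, then a fresh log backup restored WITH RECOVERY) follows my understanding of the usual sequence, so treat it as an illustration rather than a ready-made script.

```python
# Query for the suspect page table kept in msdb.
SUSPECT_PAGES_QUERY = (
    "SELECT database_id, file_id, page_id, event_type, error_count, "
    "last_update_date FROM msdb.dbo.suspect_pages;"
)


def build_page_restore_script(database, pages, full_backup, log_backups, tail_backup):
    """Return the T-SQL statements for a page restore, in the order to run them.

    All argument values are illustrative; 'pages' is a list of 'file:page' ids.
    """
    page_list = ", ".join(pages)  # e.g. "1:57, 1:202"
    steps = [
        # 1. Restore only the damaged pages from the full backup.
        f"RESTORE DATABASE [{database}] PAGE = '{page_list}' "
        f"FROM DISK = N'{full_backup}' WITH NORECOVERY;"
    ]
    # 2. Roll the pages forward through each existing log backup.
    for log_backup in log_backups:
        steps.append(
            f"RESTORE LOG [{database}] FROM DISK = N'{log_backup}' WITH NORECOVERY;"
        )
    # 3. Back up the current log so the pages can reach the present LSN.
    steps.append(f"BACKUP LOG [{database}] TO DISK = N'{tail_backup}';")
    # 4. Restore that final log backup and bring the pages back online.
    steps.append(
        f"RESTORE LOG [{database}] FROM DISK = N'{tail_backup}' WITH RECOVERY;"
    )
    return steps
```

Calling build_page_restore_script("Sales", ["1:57"], "full.bak", ["log1.trn"], "tail.trn") returns four statements. Even for a single page the DBA has to line up several backups in the right order, which is why the sequence can feel daunting.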

To help you with this type of restore, SQL Server Denali is adding a 'Restore Assistant' that helps you build the various restore steps. For example, when you elect to restore a database in Management Studio (SSMS), you can select the timeline option and use the sliders.

AlwaysOn provides a similar capability, performed automatically and online, to retain the high availability of your database.

Page Damaged On Primary

  1. When a page becomes damaged (823, 824) on the primary, access to the page is prevented. Attempts to access the page result in an error.
  2. A broadcast is made to all secondary replicas asking for a copy of the page at the primary's current LSN.
  3. When a secondary has redone all the log records needed to catch up to the primary, it sends an image of the page to the primary.
  4. The first image received by the primary replaces the damaged image, and responses from the other secondaries are then ignored.

This is essentially the behavior of page restore. The secondary restores the page to the proper LSN level and responds to the primary with a valid copy. The repair time is based on the time it takes for a secondary to complete the redo to catch up to the primary.
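The steps above can be sketched as a small Python simulation. It is only a model of the described behavior, not SQL Server's implementation: the Secondary class, the redo_rate field, and the notion of a "tick" are assumptions made up for illustration. It shows that the repair time is the catch-up time of whichever secondary answers first, and that later answers are discarded.

```python
import math
from dataclasses import dataclass


@dataclass
class Secondary:
    name: str
    redo_lsn: int   # LSN that redo has reached on this replica
    redo_rate: int  # LSNs redone per tick (hypothetical unit of time)


def ticks_to_catch_up(secondary, primary_lsn):
    """Ticks until this secondary's redo reaches the primary's current LSN."""
    behind = max(0, primary_lsn - secondary.redo_lsn)
    return math.ceil(behind / secondary.redo_rate)


def repair_primary_page(primary_lsn, secondaries):
    """Broadcast the page request and accept the first page image returned.

    Returns (winning replica, ticks until repair, replicas whose reply is ignored).
    """
    # Every secondary gets the request; each can reply only once caught up.
    replies = sorted(
        (ticks_to_catch_up(s, primary_lsn), s.name) for s in secondaries
    )
    ticks, winner = replies[0]                   # first image replaces the damaged page
    ignored = [name for _, name in replies[1:]]  # later responses are dropped
    return winner, ticks, ignored
```

For example, with the primary at LSN 1000, a secondary at LSN 900 redoing 50 LSNs per tick needs 2 ticks, while one at LSN 980 redoing 20 per tick needs 1 tick, so the second replica supplies the page and the first reply is ignored.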

Page Damaged On Secondary

  1. When a page becomes damaged (823, 824) on the secondary, access to the page is prevented. Attempts to access the page result in an error.
  2. A request is made to the primary for the current copy of the page.
  3. The primary responds with the current copy of the page. The secondary holds this copy in memory and allows only redo to access it.
  4. Redo continues on the secondary and applies the log records.
  5. When redo reaches the LSN of the page image that was retrieved from the primary, the page can again be used by all workers.

This is essentially the behavior of page restore. The secondary restores the page and makes sure redo has advanced far enough in the log records to match the image. When redo reaches the same LSN level, the database is transactionally consistent and access to the page can again be granted.
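The gating rule on the secondary can be sketched the same way. This is a minimal Python model under stated assumptions: the RepairedPage class and the redo_until_page_usable function are hypothetical names, and the log is reduced to a list of increasing LSNs. It shows only the rule described above: the page image from the primary stays redo-only until redo on the secondary reaches the LSN of that image.

```python
class RepairedPage:
    """Page image received from the primary and held in memory on the secondary."""

    def __init__(self, page_lsn):
        self.page_lsn = page_lsn  # LSN of the image sent by the primary
        self.redo_only = True     # only the redo thread may touch it for now


def redo_until_page_usable(redo_lsn, log_lsns, page):
    """Advance redo through the log; release the page once redo reaches its LSN.

    Returns the LSN at which redo stopped in this simulation.
    """
    for lsn in log_lsns:
        redo_lsn = lsn  # redo has applied the record at this LSN
        if redo_lsn >= page.page_lsn:
            # The rest of the database now matches the image, so all
            # workers may use the page again.
            page.redo_only = False
            break
    return redo_lsn
```

With an image at LSN 120 and redo starting at LSN 100, processing records 105, 110, and 120 releases the page at LSN 120. If the log received so far ends at 110, the page stays redo-only until more log arrives.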

Bob Dorr - Principal SQL Server Escalation Engineer