Log Reader Agent Fails when the Sql server instance is failed over to Node 2

We have observed a number of case where we use sql server replication on a clustered instance of sql server and when we try to failover from Node 1 to Node 2 the Log Reader agent may fail to start with the following error

Error messages:

The process could not execute 'sp_MSpub_adjust_identity' on 'CRPSCSMSQ69V1\PUB'. (Source: MSSQL_REPL, Error number: MSSQL_REPL20011)

Get help: <<https://help/MSSQL_REPL20011>>

An error occurred during decryption. (Source: MSSQLServer, Error number: 15466)

Get help: <<https://help/15466>>

The process could not execute 'sp_MSpub_adjust_identity' on 'CRPSCSMSQ69V1\PUB' . (Source: MSSQL_REPL, Error number: MSSQL_REPL22037)

Get help: <<https://help/MSSQL_REPL22037>>

As seen above the root cause of the issue is the error

An error occurred during decryption”

Which mean when the sql server instance fails over to Node 2 it is not able to decrypt the stored procedure sp_MSpub_adjust_identity due to which it is not able to execute the stored procedure sp_MSpub_adjust_identity which in turn causes Log Reader Agent to fail on Node 2. It was not able to decrypt the stored procedure as the Service Master Key on Node 2 was different than the Service Master Key on Node 1.

The Service Master Key is the root of the SQL Server encryption hierarchy. It is generated automatically the first time it is needed to encrypt another key. By default, the Service Master Key is encrypted using the Windows data protection API and using the local machine key. The Service Master Key can only be opened by the Windows service account under which it was created or by a principal with access to both the service account name and its password. Regenerating or restoring the Service Master Key involves decrypting and re-encrypting the complete encryption hierarchy.

Service Master Key automatically encrypts all secure system data such as:
Database Master Keys
Linked Server passwords
Provider strings stored in sysservers
Credentials Secrets

So we can resolve the issue in following ways

Solution 1

========

One of the ways to resolve the above error is to backup the service master key from Node 1 and restore the same on Node 2

BACKUP SERVICE MASTER KEY TO FILE = 'path_to_file' ENCRYPTION BY PASSWORD = 'password' ---------------- (Use this command to backup the service master key on Node 1 )

RESTORE SERVICE MASTER KEY FROM FILE = 'path_to_file' --------------------------- (Use this command to restore the service master key on Node 2)
DECRYPTION BY PASSWORD = 'password' [FORCE]

Solution 2

========

Another solution is to regenerate the service master key on Node 2 with Force option using the following command

ALTER SERVICE MASTER KEY REGENERATE FORCE

This solution should be used with precaution because in this case when you fail back to Node 1 the Log Reader agent will fail again with the same error.
Again, when we regenerate the Service Master Key forcibly, the Linked servers on the server can fail as the password for the security logins cannot be decrypted and hence we may have to recreate the Linked servers or change the password for the Linked servers which can be tedious and undesirable.

Solution 3

========

The above issue occurs since the service account of the clustered instance of sql server was changed using the Windows services console (services.msc). For a clustered instance of the sql server we recommend changing the service account using the sql server configuration manager because it decrypts all the secure system data using the old service master key, generates a new service mastery key based on the new startup account and re-encrypt all the system data with the new service master key. It also checkpoints all the information in the quorum so that the changes gets replicated to Node 2 when the sql server instance is failed over to Node 2. Hence the above error does not occur when the sql server startup account is changed using sql server configuration manager. When we use sql server configuration manager to change the service account for sql server we are sure that the same account will be used as sql server startup account on all the nodes of the cluster.

So to resolve the above error we need to change the Service Account to another account using the Configuration Manager on Node 1 and try to failover on Node 2 and check whether the Log Reader Agent worked fine. We can then change the Service Account back to the original account using the Configuration manager and try to failover again and the replication should work fine.

We can refer to the following links to determine the what type of service accounts should be used and how to change the service account for Sql server instance in a cluster

Reference

=========

https://msdn.microsoft.com/en-us/library/ms345578.aspx

https://support.microsoft.com/kb/254321/

https://support.microsoft.com/kb/239885/

Parikshit Savjani
SE, Microsoft SQL Server

Reviewed By Anurag Sharma, SQL Server Escalation Services