With the release of SharePoint 2013, Microsoft introduced a new platform for workflows called Workflow Manager (WFM). As of this writing, the current version is 1.0 Cumulative Update 3. Unfortunately, disaster recovery (DR) for this product is not as straightforward as just setting up database replication.
Following is a list of resources I’ve used to implement disaster recovery:
I found that each of the above references holds vital clues to making DR for WFM work, but none of them covered the details on which I was stumbling. There are two basic areas where I needed to do additional research:
- Certificates (which ones to use where and how to restore effectively)
- Changing service accounts and admin groups upon a failover
As pointed out, there are plenty of TechNet articles and blogs that cover WFM disaster recovery, so I am not going into detail on the individual steps. Instead, I decided to document my discoveries in the hope that others can benefit from my experience.
So, at a high level, the basic operation is as follows. I’ll have sections below describing each of the areas where I had concerns:
- Install production WFM and configure
- Configure your backup/replication strategy for the WF/SB databases
- Install WFM in DR
- Execute the failover process
- Re-connect SharePoint 2013
- (Optional) Changing RunAsAccount and AdminGroup
Install Production WFM and Configure
Certificates – AutoGenerate or custom Certs?
Installing WFM 1.0 CU3 is fairly well documented in several places, but the one piece that I feel needs to be called out is certificate configuration. You can auto-generate your certificates (self-signed), use your own domain certificates, or use certificates acquired from a third-party certificate authority. Some businesses have no restrictions against self-signed certificates, but choosing them will affect how you restore service in the DR environment. As noted in Spencer’s blog, there are a total of six or seven possible certificates. Auto-generating your WFM certificates will dictate your restoration process in a failover scenario. One reason for this is that the WorkflowOutbound certificate is created with private keys, but they are non-exportable.
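To see which certificates on a production node can actually be moved to DR, a quick inventory helps. The following is a minimal sketch, assuming Windows PowerShell running elevated on a WFM node and certificates stored under the machine's Personal store. The exportability check only works for keys held in a legacy CSP, so treat an empty or 'unknown' result as "check manually" rather than as a definitive answer.

```powershell
# List certificates in LocalMachine\Personal and report whether each
# private key is marked exportable (assumption: keys are CSP-based).
Get-ChildItem Cert:\LocalMachine\My | ForEach-Object {
    $exportable = $null
    if ($_.HasPrivateKey) {
        try {
            # CNG-backed keys may return nothing or throw here
            $exportable = $_.PrivateKey.CspKeyContainerInfo.Exportable
        } catch {
            $exportable = 'unknown'
        }
    }
    [pscustomobject]@{
        Subject       = $_.Subject
        Thumbprint    = $_.Thumbprint
        HasPrivateKey = $_.HasPrivateKey
        Exportable    = $exportable
    }
} | Format-Table -AutoSize
```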
Configure Your Backup/Replication Strategy for the WF/SB Databases
The key to disaster recovery with WFM (as with many products) is the data store. In this case we are referring to the SQL Server databases. Again, this information is in the related links and there are two things to keep in mind:
- You can use pretty much any replication method – backup/restore, mirroring, log shipping -- except for SQL Server 2012 AlwaysOn, which is unsupported at this time. It is also crucially important to back up the WF/SB databases as close in time as possible to the content databases in order to preserve WF instance integrity.
UPDATE: With the release of Workflow Manager CU 4, SQL AlwaysOn is now supported and should be considered as the High Availability/Disaster Recovery solution. You can find information on CU4 here. And you can find installation information here.
- You do not need to back up the management databases, WFManagementDb and SBManagementDb, as they will be re-created during the recovery process.
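If you go the plain backup/restore route, the backups can be scripted so that all WF/SB databases are captured in one pass. This is only a sketch under several assumptions: the SqlServer (or older SQLPS) PowerShell module is installed, the instance is named prod_SQL as in the failover section, the databases use the default names, there is a single message container, and C:\Backups is a hypothetical target folder.

```powershell
# Assumption: the SqlServer module is available (use SQLPS on older installs)
Import-Module SqlServer

$instance  = 'prod_SQL'        # production SQL instance
$backupDir = 'C:\Backups'      # hypothetical backup folder
$stamp     = Get-Date -Format 'yyyyMMdd-HHmmss'

# Default database names; adjust to match your farm.
# The management databases are intentionally left out.
$databases = @(
    'WFResourceManagementDb',
    'WFInstanceManagementDb',
    'SBGatewayDatabase',
    'SBMessageContainer01'
)

foreach ($db in $databases) {
    $file = Join-Path $backupDir "$db-$stamp.bak"
    Backup-SqlDatabase -ServerInstance $instance -Database $db -BackupFile $file
}
```

Schedule this alongside the SharePoint content database backups so the two sets stay as close in time as possible.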
Install WFM in DR
Depending on whether you want a cold or warm standby WFM farm, you will either have already installed the servers or will perform this as part of your recovery process. NOTE: WFM does *not* support a hot standby configuration. There are a couple of keys to your DR installation:
- You will install the bits on the DR app servers, but you will *not* configure the product at this time.
- If you are choosing to do a warm standby, then you may also import the necessary certificates ahead of time.
- If you are using:
  - Auto-generated certificates, then you need to export/import the Service Bus certificates from Prod to DR. The Workflow Manager certificates can be auto-generated again in DR (remember that you cannot import/export the WF certificates because the private keys are marked as non-exportable).
  - Custom domain certificates, then you will export/import all of them from Prod to DR.
- The Service Bus root certificate should be imported into the LocalMachine\Trusted Root Certification Authorities store (Cert:\LocalMachine\Root in PowerShell).
- The other Service Bus certs should be imported into the LocalMachine\Personal store (Cert:\LocalMachine\My in PowerShell).
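The export/import of the Service Bus certificates can be done with the built-in PKI cmdlets (Windows Server 2012 or later). The sketch below assumes the private keys are exportable, and every thumbprint and file path in it is a placeholder. It exports both certificates as PFX files, which is a superset of what DR may need, so adjust it if your security policy only allows the public part of the root certificate to leave the server.

```powershell
# --- On a production WFM node ---
$pfxPassword    = Read-Host -AsSecureString -Prompt 'PFX password'
$rootThumbprint = '<thumbprint of the Service Bus root certificate>'   # placeholder
$farmThumbprint = '<thumbprint of the Service Bus farm certificate>'   # placeholder

Export-PfxCertificate -Cert "Cert:\LocalMachine\Root\$rootThumbprint" `
    -FilePath 'C:\certs\SBRoot.pfx' -Password $pfxPassword
Export-PfxCertificate -Cert "Cert:\LocalMachine\My\$farmThumbprint" `
    -FilePath 'C:\certs\SBFarm.pfx' -Password $pfxPassword

# --- On each DR WFM node, after copying the files across ---
$pfxPassword = Read-Host -AsSecureString -Prompt 'PFX password'

# Root certificate goes to Trusted Root Certification Authorities
Import-PfxCertificate -FilePath 'C:\certs\SBRoot.pfx' `
    -CertStoreLocation Cert:\LocalMachine\Root -Password $pfxPassword

# Remaining Service Bus certificates go to Personal
Import-PfxCertificate -FilePath 'C:\certs\SBFarm.pfx' `
    -CertStoreLocation Cert:\LocalMachine\My -Password $pfxPassword
```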
Executing the Failover Process
In the event of a disaster (or just a need to fail over), the following process is required.
- Restore the 4+ SQL databases (WFResourceManagementDb, WFInstanceManagementDb, SBGatewayDatabase, SBMessageContainer01 – n) from prod_SQL to dr_SQL.
- Assuming the steps above have been followed to install WFM in DR, you then need to use PowerShell to restore the SB farm. If you are doing a true ‘cold standby’, then you first need to install (but not configure) the SB/WF bits from Web Platform Installer.
- Restore the SBFarm, SBGateway, and MessageContainer databases and settings (do this on only one WFM node)
- The SBManagementDB will be created in DR during this ‘restore’ process
- The RunAsAccount *must* be the same as the credentials used in production
- Again, using PowerShell, run Add-SBHost on each node of the farm.
- If you used auto-generated certificates for the WFFarm in prod, then when you restore the WFFarm you will auto-generate new ones. However, this also means that you may need to restore the PrimarySymmetricKey to the new SBNamespace.
- At this point, restore the WFFarm using PowerShell (do this on only one WFM node)
- Run Add-WFHost on each node of the farm.
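Put together, the steps above might look roughly like the following. This is a sketch rather than a tested runbook: the cmdlet and parameter names are as I recall them from the Service Bus and Workflow Manager 1.0 documentation, so confirm them with Get-Help before relying on it. The service account, thumbprint, symmetric key, and sync time are all placeholders, and the helper function New-ConnStr is hypothetical.

```powershell
# Run in the Workflow Manager PowerShell console. Steps marked "one node"
# are run once; Add-SBHost and Add-WFHost are run on every DR node.
$sql        = 'dr_SQL'
$runAs      = 'CONTOSO\svc-wfm'   # placeholder; must match production
$runAsPwd   = Read-Host -AsSecureString -Prompt 'RunAs password'
$autoGenKey = Read-Host -AsSecureString -Prompt 'Certificate auto-generation key'
$sbThumb    = '<thumbprint of imported Service Bus certificate>'   # placeholder

# Hypothetical helper that builds a connection string for a database
function New-ConnStr([string]$Database) {
    "Data Source=$sql;Initial Catalog=$Database;Integrated Security=True"
}
$sbFarmDb = New-ConnStr 'SBManagementDB'   # created by the restore
$wfFarmDb = New-ConnStr 'WFManagementDb'   # created by the restore

# 1. Service Bus farm, gateway, and message container (one node)
$sbFarm = @{
    RunAsAccount                    = $runAs
    SBFarmDBConnectionString        = $sbFarmDb
    GatewayDBConnectionString       = New-ConnStr 'SBGatewayDatabase'
    FarmCertificateThumbprint       = $sbThumb
    EncryptionCertificateThumbprint = $sbThumb
}
Restore-SBFarm @sbFarm
Restore-SBGateway -GatewayDBConnectionString (New-ConnStr 'SBGatewayDatabase') `
    -SBFarmDBConnectionString $sbFarmDb
Restore-SBMessageContainer -Id 1 -SBFarmDBConnectionString $sbFarmDb `
    -ContainerDBConnectionString (New-ConnStr 'SBMessageContainer01')

# 2. Join each node to the Service Bus farm (every node)
Add-SBHost -SBFarmDBConnectionString $sbFarmDb -RunAsPassword $runAsPwd `
    -EnableFirewallRules $true

# 3. If needed, put back the production key, recorded beforehand with
#    (Get-SBNamespace -Name 'WorkflowDefaultNamespace').PrimarySymmetricKey
$prodKey = '<PrimarySymmetricKey from production>'   # placeholder
Set-SBNamespace -Name 'WorkflowDefaultNamespace' -PrimarySymmetricKey $prodKey

# 4. Workflow farm with newly auto-generated certificates (one node)
$wfFarm = @{
    RunAsAccount                 = $runAs
    WFFarmDBConnectionString     = $wfFarmDb
    InstanceDBConnectionString   = New-ConnStr 'WFInstanceManagementDb'
    ResourceDBConnectionString   = New-ConnStr 'WFResourceManagementDb'
    InstanceStateSyncTime        = [datetime]'2014-01-01T00:00:00'   # placeholder: time of the restored backup
    CertificateAutoGenerationKey = $autoGenKey
}
Restore-WFFarm @wfFarm

# 5. Join each node to the Workflow farm (every node)
$sbClient = Get-SBClientConfiguration -Namespaces 'WorkflowDefaultNamespace'
Add-WFHost -WFFarmDBConnectionString $wfFarmDb -RunAsPassword $runAsPwd `
    -SBClientConfiguration $sbClient -EnableFirewallRules $true `
    -CertificateAutoGenerationKey $autoGenKey
```

If the farm has more than one message container, repeat Restore-SBMessageContainer with the matching Id for each one. For custom domain certificates on the WF side, the thumbprint parameters of Restore-WFFarm would replace the auto-generation key.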
At this point, the new WF Farm should be in a working state. You can test this by navigating to the endpoint in a browser and you should receive output similar to the image below:
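The same check can also be scripted from a DR node. This is a minimal sketch, assuming the default HTTPS port 12290, a hypothetical host name, and an account that is allowed to query the farm. Get-SBFarmStatus and Get-WFFarmStatus are run from the Workflow Manager PowerShell console.

```powershell
# Hypothetical DR endpoint; 12290 is the default WFM HTTPS port
$wfEndpoint = 'https://wfm-dr.contoso.com:12290/'

# Request the endpoint with the current Windows credentials
$response = Invoke-WebRequest -Uri $wfEndpoint -UseDefaultCredentials -UseBasicParsing
$response.StatusCode     # expect 200 when the farm is serving requests
$response.Content        # body returned by the endpoint

# Service-level view of both farms
Get-SBFarmStatus
Get-WFFarmStatus
```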
Re-connect SharePoint 2013
If WF certificates were re-generated in DR, then you will need to recreate the SharePoint Trusted Root Authority. Export the WF SSL certificate and add it to the SharePoint farm using New-SPTrustedRootAuthority.
Create a new registration to the Workflow farm using Register-SPWorkflowService.
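On a SharePoint server, these two steps might look like the sketch below. The certificate path, trust name, site URL, and workflow host URL are all hypothetical values; -Force is used because a registration to the old farm already exists.

```powershell
# Run in the SharePoint 2013 Management Shell (or load the snap-in first)
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# WF SSL certificate exported from the DR Workflow Manager node (hypothetical path)
$certPath = 'C:\certs\WFSsl.cer'
$cert = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2($certPath)

# Trust the regenerated certificate in the SharePoint farm
New-SPTrustedRootAuthority -Name 'WorkflowManagerDR' -Certificate $cert

# Point SharePoint at the DR workflow farm, replacing the old registration
Register-SPWorkflowService -SPSite 'https://sharepoint.contoso.com' `
    -WorkflowHostUri 'https://wfm-dr.contoso.com:12290' -Force
```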
There is a cache of security trusts, so in order to see the change more immediately you will likely need to execute the timer job “Refresh Trusted Security Token Services Metadata feed” with the following PowerShell:
Start-SPTimerJob -Identity 'RefreshMetadataFeed'
(Optional) Changing RunAsAccount and AdminGroup
<this part is ‘coming soon’>
The process above should work in most (if not all) scenarios, but I welcome any comments if you encounter problems or challenges. I’ve spent many hours on this, off and on, over the past 6 months, and it’s very possible that I’ve missed something.
I’ll add the last section about changing service accounts once I have the complete set of steps for WF accounts. Service Bus added PowerShell cmdlets, which makes this easier, but Workflow Manager has not yet done so.