The Zero Downtime SharePoint Patching Myth

Unfortunately it is not possible to update/patch SharePoint without incurring some amount of downtime. So the only option available to us is to minimize that downtime.

I think of downtime in two ways: 'not available' and 'reduced functionality'. The second is obviously more closely aligned with zero downtime; however, for a large farm it is logistically difficult to achieve. Reduced functionality means providing your users with a read-only farm while the primary farm is being patched. This requires traffic for the entire primary farm to be swung to another farm whose content DBs are set to read-only mode. Once patching on the primary farm is complete, traffic is redirected back to the primary farm.

Since we always want to keep our downtime window as small as possible, we should always follow a good, well-tested upgrade practice to ensure there are no surprises along the way that will extend our downtime. There are two phases of upgrade: laying down updated binaries and running psconfig. The second phase, running psconfig, is the one that takes the majority of the time, and the time taken is directly proportional to the number of site collections within the farm. Psconfig upgrades each content DB schema, and in many installations this can take many hours to complete. We have found in our testing in many real-world deployments that detaching all content DBs other than the CA content DB, running psconfig, and then attaching the content DBs back can help reduce the amount of downtime needed when patching SharePoint. One myth needs to be dispelled now: the reduction in downtime is not achieved because psconfig runs faster or gets better throughput when upgrading the content DBs. It comes from controlling the order in which content comes back online.

As with most things in SharePoint there are rules around the DB Attach process:
1. Only one Content DB can be attached to a farm at any one time.
2. Once a Content DB has been attached to a farm all of its content is marked as updated and therefore will incur what is effectively a full crawl the first time search crawls this DB. More on this later.

So, taking rule #1 into account, there are two ways we can optimize this process to reduce downtime: 1) Prioritize: attach the content DBs that belong to Web Applications that are highly sensitive to downtime first and make them available, and 2) Use surrogates: build out additional worker farms that act as surrogates to host attaching content DBs.
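To make the prioritization idea concrete, here is a minimal sketch of the arithmetic. The database names, web application names, and attach durations are all made up for illustration; real attach times would have to come from your own test upgrades.

```python
from dataclasses import dataclass

@dataclass
class ContentDb:
    name: str            # hypothetical content DB name
    web_app: str         # hypothetical web application it belongs to
    attach_minutes: int  # hypothetical estimate from a test upgrade

def online_times(dbs):
    """Minutes until each web app has all of its DBs attached,
    when DBs are attached one at a time in the given order."""
    elapsed = 0
    finished = {}
    for db in dbs:
        elapsed += db.attach_minutes
        finished[db.web_app] = elapsed  # last DB of a web app sets its time
    return finished

dbs = [
    ContentDb("WSS_Content_Archive", "archive", 120),
    ContentDb("WSS_Content_Portal", "portal", 30),
    ContentDb("WSS_Content_Teams", "teams", 60),
]

# Lower number = more sensitive to downtime (an assumption for this example).
priority = {"portal": 0, "teams": 1, "archive": 2}
prioritized = sorted(dbs, key=lambda d: priority[d.web_app])

print("unordered:  ", online_times(dbs))
print("prioritized:", online_times(prioritized))
```

With these numbers both orders finish after 210 minutes, but the portal is back after 30 minutes instead of 150. The total work is unchanged; only the order in which web applications return changes.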

The process of using surrogate farms involves building one or more single-server, throwaway farms that are running the same patch level you are upgrading your primary farm to. This process has been well documented here. This approach has downsides, however, such as the need for additional hardware, the additional time and effort to build out these farms, and the need for your SQL Server to be able to handle the additional load.

 Let's take a look at the steps of a typical upgrade with DB prioritization:

0. Announce and coordinate downtime with IT, users, etc.
1. Take the farm offline. Typically you are pulling the WFEs out of a load balancer or, in the case of reduced functionality, swinging DNS settings to another read-only farm replica.
2. Detach all content DBs from the farm.
3. Run the WSS and, if applicable, MOSS upgrade patches on each server, choosing not to run psconfig. You are only going to run psconfig once, not once for each upgrade package.
4. After all servers have been patched, run psconfig on the CA machine. The execution of psconfig will not take nearly as long because all the content DBs are detached.
5. Run psconfig on each additional machine in the farm, one at a time.
6. At this point your farm, without any content DBs, is upgraded and only the content DBs require upgrade.
7. Starting with the web application that is most critical to get back into production, start attaching its content DB(s). Once complete, put this Web Application back in the load balancer and notify everyone it is back online.
8. Continue running through each additional content DB until each is attached back into the farm.
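The command-line side of steps 2 through 8 can be sketched as a small plan generator. This is only an illustration: it builds the command strings and does not run anything. The URLs and database names are hypothetical, and the stsadm and psconfig arguments shown are the commonly used forms, so check them against the documentation for your patch level before relying on them.

```python
def build_plan(web_apps):
    """Build an ordered list of command lines for a prioritized upgrade.

    web_apps: list of (url, [content DB names]), most critical first.
    Patching the binaries (step 3) is done by the installers and is
    not represented here.
    """
    plan = []
    # Step 2: detach every content DB before running psconfig.
    for url, db_names in web_apps:
        for db in db_names:
            plan.append(f"stsadm -o deletecontentdb -url {url} -databasename {db}")
    # Steps 4-5: run once per server, CA machine first.
    plan.append("psconfig -cmd upgrade -inplace b2b -wait")
    # Steps 7-8: attach in priority order; each attach upgrades that DB.
    for url, db_names in web_apps:
        for db in db_names:
            plan.append(f"stsadm -o addcontentdb -url {url} -databasename {db}")
    return plan

# Hypothetical farm layout, most critical web application first.
farm = [
    ("http://portal", ["WSS_Content_Portal"]),
    ("http://teams", ["WSS_Content_Teams1", "WSS_Content_Teams2"]),
]

for line in build_plan(farm):
    print(line)
```

Generating the plan as text first gives you something to review and rehearse against a test farm before the real downtime window.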

I have a tool that I will be releasing soon named CDBManager that will help with the DB prioritization method of upgrade, specifically it allows you to:

  • Mass DB detach all the content DBs in a farm
  • Reorder content DBs by priority
  • Automate the attaching of content DBs and get an ETA for when each DB will be complete (important because there is no progress indicator otherwise)
  • Manage the upgrade.log file by creating a separate upgrade.log file for each content DB attached. As you may know, each time a content DB is attached to a farm and upgraded, a new upgrade log is either created or, if one already exists, appended to. The problem with this approach is that all your upgrade logging is in a single file and not split out by content DB. CDBManager renames the upgrade.log file after each content DB has completed attaching, using the name of the content DB. This makes it much easier to go back through each log and analyze what might have gone wrong within a certain content DB on upgrade.
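The per-DB log idea is simple enough to sketch. This is not CDBManager's code, just a minimal illustration of the renaming step; the function name, the log directory parameter, and the upgrade_<DB name>.log naming pattern are assumptions made for the example.

```python
import os
import tempfile

def archive_upgrade_log(log_dir, db_name):
    """Rename upgrade.log to a per-DB file once that DB has finished attaching.

    Returns the new path, or None if no upgrade.log was found.
    """
    src = os.path.join(log_dir, "upgrade.log")
    if not os.path.exists(src):
        return None
    dst = os.path.join(log_dir, f"upgrade_{db_name}.log")
    # os.replace overwrites an existing file, so a retried attach of the
    # same DB replaces its earlier log in this sketch.
    os.replace(src, dst)
    return dst

# Demo against a scratch directory rather than the real SharePoint log folder.
with tempfile.TemporaryDirectory() as log_dir:
    with open(os.path.join(log_dir, "upgrade.log"), "w") as f:
        f.write("upgrade output for one content DB\n")
    new_path = archive_upgrade_log(log_dir, "WSS_Content_Portal")
    print(os.path.basename(new_path))
```

Because the file is moved away after each attach, the next attach starts with a fresh upgrade.log and the logs never mix.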

I have a couple of large enterprise customers that are testing the tool now. Once we get past any breaking issues I will publish the bits.

One additional point about the prioritization upgrade approach: should a content DB fail to attach along the way, for whatever reason, you should continue to DB Attach the remaining content DBs. While the upgrade progresses you have the opportunity to investigate and mitigate the issues with the failed content DB in parallel and, once ready, retry the DB attach operation.
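In script form, "keep going on failure" might look like the sketch below. The attach function is passed in as a parameter, so the real attach call is deliberately left out; the fake one used here, the DB names, and the error message are all hypothetical.

```python
def attach_all(db_names, attach):
    """Attach DBs one at a time in order, collecting failures instead of stopping.

    attach: a callable that takes a DB name and raises an exception on failure.
    Returns (attached, failed), where failed is a list of (name, error text).
    """
    attached, failed = [], []
    for name in db_names:
        try:
            attach(name)
            attached.append(name)
        except Exception as exc:
            # Record the failure and move on; it can be retried later.
            failed.append((name, str(exc)))
    return attached, failed

# Stand-in for the real attach operation, for demonstration only.
def fake_attach(name):
    if name == "WSS_Content_Teams":
        raise RuntimeError("simulated upgrade failure")

attached, failed = attach_all(
    ["WSS_Content_Portal", "WSS_Content_Teams", "WSS_Content_Archive"],
    fake_attach,
)
print("attached:", attached)
print("to retry:", [name for name, _ in failed])
```

The failed list doubles as the retry queue: once the issue is fixed, pass those names back through attach_all.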

So, time to revisit DB Attach rule #2. When a content DB is attached to a farm its ID is changed. This has the side effect of marking each object within the content DB as changed. This means that when the crawler service hits this content DB, whether doing an incremental or a full crawl, it will effectively do a full crawl because it believes all the content has changed. The Infrastructure Update (IU) changes all of this and effectively takes this rule out of play. After installing the IU, the content DB ID is no longer changed. This means an incremental crawl after re-attaching a content DB really is an incremental crawl. No more full crawls after detaching and attaching content DBs, yea!

The Infrastructure Update (IU) KB article also has this blurb:

 Improvements to the time that is required to update and upgrade Windows SharePoint Services sites.

So what does this mean? With any fix before the IU, psconfig updates each site collection in the farm by setting its build version to the most recent value. After installing the IU, we only touch a site collection if one of its schema objects needs to be updated, such as an update to the template schema. This type of update is far less frequent, since hotfixes rarely make site collection schema updates (the IU itself does, to support the new search features). The end result is that if a fix does not require a schema update, psconfig does not go through each site collection to update the build number, which drastically reduces the amount of time necessary to perform the upgrade.

So there you have it: while we cannot upgrade a live farm, we do have processes and available fixes that move us closer to the nirvana of a zero downtime upgrade.

Happy upgrading!