Yes, this has been covered already, but this problem caused me real headaches on the last CU upgrade I did for my test farms, and nothing I tried seemed to help. All the machines patched just fine, but when it came to finishing the upgrade with the psconfig/SharePoint wizard step, it collapsed on every machine, seemingly for no good reason.
Looking at the upgrade logs we see:
ERR Task configdb has failed with an unknown exception
ERR Exception: System.InvalidOperationException: Cannot start service SPAdmin on computer '.'. ---> System.ComponentModel.Win32Exception: The service did not respond to the start or control request in a timely fashion
--- End of inner exception stack trace ---
at System.ServiceProcess.ServiceController.Start(String args)
at Microsoft.SharePoint.Win32.SPAdvApi32.StartService(String strServiceName)
Nothing really interesting there; the SharePoint admin service just timed out while starting. However, starting it from services.msc gave the same inexplicable timeout, even after changing the registry to allow the service more time to start.
So to the point then; I’d seen from others this might’ve been an issue with network connectivity. Sure enough, when starting the service manually we see this network traffic coming from WSSADMIN.EXE:
In short, the service is sending TCP packets to two IP addresses (184.108.40.206 and 220.127.116.11), not getting anything back, and retrying. A quick look further up the conversation shows where these IPs came from; the preceding DNS query & response show us what’s really going on:
The IPs belong to “crl.microsoft.com”, which is a URL used for certificate revocation checks. For some time now, SPAdmin has tried to check with the Microsoft certificate authority for revoked Authenticode signatures as a security measure. This is normally good, but when the machine in question has no line of sight to crl.microsoft.com the check fails; in this case it causes a service timeout, which stops psconfig from finishing the upgrade & bringing the server back online after a patch. This can have the unintended effect of leaving a half-baked SharePoint farm, so it is very far from ideal.
But Why Does SPAdmin Fail to Start?
Pinging one of the IP addresses from the machine in question (18.104.22.168, for example) tells us why. We get a response from a router saying the address isn’t accessible: it has no route to that address. The target host therefore never replies at all, so the connection times out. As is normal, the connection is retried a few times, but all of this happens while the service is starting, so Windows is waiting for SPAdmin to say “I’m started!”, which it never gets round to doing in time.
Why the no-route? Well, just like many production environments, we’ve blocked these servers from getting any further than the internal router unless the traffic goes via a web proxy. Standard security stuff: anything not proxy-aware stops at the network boundary, and anything going through the proxy is scanned for security threats at a single point of control.
Anyway, I needed a quick solution to this problem, as I’m sure many others have too. As far as I’m aware CRL checking can be disabled in group policy (I’m not a domains engineer), but even if that’s true, all my machines were down and needed a quick & dirty hack to bring them back to life again. Even worse, this error was also happening on newly installed VMs that were just trying to join the farm.
The (Dirty) Solution
Add crl.microsoft.com to the hosts file of the failing machine, pointing it at 127.0.0.1. With that in place the service started fine and the upgrade could complete. This isn’t ideal and I don’t recommend it as a long-term solution, but it should work well enough to get you up & stumbling again. If for some reason this doesn’t work, try clearing the DNS cache and repeating. If that still doesn’t work, use the same network tracing process to find out which hosts the service is trying to check against, and add a loopback entry in the hosts file for each DNS name.
Either way, this hack works because the revocation check now reaches a target host (ourselves), so the request gets a concrete “I don’t know what you’re talking about” network response: we’re now hitting something rather than just a router. This means the start-up revocation check fails quickly instead of hanging while it waits for a reply from no-one, and SPAdmin can finally complete its start-up in a timely fashion.
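For anyone who has to apply this to several machines, the hosts-file change can be sketched in a few lines of Python. This is only an illustration: the function name add_loopback_entry is made up for this example, and it works on the text of a hosts file rather than touching the real one, so you can see what it would do before writing anything back.

```python
def add_loopback_entry(hosts_text, hostname, ip="127.0.0.1"):
    """Return hosts_text with an `ip hostname` line appended.

    If hostname is already mapped on a non-comment line, the text is
    returned unchanged. (Hypothetical helper, not part of any SharePoint
    or Windows tooling.)
    """
    wanted = hostname.lower()
    for line in hosts_text.splitlines():
        # Drop any trailing comment, then split into address + names.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and wanted in (name.lower() for name in fields[1:]):
            return hosts_text  # already mapped; leave the file alone
    # Make sure the new entry starts on its own line.
    separator = "" if hosts_text == "" or hosts_text.endswith("\n") else "\n"
    return f"{hosts_text}{separator}{ip}\t{hostname}\n"


# Demo on a sample string rather than the real file.
sample = "# sample hosts file\n127.0.0.1 localhost\n"
print(add_loopback_entry(sample, "crl.microsoft.com"))
```

On Windows the real file normally lives at C:\Windows\System32\drivers\etc\hosts and needs an elevated prompt to edit. The DNS cache can be cleared afterwards with ipconfig /flushdns.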
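The difference between “nobody answers” and “somebody says no” is easy to see for yourself with a rough Python sketch like the one below. The name probe_tcp and the 3 second timeout are my own choices for illustration; this is not how SPAdmin performs its check, just a single TCP connection attempt that reports how it ended and how long it took.

```python
import socket
import time


def probe_tcp(host, port, timeout=3.0):
    """Try one TCP connection; return (outcome, seconds_elapsed).

    outcome is "connected", "refused", "timed out" or "error: ...".
    (Hypothetical helper for illustration only.)
    """
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except ConnectionRefusedError:
        # Something answered and rejected the connection outright.
        outcome = "refused"
    except socket.timeout:
        # Nothing answered before the timeout expired.
        outcome = "timed out"
    except OSError as exc:
        # Anything else, e.g. name resolution failure or unreachable network.
        outcome = f"error: {exc}"
    return outcome, time.monotonic() - started


# Example calls (port 80 is an assumption; check the port in your own trace):
#   probe_tcp("crl.microsoft.com", 80)   # before the hosts-file change
#   probe_tcp("127.0.0.1", 80)           # what the hack redirects to
```

On a machine with nothing listening on the loopback port, the second call should come back as “refused” almost straight away. A blocked route tends to use up the whole timeout, or fail with an unreachable error, depending on how the network rejects the traffic.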
Another Solution (for Upgrades Only)
If this is only failing when completing an upgrade, another option is to run psconfig with the -force parameter. This obviously won’t help when adding a new server, but it will at least do the job of completing the upgrade.