SharePoint Search Service - Failover & Outage Resiliency

I thought I’d share some tests I’ve done recently on how much more resilient the search engine in SharePoint 2013 is to server outages. It’s especially nice for consuming farms, because they are abstracted from any server apocalypse going on in the publishing farm; the consuming farm just carries on and everything keeps working.

Back in the days of 2010, if your single search administration server went down you could kiss goodbye to the search service application it was administering, along with any apps, services, web parts or pages that relied on it. No more in 2013: your search topology no longer needs a single point of failure, provided you have the hardware to set up a decent topology.

Anyway, here you see a connection to a published search service from a consuming farm:

clip_image002

The publishing string for the published search-app is:

urn:schemas-microsoft-com:sharepoint:service:d184aa7911cb41269598b3780592ff52#authority=urn:uuid:ef685d49d59c4782b594f23f163d11eb&authority=https://sp15-search-crl:32844/Topology/topology.svc

Notice the server being mentioned there.
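If you want to pull that server name out programmatically, here is a rough Python sketch. It assumes the part after the `#` is simply a list of `&`-separated `authority=` values, which is what the string above looks like; I haven't checked this against any formal spec, and `topology_endpoint` is just a name I made up for the example.

```python
from urllib.parse import urlsplit


def topology_endpoint(publishing_urn: str):
    """Return (host, port) of the topology service named in a publishing URN.

    Assumption: everything after '#' is '&'-separated key=value pairs and
    the topology service is the 'authority' value that is an http(s) URL.
    """
    _, _, fragment = publishing_urn.partition("#")
    for pair in fragment.split("&"):
        key, _, value = pair.partition("=")
        # Tolerate stray whitespace picked up when copying the string around.
        value = "".join(value.split())
        if key == "authority" and value.lower().startswith(("http://", "https://")):
            url = urlsplit(value)
            return url.hostname, url.port
    raise ValueError("no http(s) authority found in publishing URN")


PUBLISHING_URN = (
    "urn:schemas-microsoft-com:sharepoint:service:d184aa7911cb41269598b3780592ff52"
    "#authority=urn:uuid:ef685d49d59c4782b594f23f163d11eb"
    "&authority=https://sp15-search-crl:32844/Topology/topology.svc"
)

host, port = topology_endpoint(PUBLISHING_URN)
print(host, port)
```

For the string above this picks out the `sp15-search-crl` server, which is the one we're about to switch off.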

High Availability SharePoint Search

If we look at that topology on the publishing farm, I’ve basically triplicated all the services. Admittedly this isn’t normal, for performance reasons, but it’s set up that way just to demo the point.

clip_image004

Now let’s kill the server in the publishing string.

clip_image006

We can see the effect fairly immediately in the search management page:

clip_image008

Clearly there’s a problematic server there; the very one that the web front end (WFE) in the other farm was going to use.

If we look at the event logs on said WFE, we can see a health warning thrown up by a timer job. It’s nice to know things might be a bit stormy; the consuming farm can’t know what the impact will be, of course, so it flags the problem just in case.

clip_image010

Never mind though because, crucially, as there was no single point of failure, searches still work no problem for our consuming farm/web-application:

clip_image012

Here we see that the web front end has adapted just fine to the outage, and search results are coming in anyway.

Looking at the WFE’s logs, you’d be forgiven for not realising there was even a problem.

clip_image014

Notice the new server name in there (there is one; the screenshot isn’t particularly clear). No errors or warnings; as far as that particular query operation is concerned, there is no problem.

And that’s it!

Obviously it’s not magic; if there isn’t enough redundancy built into your topology then it’ll all come crashing down. But here I could turn off any one of the search servers and nothing much would happen. It’s a highly available search solution, finally!
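To make the "enough redundancy" point a bit more concrete, here is a deliberately simplified Python sketch that checks whether any single server is a single point of failure. It treats a topology as nothing more than a map of server name to the search components it hosts, and it ignores index partitions and replicas entirely, so take it as an illustration of the idea rather than how SharePoint actually evaluates topology health. Apart from `sp15-search-crl`, the server names are made up.

```python
# The search component types in a SharePoint 2013 topology (simplified names).
REQUIRED_ROLES = frozenset({
    "Admin", "Crawler", "ContentProcessing",
    "AnalyticsProcessing", "QueryProcessing", "Index",
})


def surviving_roles(topology, down):
    """Roles still hosted somewhere once the servers in `down` are offline."""
    roles = set()
    for server, server_roles in topology.items():
        if server not in down:
            roles |= set(server_roles)
    return roles


def single_points_of_failure(topology, required=REQUIRED_ROLES):
    """Servers whose loss alone leaves at least one required role unhosted."""
    return sorted(
        server for server in topology
        if not required <= surviving_roles(topology, {server})
    )


# Hypothetical topologies: every role on three servers, versus a farm
# where only one server hosts the admin component.
triplicated = {
    "sp15-search-crl": REQUIRED_ROLES,
    "search-b": REQUIRED_ROLES,
    "search-c": REQUIRED_ROLES,
}
fragile = {
    "sp15-search-crl": REQUIRED_ROLES,
    "search-b": REQUIRED_ROLES - {"Admin"},
}

print(single_points_of_failure(triplicated))
print(single_points_of_failure(fragile))
```

In this toy model the triplicated topology has no single point of failure, while the second one falls over as soon as the only admin server goes away, which is pretty much the 2010 situation.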

This isn’t specific to published apps either; it’s just that the messages are nicer. Anyway, I hope someone finds it useful!

 

Cheers,

// Sam Betts