Beware crawling the non-Default zone for a SharePoint 2013 Web Application

Update: I've now published another post "Problems Crawling the non-Default zone *Explained" that goes on to explain the underlying behaviors that I warned about and described in this post...

---------------------------------------

After playing for a while with SharePoint 2013 Search, I thought we were out of the woods regarding crawls of the non-Default Alternate Access Mapping (AAM) zone for a SharePoint Web Application. Crawling a non-Default zone caused all sorts of problems in earlier versions of SharePoint (primarily busted contextual scopes, broken social tagging, and workflow emails linking to the incorrect zone) because other components throughout SharePoint have a built-in assumption that the Default zone is being crawled.

I'm still working to fully nail down the impacts for SP2013, but from my initial testing, when you crawl a non-Default URL, all search results are relative to the URL crawled rather than the URL from which you query, meaning you will get unexpected URLs in your results. I also suspect it's going to break scoping rules for queries.

Update: I want to seriously caution against using Server Name Mappings as a workaround, particularly in SharePoint 2013. Admittedly, with SharePoint 2010, Server Name Mappings did appear to provide a workaround.

First, although they appear to work, Server Name Mappings were definitely not designed for this particular scenario, so I would not expect them to work in all cases.

Second, in SharePoint 2013, I know for certain that some managed properties (e.g. SPSiteUrl and ParentUrl, to name two) in the index absolutely do not get updated by Server Name Mappings, so adding them will only make the problem worse!!! In other words, you'll have some URL-based properties that are relative to the URL that was crawled and other managed properties relative to the mapped URL...

For example, if I issue a query from some site in the Web Application https://initech, then I should expect all results from this Web Application to be returned relative to https://initech (as in https://initech/result1.aspx and https://initech/result2.aspx). However, if I crawl the URL of a non-Default zone, then my results will all be returned relative to that non-Default URL (such as https://bargainclownmart:88/sites/myTeam/result1.aspx and https://bargainclownmart:88/sites/myTeam/result2.aspx).

Update: I recently published "Alternate Access Mappings (AAMs) *Explained" to provide more insight into AAMs and to better illustrate their often misunderstood concepts.

In the scenario below, I have two Web Applications with the following Alternate Access Mappings (as a side note, I believe host-named site collections are now the preferred method over AAMs, but I wanted to demonstrate this as an example):

Internal URL                                 Zone      Public URL for Zone
https://sp-foo:88                            Default   https://sp-foo:88
https://testingfoo:88                        Intranet  https://testingfoo:88
https://bargainclownmart:88                  Internet  https://bargainclownmart:88
https://bargainclownmart.officespace.lab:88  Extranet  https://bargainclownmart.officespace.lab:88
https://faceman                              Default   https://faceman
https://initech                              Intranet  https://initech
https://initech.officespace.lab              Internet  https://initech.officespace.lab
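
To make the mappings above easier to reason about, here is a rough Python sketch that models them as plain data and looks up which Web Application and zone a given URL belongs to. This is only an illustration: the AAM dictionary, the "sp-foo" and "faceman" labels, and the origin and find_zone helpers are names I made up for this example, not SharePoint APIs.

```python
from urllib.parse import urlsplit

# Illustrative model of the mappings listed above. The keys "sp-foo" and
# "faceman" are just labels for the two Web Applications (hypothetical names).
AAM = {
    "sp-foo": {
        "Default": "https://sp-foo:88",
        "Intranet": "https://testingfoo:88",
        "Internet": "https://bargainclownmart:88",
        "Extranet": "https://bargainclownmart.officespace.lab:88",
    },
    "faceman": {
        "Default": "https://faceman",
        "Intranet": "https://initech",
        "Internet": "https://initech.officespace.lab",
    },
}


def origin(url):
    """Return the lowercase scheme://host[:port] part of a URL."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}".lower()


def find_zone(url):
    """Return (web_app, zone) for a URL, or None if no mapping matches."""
    target = origin(url)
    for web_app, zones in AAM.items():
        for zone, public_url in zones.items():
            if origin(public_url) == target:
                return web_app, zone
    return None


print(find_zone("https://bargainclownmart:88/sites/myTeam/result1.aspx"))
print(find_zone("https://initech/result1.aspx"))
```

With the mappings above, the first lookup lands in the Internet zone of the https://sp-foo:88 Web Application and the second in the Intranet zone of the https://faceman Web Application.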

 

Observed behaviors when crawling the Default URLs...

In my content source, I specify https://faceman and https://sp-foo:88 as the start addresses and then perform a full crawl.

As expected, the URLs in the results are relative to the URL from which the query is performed. For example, notice that the browser's address bar shows https://sp-foo:88 and the results for this Web Application are also displayed relative to this same https://sp-foo:88 URL:

Results related to another Web App would also be relative to this zone (which to my knowledge is new to SP2013). For example, if I query from the https://initech URL (in other words, from the Intranet zone), then all results related to this Web App would be relative to the https://initech URL (such as https://initech/result1.aspx, https://initech/result2.aspx, etc.) as seen in the last two results in the screen shot below...

 

For comparison, observed behaviors when crawling the non-Default URLs...

In my content source, I then specify https://faceman and the Internet zone URL https://bargainclownmart:88 as the start addresses and perform a full crawl.

For my queries from any zone for any Web App, the search results related to the https://sp-foo:88 Web App are always returned relative to the URL that was crawled... in this case https://bargainclownmart:88. In other words...
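
The difference between the two crawls can be summed up in a small Python sketch. To be clear, this is a simplified model of the behavior I observed for the https://sp-foo:88 Web Application, not how SharePoint actually implements it, and the ZONES dictionary and result_url function are hypothetical names used only for illustration.

```python
# Simplified model of the observed behavior, not SharePoint's implementation.
# Zone -> public URL for the https://sp-foo:88 Web Application.
ZONES = {
    "Default": "https://sp-foo:88",
    "Intranet": "https://testingfoo:88",
    "Internet": "https://bargainclownmart:88",
    "Extranet": "https://bargainclownmart.officespace.lab:88",
}


def result_url(crawled_zone, query_zone, path):
    """Return the URL a result would display for the given crawl and query zones."""
    if crawled_zone == "Default":
        # Default zone crawled: results follow the zone the query came from.
        return ZONES[query_zone] + path
    # Non-Default zone crawled: results stay relative to the crawled URL.
    return ZONES[crawled_zone] + path


path = "/sites/myTeam/result1.aspx"
print(result_url("Default", "Intranet", path))   # follows the querying zone
print(result_url("Internet", "Intranet", path))  # stuck on the crawled URL
```

In this model, crawling the Default zone and querying from the Intranet zone gives https://testingfoo:88/sites/myTeam/result1.aspx, while crawling the Internet zone gives https://bargainclownmart:88/sites/myTeam/result1.aspx no matter which zone the query comes from.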

 

The moral to this story...

Always crawl the Default URL (note: the URL being crawled must be in a Windows-authenticated zone) unless there is a REALLY good reason otherwise.