SharePoint Search *Quirks: Adding Content Sources

I recently encountered a scenario in which adding SharePoint Search content sources behaved differently depending on the order in which they were added and the type of content source involved.

Initially, the behaviors looked unintentional, but after further research and a review of the underlying source (which, obviously, I can't disclose, though I can generalize the logic here), they turned out to be entirely expected and appear to be by design. (A small disclaimer: I don't work within the SharePoint Product Group and I don't speak for the PG, so this should be considered my interpretation rather than an official stance by the Product Group.)

The Scenario:
Assume you have a single Search content source named "Web - for httpFooAsdf" that was created as type "Web Sites" and contains only the lone start address https://foo/asdf. It's worth noting that you can easily reproduce the following behaviors using non-existent URLs such as https://foo and do not even have to invoke any crawls to trigger this error.

Next, create a second content source named "Web - for httpFooToo", also of type "Web Sites", containing only the lone start address https://foo/too. We expect this to succeed, and we should be able to create [n] more "Web" content sources with start addresses under https://foo (assuming none are actual duplicates)... so far, so good.

Now, let's change it up. What if https://foo is actually a SharePoint Web Application with portions ("asdf" and "too") being crawled as "Web" (e.g. a public-facing site), but you also want to crawl some of its Site Collections as SharePoint content? For this case, you add two more content sources, "SP Sites - for httpFooSitesABC" and "SP Sites - for httpFooSitesXYZ", containing https://foo/sites/abc and https://foo/sites/xyz respectively. Again, we would expect these content sources to be created successfully.

Finally, we need to create another new "Web" content source named "Web - for httpFooTastic" with https://foo/tastic. Based on what we've done so far, this should work, right? Actually, it fails with:

x The start address "https://foo/" already exists in this or another content source
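The whole scenario can be reproduced with the standard Search PowerShell cmdlets. Here's a minimal sketch, assuming an on-premises SSA; "-name-of-your-SSA-" and the https://foo/* URLs are placeholders from the example, and https://foo does not need to resolve:

```powershell
# Hypothetical repro of the scenario above; no crawls are ever started.
$SSA = Get-SPEnterpriseSearchServiceApplication "-name-of-your-SSA-"

# Two "Web" content sources under https://foo -- both succeed
New-SPEnterpriseSearchCrawlContentSource -SearchApplication $SSA -Name "Web - for httpFooAsdf" -Type Web -StartAddresses "https://foo/asdf"
New-SPEnterpriseSearchCrawlContentSource -SearchApplication $SSA -Name "Web - for httpFooToo" -Type Web -StartAddresses "https://foo/too"

# Two "SharePoint" content sources for Site Collections -- both succeed
New-SPEnterpriseSearchCrawlContentSource -SearchApplication $SSA -Name "SP Sites - for httpFooSitesABC" -Type SharePoint -SharePointCrawlBehavior CrawlSites -StartAddresses "https://foo/sites/abc"
New-SPEnterpriseSearchCrawlContentSource -SearchApplication $SSA -Name "SP Sites - for httpFooSitesXYZ" -Type SharePoint -SharePointCrawlBehavior CrawlSites -StartAddresses "https://foo/sites/xyz"

# One more "Web" content source -- this one fails with the
# "start address ... already exists" error described above
New-SPEnterpriseSearchCrawlContentSource -SearchApplication $SSA -Name "Web - for httpFooTastic" -Type Web -StartAddresses "https://foo/tastic"
```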

The Explanation:
At first glance, it may seem that SharePoint content sources have additional internal "checks" or functionality that allow you to add more "SharePoint" content sources with start addresses under https://foo (as seen with https://foo/sites/xyz), yet prevent new "Web" content sources with virtually the same configuration from being created (as seen with https://foo/tastic). The fallacy here, however, is assuming there is only one kind of "SharePoint" content source, when effectively there are two:

  • SharePoint content sources for Site Collections
  • SharePoint content sources for Web Applications

A quick proof: instead of creating "Web - for httpFooTastic", attempt to create another content source of type "SharePoint" using the start address "https://foo" with the crawl behavior set to "Crawl everything under the hostname for each start address". The same error ("The start address "https://foo/" already exists in this or another content source") occurs once again. In other words, the behavior that leads to this error is simply not tied to "Web" content sources.
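That proof attempt can be sketched in PowerShell as well (same placeholder SSA and URLs as before; the content source name is hypothetical). The "Crawl everything under the hostname" option in Central Administration corresponds to the CrawlVirtualServers crawl behavior:

```powershell
# Hypothetical: attempt a "SharePoint" content source that crawls the whole
# Web Application (CrawlVirtualServers) using the host URL https://foo.
# Because https://foo/sites/abc and https://foo/sites/xyz already exist in
# "SP Sites" content sources, this fails with the same "already exists" error.
New-SPEnterpriseSearchCrawlContentSource -SearchApplication $SSA `
    -Name "SP WebApp - for httpFoo" -Type SharePoint `
    -SharePointCrawlBehavior CrawlVirtualServers -StartAddresses "https://foo"
```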

The internal logic around this error makes the behavior more apparent, and we can simulate it with the following PowerShell. (Note: when adding a new content source, the content source object is created first, and then the start addresses are added to it in a second action. The error occurs during that second step. Accordingly, this example starts with an existing content source object and simulates the logic that runs when adding a start address to it.)

For example, assume you want to add the start address "https://foo/testing" to a content source named "for httpFooTesting". (I'm being intentionally vague about the type here because "Web" and "SharePoint" content sources go through this same logic when adding a start address.) The simplified steps resemble the following:

$SSA = Get-SPEnterpriseSearchServiceApplication "-name-of-your-SSA-"
$CS = $SSA | Get-SPEnterpriseSearchCrawlContentSource "for httpFooTesting"

function testNewStartAddress {
   param ([Uri] $newUrl)
   # Rebuild the host-level URL (e.g. https://foo) from the new start address
   $hostUrl = $newUrl.Scheme + [Uri]::SchemeDelimiter + $newUrl.Authority

   if ($CS.SharePointCrawlBehavior -eq "CrawlSites") {
       # Pseudocode: SharePoint content sources for Site Collections
       Write-Host ("Check if...")
       Write-Host (" 1) " + $hostUrl + " is in another SP Web App content source")
       Write-Host (" 2) " + $newUrl + " exists in any content source")
       Write-Host ("If both `$false, then " + $newUrl + " is valid...")
   } else {
       # Pseudocode: "Web" and SP Web App content sources
       Write-Host ("Check if...")
       Write-Host (" 1) " + $hostUrl + " overlaps any SP Sites content source")
       Write-Host (" 2) " + $newUrl + " exists in any content source")
       Write-Host ("If both `$false, then " + $newUrl + " is valid...")
   }
}

testNewStartAddress "https://foo/testing"

From this, we can see that the specialized case revolves around SharePoint content sources for Site Collections (i.e. the test if ($CS.SharePointCrawlBehavior -eq "CrawlSites")). For content sources of type "SharePoint" with a crawl behavior of "CrawlVirtualServers", and for content sources of type "Web" (which do not have the "SharePointCrawlBehavior" property at all), this test evaluates to false, so they are processed by the second block.

Simply put, the additional logic is intended to ensure that a Site Collection in one content source is not crawled again by a content source that specifies its parent Web Application (such as Site Collection https://foo/sites/xyz and Web Application https://foo). Further, by limiting the $hostUrl checks to just SharePoint content sources, flexibility is maintained for "Web" content sources.

For example, if you configure a "Web" content source with "https://foo", you can still add "https://foo/anything/else" in another "Web" content source. (If $hostUrl were instead checked across all content sources, adding "https://foo/anything/else" would also result in the "already exists in this or another content source" error.)
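To see the asymmetry without a SharePoint farm, the two checks from the second block above can be modeled as a small stand-alone simulation. Everything here is illustrative: the function name, the shape of the $existing list, and the "SPSites" label are my own stand-ins, not the product's internal types:

```powershell
# Stand-alone simulation of validating a new "Web" start address.
# Each entry in $existing models a content source: its kind and start addresses.
function Test-NewWebStartAddress {
    param ([Uri] $newUrl, [object[]] $existing)
    $newHost = $newUrl.Authority

    # Check 1: does the host overlap a SharePoint "Sites" content source?
    $hostOverlap = @($existing | Where-Object {
        $_.Type -eq "SPSites" -and
        ($_.Starts | Where-Object { ([Uri]$_).Authority -eq $newHost })
    }).Count -gt 0

    # Check 2: does the exact start address already exist anywhere?
    $exactDup = @($existing | Where-Object {
        $_.Starts | Where-Object { ([Uri]$_) -eq $newUrl }
    }).Count -gt 0

    return (-not $hostOverlap) -and (-not $exactDup)
}

# Only a "Web" source at the host: another Web start address is allowed
$webOnly = @(@{ Type = "Web"; Starts = @("https://foo") })
Test-NewWebStartAddress "https://foo/anything/else" $webOnly      # $true

# Add an "SP Sites" source under the same host: the new Web address is blocked
$withSpSites = $webOnly + @(@{ Type = "SPSites"; Starts = @("https://foo/sites/abc") })
Test-NewWebStartAddress "https://foo/tastic" $withSpSites         # $false -- the quirk
```

This mirrors the scenario at the top: the "Web - for httpFooTastic" start address is rejected not because of the other "Web" sources, but because SP Sites content sources already claim the https://foo host.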

That said, there does seem to be an assumption that a host URL (e.g. https://foo in the examples here) would not be crawled as both a SharePoint content source and a Web content source. Although there are always cases I haven't considered, I suspect this scenario could be handled by simply crawling the entire Web Application and allowing security trimming to remove any items that were not crawled as "anonymous".

I hope this explains the *quirk...