POC - Part 3a: PowerShell: Crawling an External Web Site with the SharePoint 2010 Web Crawler

And now it is time to speak of many things...but we won't. Instead, we will create, configure, and run the SharePoint 2010 default web crawler from PowerShell.

Remember to open up a PowerShell window from the Start menu:

Start --> FAST Search Server 2010 for SharePoint (right click --> Run as Administrator)

The Short Version

  1. Add the SharePoint PowerShell cmdlets

    Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

  2. Create and configure the content source (enter the URL of a site that won't mind being crawled. Perhaps your blog page?)

    $contentSSA = "FASTContent"

    $startaddress = "[enter a URL here]"

    $contentsourcename = "Web site crawl"

    $contentsource = New-SPEnterpriseSearchCrawlContentSource -SearchApplication $contentSSA -Type Web -name $contentsourcename -StartAddresses $startaddress -MaxSiteEnumerationDepth 0

  3. Start the crawl

    $contentsource.StartFullCrawl()

    $contentsource.CrawlStatus

    1. Execute $contentsource.CrawlStatus again.
    2. Wait. Wait. Wait. Wait.
    3. Execute $contentsource.CrawlStatus again.
    4. Wait. Wait. Wait. Wait.
    5. Keep executing $contentsource.CrawlStatus until the status changes to CrawlCompleting and then Idle
  4. Execute a search
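
If all that waiting in step 3 gets old, the polling can be scripted. What follows is only a rough sketch: Wait-ForCrawl is a made-up helper name (not a SharePoint cmdlet), the 30 second interval and one hour timeout are arbitrary assumptions, and it assumes that reading $contentsource.CrawlStatus returns the live status, as the steps above imply. It also assumes the crawl has already left Idle by the time you call it.

```powershell
# Sketch only: Wait-ForCrawl is a hypothetical helper, not part of SharePoint.
function Wait-ForCrawl {
    param(
        [Parameter(Mandatory = $true)] [scriptblock] $GetStatus,  # returns the current crawl status
        [int] $PollSeconds = 30,       # assumed polling interval
        [int] $TimeoutSeconds = 3600   # assumed upper bound so the loop cannot run forever
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while ($true) {
        $status = [string](& $GetStatus)
        Write-Host "Crawl status: $status"
        if ($status -eq 'Idle') { return $true }          # crawl finished
        if ((Get-Date) -ge $deadline) { return $false }   # gave up waiting
        Start-Sleep -Seconds $PollSeconds
    }
}

# Usage against the content source from step 2 (run after StartFullCrawl()):
# Wait-ForCrawl -GetStatus { $contentsource.CrawlStatus }
```

The status check is passed in as a script block so the loop itself has no SharePoint dependency. It returns $true once the status reads Idle and $false if the timeout passes first.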

The Long Version

Again, there really isn't any reason to go over all the steps, as they don't really change from post to post. I do want to clarify a few things.

  • As we installed the advanced filter pack in a previous post, we don't need to do that again.

  • In the line of PowerShell that creates the content source:

    $contentsource = New-SPEnterpriseSearchCrawlContentSource -SearchApplication $contentSSA -Type Web -name $contentsourcename -StartAddresses $startaddress -MaxSiteEnumerationDepth 0

    It is interesting to note that the New-SPEnterpriseSearchCrawlContentSource cmdlet defaults to the Custom crawl setting, which reads all pages and follows all links found at the starting URL. To mimic the behavior of the previous blog post (and to avoid crawling the world), we set MaxSiteEnumerationDepth to zero. This makes the crawler read the content at the site we started at, rather than letting it go into ADD mode, becoming easily distracted and chasing down every car that goes by.
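
To make the site hop idea concrete, here is a toy model of the decision the crawler makes for each link. It is not how SharePoint implements it: Test-LinkFollowed is a hypothetical helper, the example URLs are placeholders, and it assumes a "site hop" simply means following a link to a different host.

```powershell
# Toy model only: Test-LinkFollowed is a hypothetical helper, not part of SharePoint.
function Test-LinkFollowed {
    param(
        [Parameter(Mandatory = $true)] [string] $FromUrl,   # page the link was found on
        [Parameter(Mandatory = $true)] [string] $ToUrl,     # link being considered
        [int] $SiteHopsUsed = 0,                            # host changes made so far
        [int] $MaxSiteEnumerationDepth = 0                  # same value we passed to the cmdlet
    )
    $hops = $SiteHopsUsed
    # Assumption: moving to a different host costs one site hop.
    if (([Uri]$ToUrl).Host -ne ([Uri]$FromUrl).Host) { $hops++ }
    return ($hops -le $MaxSiteEnumerationDepth)
}

Test-LinkFollowed -FromUrl 'http://blog.example.com/' -ToUrl 'http://blog.example.com/about'   # True
Test-LinkFollowed -FromUrl 'http://blog.example.com/' -ToUrl 'http://other.example.org/'       # False
```

With the depth at zero, links on the starting host are followed and anything pointing off to another host is left alone, which is the behavior we are after.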

Thanks

Many thanks to Runar Olsen and Ben Berenstein for letting me bounce ideas off them (you can both take your helmets off now).

References

SharePoint Server 2010 Search: Windows PowerShell Cmdlets