In broad terms, SharePoint Search comprises three main functional components:
- Crawling (Gathering): Collecting content to be processed
- Indexing: Organizing the processed content into a structured/searchable index
- Query Processing: Retrieving a relevant result set relative to a given user query
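To make the three stages concrete, here is a toy Python sketch. It is only an illustration of the crawl, index, and query split, not how SharePoint implements any of it: the `CONTENT` dictionary, the URLs, and the function names are all made up for this example, and the "index" is a plain in-memory inverted index.

```python
from collections import defaultdict

# Hypothetical in-memory "content source": URL -> document text.
CONTENT = {
    "http://intranet/a": "SharePoint search crawls content",
    "http://intranet/b": "The index makes content searchable",
}

def crawl(source):
    """Crawling (gathering): yield (url, text) pairs from the content source."""
    for url, text in source.items():
        yield url, text

def build_index(documents):
    """Indexing: map each lowercase term to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in documents:
        for term in text.lower().split():
            index[term].add(url)
    return index

def query(index, user_query):
    """Query processing: return the URLs that contain every query term."""
    terms = user_query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

index = build_index(crawl(CONTENT))
print(sorted(query(index, "content")))        # both documents mention "content"
print(sorted(query(index, "index content")))  # only document b has both terms
```

The point of the sketch is simply that each stage consumes the output of the previous one, which is why a problem in gathering tends to surface later as a missing or stale query result.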
In this post, we’ll create a reference baseline by defining key concepts and terminology. In a following post, we’ll take a step-by-step look at the Crawling process, which is used by both SharePoint Search and FAST Search for SharePoint (FS4SP). Finally, we’ll dig further into the Indexing process for SharePoint Search by reviewing the protocol handlers, iFilters, and Search Plug-ins used in building the search index.
- I originally planned to write about Query Processing as well, but my friend/colleague Russ Maxwell recently wrote an informative post about Query here. He also wrote a good two part post on Search 2010 Architecture found here and here.
- I also want to thank my friend/colleague Anthony Casillas for his technical review and for our work together digging deeper into SharePoint Search.
Avoiding the double-speak
Unfortunately, like much of SharePoint, the nomenclature for Search may vary depending on your background. For example, in SharePoint 2007 (primarily MOSS, but also WSS), the server with the Indexer Role performed the crawling processes, so you may see other references using ‘Indexer’ and ‘Crawler’ interchangeably.
The term ‘Indexer’ is also muddied because the SP2007 Indexer server holds the master copy of the full-text index, whereas in SP2010 the master index resides as partitions across the query components. With this, SP2010 decouples the Crawl Component from the master copy of the index, which can make ‘Indexer’ a misleading term to describe a Crawler.
Another point of confusion comes from the SP2007 option to define Crawling Servers for the Office SharePoint Search Service (e.g. the Web Front End And Crawling section). This option specifies the web front ends to-be-crawled, but these WFEs perform no direct role in the crawling process – just indirectly as web server(s) that fulfill HTTP GET requests from the Crawler …err… the Indexer in SP2007-speak.
Concepts, Terminology, and Working Definitions
For this series (as well as all future posts unless otherwise noted), I’ll only refer to Indexing when discussing the specific process of building the index. For the Crawling process, I’ll emphasize the gathering aspects or interchangeably use Gathering. Otherwise, any reference to a Crawl (e.g. a full or incremental crawl) generally refers to the overall crawling process, which implies the full pipeline with both crawling and indexing.
To further consistency, consider the following my Rosetta stone of sorts (and I’ll try to note other points of cross-over along the way). Also, I’ve included links where possible (many of which are to Office 12 documentation, but are relevant to SharePoint 2010 nonetheless).
Update (11/2/2012): I’ve noticed that a lot of the content linked below has been “permanently removed from the website”. I’m working internally to find out whether these pages were replaced and will try to get the links updated as soon as possible.
- Search Administration
- In a nutshell: Users typically interface with this component via Central Admin (Central Admin -> Manage Service Applications -> [click on the SSA]), but it also entails both a Search Admin Web Service and an SSA Admin DB
- Crawl Components
- TechNet: “In Microsoft SharePoint Server 2010 Search, crawl components process crawls of content sources, propagate the resulting index files to query components, and add information about the location and crawl schedule of content sources to their associated crawl databases. Crawl components are associated with a single Search Service Application.” http://technet.microsoft.com/en-us/library/ee805950
- Additional Notes: There is a many-to-one relationship between Crawl Components and a Crawl DB (i.e. each Crawl Component associates to one and only one Crawl DB, but each Crawl DB can manage multiple Crawl Components)
- Query Components
- TechNet: “In Microsoft SharePoint Server 2010 Search, query components return search results to the query originator. Each query component is part of an index partition, which is associated with a specific property database that contains metadata associated with a specific set of crawled content.” http://technet.microsoft.com/en-us/library/ee805953
- Search Service App (SSA) Admin Database (the “Search Admin DB”)
- In a nutshell: The SSA Admin DB helps manage high level aspects of the SSA such as the search topology, crawl state/history, and host distribution & refactoring. This also stores the security descriptors (ACLs) used to security trim query results
- TechNet: “The Administration database hosts the Search service application configuration and access control list (ACL), and best bets for the crawl component. This database is accessed for every user and administrative action.” http://technet.microsoft.com/en-us/library/cc678868
- Useful Tables for Troubleshooting:
- Crawl Database (the “Crawl Store”)
- In a nutshell: The Crawl DB helps manage aspects of Crawls including scheduling, content sources, and related crawl components. It also provides a crawl queue, tracks the status of crawled URLs, and stores links/text from anchor tags discovered during crawls
- TechNet: “In Microsoft SharePoint Server 2010 Search, crawl databases contain data related to the location of content sources, crawl schedules, and other information specific to crawl operations for a specific Search Service Application… Crawl databases are associated with crawl components, and can be dedicated to specific hosts by creating host distribution rules.” http://technet.microsoft.com/en-us/library/ee805952
- Stored Procs:
- Property Database (the “Property Store”)
- TechNet: “In Microsoft SharePoint Server 2010 Search, property databases contain metadata associated with crawled content… Property databases are associated with index partitions, and return any metadata associated with content in query results.” http://technet.microsoft.com/en-us/library/ee805954
- MSSDocProps …contains pairings of DocIDs and managed property IDs along with the associated values
- MSSDocSdids …contains both the Sdid (“the unique identifier of the search security descriptor of the item”) and the DuplicateHash (“used for duplicate result removal”)
- Additional Notes: The Property Database is not used with FAST Search
- Search Admin Web Service (SearchAdmin.svc in IIS)
- Search Query and Site Settings (SQSS) Web Service (SearchService.svc in IIS)
- In a nutshell: The SQSS is called by the WFE to handle queries. It also serves as a load balancer to query components
- TechNet: “The Search Query and Site Settings service is an Internet Information Services (IIS) service. By default, this service runs on each server that includes a search query component. The service manages the query processing tasks, which include sending queries to one or more of the appropriate query components and building the results set. At least one instance of the service must be running to serve queries.” http://technet.microsoft.com/en-us/library/ff468691
- SharePoint Server Search service (MSSearch.exe)
- MSDN: “Component of the Search service that manages the content crawling process and has rules that determine what content is crawled.” http://msdn.microsoft.com/en-us/library/dd588216(v=office.11).aspx
- MSDN Magazine: “The Gatherer Pipeline… is responsible for crawling and indexing content from various repositories, such as SharePoint sites, HTTP sites, file shares, Lotus Notes, Exchange Server and so on. This component lives inside MSSearch.exe” http://msdn.microsoft.com/en-us/magazine/ff796226.aspx
- Additional Notes: The service can be stopped/started from the services.msc snap-in or by using: net [stop|start] osearch14
- SharePoint Search Filter Daemon (MSSDmn.exe)
- MSDN Magazine: “When a request is issued to crawl a repository, the gatherer process [MSSearch.exe] invokes a filter daemon, MssDmn.exe, to load the required protocol handlers and filters necessary to connect, fetch and parse the content” http://msdn.microsoft.com/en-us/magazine/ff796226.aspx
- MSDN: “Component that handles requests from the Gatherer. Uses protocol handlers to access content sources, and IFilters to filter files. Provides Gatherer with a stream of data containing filtered chunks and properties.” http://msdn.microsoft.com/en-us/library/dd588216(v=office.11).aspx
- Additional Notes: The mssdmn.exe is designed to enable extensibility by providing isolation from the Search Service. In other words, if a mssdmn.exe process crashes because of a flaky iFilter or excessive memory usage (again, typically related to iFilters), the mssdmn.exe may be cleanly terminated without impacting the Search Service itself (in which case, a new mssdmn.exe process will be spawned)
- Search Service Application: “The Windows user account that is used for the SharePoint Server Search service, the Search Admin Web Service application pool, and Search Query and Site Settings Web Service application pool… You can use the same account for the Search service account, Search Admin Web Service, and Search Query and Site Settings Web Service.” http://technet.microsoft.com/en-us/library/gg502597
- Default Content Access: “The identity that is used by the Search service application to access content when crawling… For the default content access account, we recommend that you use a separate account to provide security isolation.” http://technet.microsoft.com/en-us/library/gg502597
- SharePoint Server 2010 Search: Windows PowerShell cmdlets (en-US) http://social.technet.microsoft.com/wiki/contents/articles/204.sharepoint-server-2010-search-windows-powershell-cmdlets-en-us.aspx
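To tie a few of these definitions together, here is a small Python sketch of the relationships described above: each Crawl Component belongs to exactly one Crawl DB (while a Crawl DB can serve several Crawl Components), and each Query Component is part of an index partition that is associated with a Property DB. The class names and instance names (`CrawlStore1`, `Crawl-0`, and so on) are hypothetical and exist only for illustration; this is a mental model, not the SharePoint object model.

```python
from dataclasses import dataclass, field

@dataclass
class CrawlDatabase:
    name: str
    # Back-references are excluded from repr/compare to avoid cycles.
    crawl_components: list = field(default_factory=list, repr=False, compare=False)

@dataclass
class CrawlComponent:
    name: str
    crawl_db: CrawlDatabase  # one and only one Crawl DB per Crawl Component

    def __post_init__(self):
        # Register with the Crawl DB, which may manage many components.
        self.crawl_db.crawl_components.append(self)

@dataclass
class PropertyDatabase:
    name: str

@dataclass
class IndexPartition:
    name: str
    property_db: PropertyDatabase  # partition is tied to a specific Property DB
    query_components: list = field(default_factory=list, repr=False, compare=False)

@dataclass
class QueryComponent:
    name: str
    partition: IndexPartition  # each Query Component is part of a partition

    def __post_init__(self):
        self.partition.query_components.append(self)

# Hypothetical topology: two crawl components sharing one Crawl DB,
# and one query component serving one index partition.
crawl_db = CrawlDatabase("CrawlStore1")
CrawlComponent("Crawl-0", crawl_db)
CrawlComponent("Crawl-1", crawl_db)
prop_db = PropertyDatabase("PropertyStore1")
partition = IndexPartition("Partition-0", prop_db)
qc = QueryComponent("Query-0", partition)

print([c.name for c in crawl_db.crawl_components])
print(qc.partition.property_db.name)
```

Modeling the Crawl DB reference as a required single field (rather than a list) is what encodes the "one and only one" side of the many-to-one relationship.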
This post was intended to describe each of the puzzle pieces involved with SharePoint Search. In coming posts, I plan to show how these pieces fit together: first by explaining the crawling process and then by looking into the Indexing process for SharePoint Search.
- Concepts and Terminology
- The Crawling Process and Flow ( here )
- SharePoint Search Indexing /* coming soon */
Update (10-29-2012): This post was written prior to the release of SP2013. In hindsight, I should have held off on starting this in order to include SP2013, but as is, the content is most applicable to SP2010.
- Working with Microsoft FAST Search Server 2010 for SharePoint http://www.amazon.com/Working-Microsoft-Search-Server-SharePoint/dp/0735662223
- SharePoint Brew
- Search 2010 Architecture and Scale – Part 1 Crawl http://blogs.msdn.com/b/russmax/archive/2010/04/16/search-2010-architecture-and-scale-part-1-crawl.aspx
- Search 2010 Architecture and Scale – Part 2 Query http://blogs.msdn.com/b/russmax/archive/2010/04/23/search-2010-architecture-and-scale-part-2-query.aspx
- Guide to walking a SharePoint 2010 Search Query behind the scenes http://blogs.msdn.com/b/russmax/archive/2012/02/15/guide-to-walking-a-sharepoint-2010-search-query-behind-the-scenes.aspx
- Learning roadmap for Search in SharePoint 2010 (including FAST Search for SharePoint) – Part 1: Search 101 and Architecture http://searchunleashed.wordpress.com/2011/05/10/learning-roadmap-for-search-in-sharepoint-2010-including-fast-search-for-sharepoint-part-1-search-101-and-architecture/
- Trim SharePoint Search Results for Better Security http://msdn.microsoft.com/en-us/magazine/ff796226.aspx