Creating crawl schedules and starvation - How to detect it and minimize it.

Hello again, Dan Blood here. For this post I want to discuss starvation as it relates to the Search crawler. One of the more difficult tasks a Search admin will face is figuring out how to build out the myriad of crawl schedules needed to keep your content freshly indexed. As you build out these schedules you will want to keep a close eye on the system using the monitoring information below, and add new schedules slowly so that you minimize starving the crawl of resources while maximizing utilization of the crawler. Starvation for Enterprise Search is defined as the crawler's inability to allocate another thread to retrieve the next document in the queue of work. Taken broadly, this can be caused by resource (I/O) contention on the SQL machine, too many hosts concurrently participating in the crawl, "hungry" hosts that do not quickly relinquish a thread, and finally backups (since crawls are paused while a backup runs).

To make this conversation a little more tractable I need to define what I mean by a "hungry" host. Hungry hosts are hosts that lock up resources on the crawling side in one or more of the following circumstances:

  • Slow hosts: This is the obvious case where a host being crawled does not have the capacity to service all of the requests that the crawler is sending to it. Sending more concurrent requests to such a server can slow it down even further.
  • Hosts requiring extra work for incremental crawls: The primary example of this is SharePoint 2003. This store tends to have a high rate of security changes, and the crawler processes the entire document when a security change is detected. Basic HTTP crawls fall partially into this bucket, since each document requires a round trip to the server, although the modified date is checked before the entire document is downloaded.
  • Hosts and content that are rich in properties: You will see this most commonly with the following content store types: BDC, People Import, and People crawls, but any store can exhibit this trait. BDC, People Import, and People crawls have, by default, a large number of properties per document, which causes the SQL machine to do more work than average.

The most efficient types of crawls are:

  • SharePoint 2007: These content stores keep a log of the changes that have been made, allowing the crawler to be very selective about what content to download for incremental crawls.
  • File Shares: Detecting whether a document has changed still requires a round trip, but the check can be done at the folder level, allowing an entire folder hierarchy to be skipped if nothing lower in the hierarchy has changed.
  • Exchange Public Folders: These crawls behave just like File Share crawls.

With the above in mind, use the following guidelines when you start building out your content sources and crawl schedules:

1. Minimize the number of content sources that you have. Group hosts of the same repository type and similar size into individual content sources. The intent here is to reduce the overall count of crawls that your system will run.

2. Crawl your large SharePoint 2007 data stores first, and do so until you reach steady state.

For SearchBeta this typically means crawling the large repository for approximately 7 days. Three to four incremental crawls are then required to clear out any timeout errors seen in the initial full crawl. Keep an eye on your error count per crawl; when this number is low relative to the amount of content in the crawl and does not change from one crawl to the next, you have reached steady state (as in the sketch below). Once this state is reached, incremental crawls of very large repositories can take only a couple of hours if the change rate in the store is relatively low.

One trick I commonly use for the initial crawl of these sites is to start with a schedule that kicks off an incremental crawl every 30 minutes. This allows the successive incremental crawls to start in the middle of the night, when you are not around to see the crawl complete.
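To make the steady-state check concrete, here is a minimal Python sketch. The data layout and the 5% error threshold are purely illustrative assumptions, not product values; only the rule itself (errors low relative to content and unchanged between crawls) comes from the guidance above.

```python
# Sketch: decide whether successive crawls have reached steady state,
# per the rule above: error count low relative to the content crawled
# and not changing from one crawl to the next.
# The 5% threshold is an illustrative assumption, not a product value.

def at_steady_state(crawls, max_error_ratio=0.05):
    """crawls: list of (items_crawled, error_count) for consecutive crawls."""
    if len(crawls) < 2:
        return False
    last_items, last_errors = crawls[-1]
    prev_errors = crawls[-2][1]
    low_errors = last_errors <= max_error_ratio * last_items
    stable_errors = last_errors == prev_errors
    return low_errors and stable_errors

# Example: errors have settled at a small, unchanging count.
print(at_steady_state([(1_000_000, 40_000), (1_000_000, 5_000), (1_000_000, 5_000)]))  # True
```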

3. Do not schedule more than one "hungry" content source at a time (see the sketch after these guidelines).

4. Start with a minimum of 4 concurrent crawls. This is your starting point; use the data below to determine whether your system has the head-room to add additional concurrent crawls.

5. If you reach a starved state, it is best to pause your "hungry" crawls to let the remaining crawls complete.
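To make guideline 3 concrete, here is a small Python sketch that checks a daily schedule for overlapping "hungry" content sources. The (start hour, duration) schedule representation and the content source names are invented for illustration; only the rule of never crawling two "hungry" stores at once comes from the guidelines above.

```python
# Sketch: verify that no two "hungry" content sources overlap in a daily
# crawl schedule. The (start_hour, duration_hours) representation is just
# an illustration of the rule in guideline 3.

def hungry_overlaps(schedule):
    """schedule: list of (name, start_hour, duration_hours, is_hungry)."""
    hungry = [(name, start, start + dur) for name, start, dur, is_hungry in schedule if is_hungry]
    clashes = []
    for i, (n1, s1, e1) in enumerate(hungry):
        for n2, s2, e2 in hungry[i + 1:]:
            if s1 < e2 and s2 < e1:          # time windows intersect
                clashes.append((n1, n2))
    return clashes

schedule = [
    ("SharePoint 2007 portal", 0, 6, False),
    ("SharePoint 2003 archive", 2, 8, True),   # hungry
    ("People import",           4, 3, True),   # hungry, overlaps the 2003 crawl
]
print(hungry_overlaps(schedule))   # [('SharePoint 2003 archive', 'People import')]
```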

 

Determining if the crawler is in a starved state

The following data should be analyzed periodically while you are building and maintaining your crawl schedule(s), and you will want to look through it several times during this process. Initially you will use this information to create your content sources and crawl schedules, verifying that you are not starved before adding the next crawl schedule. Then you will want to look at this data at different times during the crawl, paying specific attention to the beginning and end of crawls that contain a large amount of data. Finally, you will want to review this data on a periodic basis: the content you are crawling will change and grow, and this growth may be enough to drive you into a starved state and cause you to miss your freshness goals.

First you need to understand how many crawl threads are used for your hardware and the maximum number of threads that can be used per host. These numbers are based on the number of processors the indexer has and the Indexer Performance setting in the Configure Office SharePoint Server Search Service Settings UI. You can also modify the number of crawl threads per host via Crawler Impact Rules.

These threads are the critical resource that gets starved. The goal of minimizing starvation is to make sure you are not constrained on this resource while maximizing its usage. As such, you want to avoid having more hosts in the crawl than you have threads to support, and you want the majority of these threads to spend only a small amount of time on any single document in the crawl.

The number of threads the system will use is based on the Indexer Performance setting and will be as follows. In large-scale environments it is recommended that you set this to Maximum, keeping in mind that you can use Crawler Impact Rules to reduce or increase the number of threads per host and thereby control the load you are placing on each repository:

  • Indexer Performance - Reduced
    • Total number of threads: number of processors
    • Max Threads/host:  number of processors
  • Indexer Performance - Partially reduced
    • Total number of threads: 4 times the number of processors
    • Max Threads/host: number of processors plus 4
  • Indexer Performance - Maximum
    • Total number of threads: 16 times the number of processors
    • Max Threads/host: number of processors plus 4

There is a hard-coded maximum of 64 crawl threads.
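As a quick illustration of the table above, here is a small Python sketch that computes the thread limits from the processor count and the Indexer Performance setting. The function name and structure are my own; the formulas and the 64-thread cap are the ones listed above.

```python
# Sketch: estimate crawler thread limits from processor count and the
# Indexer Performance setting, using the formulas described above.
# The 64-thread cap applies to the total thread count.

HARD_MAX_THREADS = 64

def crawl_thread_limits(processors: int, indexer_performance: str):
    """Return (total_threads, max_threads_per_host) for a given setting."""
    setting = indexer_performance.lower()
    if setting == "reduced":
        total, per_host = processors, processors
    elif setting == "partially reduced":
        total, per_host = 4 * processors, processors + 4
    elif setting == "maximum":
        total, per_host = 16 * processors, processors + 4
    else:
        raise ValueError(f"unknown Indexer Performance setting: {indexer_performance}")
    return min(total, HARD_MAX_THREADS), per_host

# Example: a 4-processor indexer set to Maximum gets 64 total threads
# and can devote at most 8 of them to any single host.
print(crawl_thread_limits(4, "Maximum"))   # (64, 8)
```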

Monitoring

1. The first thing to look at, and the most common bottleneck, is the pair of performance counters below for the Archival Plugin. If both are consistently at 500 for the Portal_Content instance or 50 for the ProfileImport instance, then you are in a starved state and are likely bottlenecked in SQL on I/O for the Crawl DB drive; look into tuning SQL for better I/O (an upcoming post will cover diagnosing SQL I/O bottlenecks and recommended practices for configuring SQL). A simple check is sketched after the counter list below.

  • The counters are in the object Office Server Search Archival Plugin
    • Total Docs in first queue
    • Total docs in second queue
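Assuming you have sampled those two counters over a window (for example by exporting them from Performance Monitor), a minimal sketch of the check might look like the following. The function and data layout are illustrative only; the 500/50 thresholds are the ones quoted above.

```python
# Sketch: flag Archival Plugin starvation from sampled counter values.
# Thresholds follow the post: 500 for Portal_Content, 50 for ProfileImport.
QUEUE_LIMITS = {"Portal_Content": 500, "ProfileImport": 50}

def archival_plugin_starved(instance, first_queue_samples, second_queue_samples):
    """True if both queues sit at their limit for every sample taken."""
    limit = QUEUE_LIMITS[instance]
    return (all(v >= limit for v in first_queue_samples) and
            all(v >= limit for v in second_queue_samples))

# Example: both queues pegged at 500 across the sampling window.
print(archival_plugin_starved("Portal_Content", [500, 500, 500], [500, 500, 500]))  # True
```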

2. Assuming you are not bottlenecked in the Archival Plugin, the following data is used to determine whether you are in a starved state. Crawl threads can be in one of four states: non-existent, idle, on the network, or in a plug-in. You can see which state they are in via Performance Monitor. Note that these counters change rapidly, so it is advisable to chart them over time to see trends and averages. Also, a thread will not stay in the idle state for an extended period; if there is consistently no work for a thread to do, it will be terminated.

  • The counters are in the object Office Server Search Gatherer
    • Idle Threads – These threads are not currently doing any work and will eventually be terminated. If you consistently have more than Max Threads/Host idle threads, you can schedule an additional crawl (see the sketch after this list). If this number is 0, you are starved: do not schedule another crawl in this time period, and analyze the durations of your crawls during this window to see whether they are meeting your freshness goals. If your goals are not being met, you will need to reduce the number of crawls running in this window.
    • Threads Accessing Network – These threads have sent or are sending their request to the remote data store and are either waiting for a response or consuming and filtering that response. You can distinguish waiting on the network from filtering the document by looking at a combination of CPU usage and network usage counters. If this number is consistently high, you are either network bound or bound by a "hungry" host. If you are not meeting your crawl freshness goals, you can either change your crawl schedules to minimize overlapping crawls or look at the remote repositories you are crawling and optimize them for more throughput.
    • Threads In Plug-ins – These threads have the filtered documents and are processing them in one of several plug-ins. This is where the index and property store are built. If this counter is consistently high, check the Archival Plugin counters mentioned above.
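As a rough illustration of how to read the Idle Threads counter over a sampling window, here is a small sketch. The averaging approach and the verdict strings are my own rendering of the guidance above; the Max Threads/Host figure comes from the Indexer Performance table earlier in the post.

```python
# Sketch: interpret averaged "Idle Threads" samples from the
# Office Server Search Gatherer object, per the guidance above.

def idle_thread_verdict(idle_samples, max_threads_per_host):
    avg_idle = sum(idle_samples) / len(idle_samples)
    if avg_idle == 0:
        return "starved: do not schedule another crawl in this window"
    if avg_idle > max_threads_per_host:
        return "head-room available: an additional crawl can be scheduled"
    return "near capacity: hold off on adding crawls"

# Example: averaging well above Max Threads/Host (8 on a 4-proc indexer at Maximum).
print(idle_thread_verdict([12, 10, 14, 11], 8))
```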

3. Given the above information, you know how many threads can be active at a given time and the maximum number of hosts that can be crawled concurrently. With this information and the performance counters above, you will see starvation occur in four different ways:

a. Starved by time spent in the Archival plug-in. The only way to fix this is to improve I/O latency on your SQL machine, notably on the spindles hosting the Query portion of the SharedServices_Search_DB database. Stay tuned for a white paper discussing how you can separate the Query data from the Crawl data into separate file groups within SQL, allowing you to tune the disks behind these two key pieces of data individually.

b. Starved by one or more "hungry" data stores. The crawler has a limited set of threads it can allocate to perform a crawl, and even a single "hungry" host being crawled starves the gatherer slightly, because threads in use for that host are not quickly made available for the next item in the queue. The problem is dramatically worse with multiple "hungry" hosts. It is recommended that you identify your "hungry" hosts (see the discussion above of the key types of "hungry" stores) and build out your crawl schedules so that you never have more than a single big "hungry" host being crawled simultaneously.

c. Starved by a large number of hosts. Again, there is a limited number of crawl threads; this, coupled with the number of threads per host, sets a very hard limit on the number of hosts that can be crawled concurrently. If the crawler is maxed out on the number of hosts, adding another host will not only starve that host but will also starve all other hosts in the crawl, increasing the overall duration of all of the concurrent crawls and reducing the likelihood that the system will be able to maintain a steady state. The recommended solution is to reduce the number of concurrent crawls; a rough capacity check is sketched after this list.

d. Starved by a crawl queue predominantly filled with items from a single host. This state is caused by a host that contains a lot of content laid out in a manner that is very wide and not very deep. All types of data stores can exhibit this behavior, but it is easiest to describe with the file system: if you have a directory structure with a single folder containing a hundred thousand documents, you will see this type of starvation. Effectively the crawl queue is filled with these 100k items; the first 8 threads (the number is hardware dependent) are able to work on these items, but due to the threads/host limit, and because no other host has work available, the remaining threads will not do any work. Three types of stores always exhibit this: BDC, People Imports, and People crawls, as they are all flat containers. The recommended solution is to treat these types of stores as "hungry" stores and follow the recommendation of limiting the number of concurrently crawled "hungry" stores to one.
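As a rough illustration of case c, the sketch below estimates host capacity from the thread numbers discussed earlier. The exact relationships are an assumption for illustration; the thread figures themselves come from the Indexer Performance table above.

```python
# Sketch: rough capacity check for "starved by a large number of hosts"
# (case c above). total_threads and max_threads_per_host come from the
# Indexer Performance table (e.g. 64 and 8 for a 4-processor indexer at Maximum).

def host_capacity(total_threads, max_threads_per_host):
    return {
        # More concurrent hosts than crawl threads guarantees that some
        # hosts get no thread at all.
        "hard_limit": total_threads,
        # Rough point at which hosts stop getting their full per-host
        # thread allocation (an illustrative estimate, not a product formula).
        "full_speed_hosts": max(1, total_threads // max_threads_per_host),
    }

# Example: a 4-processor indexer at Maximum (64 threads, 8 per host).
print(host_capacity(64, 8))   # {'hard_limit': 64, 'full_speed_hosts': 8}
```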

This post took a lot of effort from other members of the team. I would like to explicitly thank Sid, Mircea, and Joe for their help in putting it together.

Thanks, and I look forward to speaking with you all in a few weeks. The next post should cover SQL monitoring.

Dan Blood
Senior Tester
Microsoft Corp