Understanding SharePoint - Part 1 - Understanding the SharePoint Portal Server Indexer

This is the first in a series of posts about not being misled by what you're seeing :)

In the next post in this series, I'll talk about the infamous Query Plan Bug and the origins of SPSiteManager.


Be careful not to misinterpret what you're seeing from the SharePoint Portal Server Indexer.

I recently worked with a customer who was experiencing (well, appeared to be experiencing) poor crawl performance from SharePoint Portal Server 2003. I'll follow up this posting with results from the MOSS 2007 implementation of Search, but I suspect the results will be much the same.

The key here is to ensure that you are not misinterpreting the data you are being presented with. Keep the following two items in mind when examining your own portal.

Item 1: Certain iFilters can lead you to think that your crawler is not working effectively, and some really can have a dramatic impact on performance


Crawl rates can easily be misinterpreted when you have 3rd party iFilters installed. This section presents some of the guidance we give on the impact that certain iFilters can have on your environment. I am "NOT" saying that these iFilters are "BAD", and you should not misinterpret me as saying that :) I'm just noting the impact they can have on your environment that you need to be aware of :).

Tests were run on a standalone Dell Precision 470 server with an Intel Xeon CPU running at 3.0 GHz and 3.0 GB of physical memory. SharePoint Portal Server and Windows SharePoint Services were both at Service Pack 2. This was a single-server deployment of SharePoint, with SQL Server 2000 at SP4 loaded on the same server.

Performance Counters Used

SearchGathererProjects\Processed Documents Rate (hereafter referred to as PDR in this document)

This counter identifies “The number of documents processed per second”
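If you want to watch this counter over a crawl rather than eyeball Performance Monitor, you can sample it from the command line with `typeperf` and average the results. Below is a minimal sketch; the counter path shown is a placeholder, so verify the exact object and counter names in Performance Monitor on your own indexing server before using it.

```python
import csv
import io
import subprocess

# Placeholder counter path -- confirm the real object/counter names
# in Performance Monitor on your indexing server.
COUNTER = r"\SearchGathererProjects\Processed Documents Rate"


def parse_typeperf_csv(output: str) -> list[float]:
    """Parse typeperf CSV output into a list of per-sample counter values."""
    rows = list(csv.reader(io.StringIO(output)))
    samples = []
    for row in rows[1:]:  # first row is the CSV header
        if len(row) >= 2 and row[1]:
            try:
                samples.append(float(row[1]))
            except ValueError:
                pass  # typeperf emits blanks for missed samples
    return samples


def sample_pdr(count: int = 10, interval: int = 5) -> list[float]:
    """Collect PDR samples with typeperf (Windows only):
    -sc = number of samples, -si = seconds between samples."""
    out = subprocess.run(
        ["typeperf", COUNTER, "-sc", str(count), "-si", str(interval)],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_typeperf_csv(out)

# Usage (on the indexing server):
#   values = sample_pdr()
#   print(f"average PDR: {sum(values) / len(values):.1f}")
```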

  1. A standard set of test files was uploaded both to the default document library at the root Portal Area level and to a document library in a test Windows SharePoint Services site, such as https://server/sites/testsite
  2. The set of seed files used for testing consisted of 18 Zip files (which unzipped to about 800 individual files: a mix of .txt, .doc, .ppt, .xls, etc.), 16 PDF files, and 8 standalone Word documents.
  3. Initial testing with no 3rd party iFilters installed resulted in an average PDR of around 22 documents processed per second, with a maximum of around 116.
  4. The addition of the Adobe iFilter at version 6.0 resulted in an approximate 40 to 50 percent performance reduction. The average PDR dropped to around 9 with a maximum of 72.
  5. A Zip iFilter at version 1.2 from the I-Filter Shop was loaded and the Adobe iFilter was removed. This resulted in about the same performance hit as the Adobe iFilter caused.
  6. The Zip iFilter only added a total of 18 items to the index item count, so the apparent performance hit is misleading: in reality it added roughly 800 unzipped items to the index.
  7. With both the Zip iFilter and Adobe PDF iFilter installed and added to the list of file types to be crawled, performance dropped by another 10 to 15 percent, for a total performance hit of around 65 percent. Again, this is a bit misleading, as the Zip iFilter actually added roughly 800 additional items to the index, while each Zip file appears as only a single item in the total for the Search catalog.

Performance Conclusion

  1. Installation of the PDF iFilter will have the most adverse impact on indexing performance, because it is a single-threaded iFilter.
  2. Although we appear to take a large performance hit with the Zip iFilter, actual throughput may not be hampered, because we are really adding numerous items to the index rather than only the one that appears in the index item count.
  3. The Zip iFilter provided by the I-Filter Shop does indeed run as a multi-threaded filter.
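To see why a single-threaded filter hurts so much, consider a toy model where the gatherer can hand documents to an iFilter on multiple threads at once. This is only an illustration of the principle, not how SharePoint schedules work internally, and all the numbers are hypothetical:

```python
import math


def batch_filter_time(docs: int, seconds_per_doc: float, filter_threads: int) -> float:
    """Rough wall-clock time to filter a batch of documents, assuming the
    work splits evenly across the filter's threads (a simplification)."""
    return math.ceil(docs / filter_threads) * seconds_per_doc

# Hypothetical workload: 800 documents at 0.5 seconds each.
single_threaded = batch_filter_time(800, 0.5, 1)  # 400.0 seconds
multi_threaded = batch_filter_time(800, 0.5, 4)   # 100.0 seconds
```

Under this model, a single-threaded filter serializes the whole batch, while a multi-threaded filter like the I-Filter Shop Zip filter lets the gatherer keep several documents in flight.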

Item 2: Don't misinterpret your gatherer logs.


You may find many "Delete" entries in your gatherer log for URLs that no longer exist; the delete entries appear crawl after crawl, yet the items never seem to be removed from the index.

By default, SharePoint will not remove an index entry until 3 consecutive failed crawls have occurred (whether full or incremental). The reason for this is that we don't want to remove the entry immediately: there may have been a temporary network issue, and if we deleted on every failed crawl, it could mislead users. For instance, if you had an alert on a document library, you would constantly see "New content found" alerts for the same document between crawls because of a flaky network connection. Thus, if we can't connect to the target source after 3 attempts, we consider it a dead link.
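The rule above can be sketched as a tiny state machine per crawled URL. This is just an illustration of the described behavior, not SharePoint's actual implementation:

```python
class CrawledItem:
    """Tracks consecutive failed crawls for a URL, illustrating the
    "delete after 3 consecutive failures" behavior (a simplified model,
    not SharePoint's real code)."""

    DELETE_THRESHOLD = 3

    def __init__(self, url: str):
        self.url = url
        self.consecutive_failures = 0

    def record_crawl(self, reachable: bool) -> str:
        if reachable:
            # Any successful crawl resets the counter.
            self.consecutive_failures = 0
            return "kept"
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.DELETE_THRESHOLD:
            return "delete"  # send a Delete transaction to the index
        return "kept"        # transient failure, tolerated for now

# A flaky network blip does not evict the item:
item = CrawledItem("https://server/sites/testsite")
item.record_crawl(False)  # -> "kept" (1 failure)
item.record_crawl(True)   # -> "kept" (success resets the counter)
```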

What you may actually be seeing is hits to sites from the portal's "Sites" content source. Regardless of whether there is an entry in the index, if a site is listed in the site directory, it will "ALWAYS" be checked by the crawler. In this case, the site is gone, so every attempt fails. Thus, after 3 attempts, the crawler sends a Delete transaction to the index to remove the item. The same is true of a content source whose target URL points to a site that no longer exists. Until you get rid of that site reference, or the content source, you will continue to see those entries in the gatherer log.

I've added some new features to SPSiteManager for just this reason; they allow you to clean up dead entries from your site directory and from the list of sites to be crawled.

Hope this helps!

 - Keith