What may happen when I crawl MILLIONS of files in MOSS/MSS? Part I

1. On the MOSS box, CPU usage stays very high for several hours. The target system may also suffer from low performance.

This can happen in several situations, especially after you have changed the crawler impact rules. By default, MOSS/MSS requests 8 files at a time from a single server, and you can raise this to 64 at most. But remember, although a higher value can sometimes help crawl speed, it will hurt the performance of both the MOSS box and the target system. So, unless you have a special need, do not set this value too high. For low-performance servers, you may instead want to increase the interval between two file requests.
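
To make the trade-off concrete, here is a minimal Python sketch of what "N files at a time" and "interval between requests" mean for the target server. It is a generic illustration, not how MOSS/MSS implements crawler impact rules: `crawl`, `FakeServer`, `max_simultaneous` and `delay_seconds` are all hypothetical names made up for this example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def crawl(paths, fetch, max_simultaneous=8, delay_seconds=0.0):
    """Fetch every path with at most max_simultaneous requests in flight."""
    def worker(path):
        result = fetch(path)
        if delay_seconds:
            # Pause before this worker picks up its next file.
            time.sleep(delay_seconds)
        return result

    # The pool size is the cap on simultaneous requests to the server.
    with ThreadPoolExecutor(max_workers=max_simultaneous) as pool:
        return list(pool.map(worker, paths))


class FakeServer:
    """Stand-in for a target system; records the peak number of open requests."""

    def __init__(self):
        self._lock = threading.Lock()
        self.in_flight = 0
        self.peak = 0

    def fetch(self, path):
        with self._lock:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
        time.sleep(0.01)  # pretend the server needs 10 ms per file
        with self._lock:
            self.in_flight -= 1
        return path


if __name__ == "__main__":
    server = FakeServer()
    crawl([f"doc{i}" for i in range(40)], server.fetch, max_simultaneous=8)
    print("peak simultaneous requests:", server.peak)
```

Raising `max_simultaneous` shortens the crawl only as long as the target can keep up. Setting it to 1 with a non-zero `delay_seconds` is the gentle option for a slow server.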

Meanwhile, crawl schedules should be adjusted so that the target system is not impacted during business hours.

2. The crawl takes too long. Only ~30,000 files can be crawled per hour.

Look for the bottleneck first. You can use monitoring tools to watch the bandwidth, CPU usage, SQL box performance and so on. But don't forget to check your NIC. Let's say you have a 100 Mbit/s connection to the intranet. The theoretical maximum is 12.5 MB per second, but on average you can expect 8~10 MB per second, which means 480~600 MB per minute, or 29~36 GB per hour. Considering other factors, the real figure is usually somewhat less than 30 GB per hour.

Then take a look at the content you are crawling. If the average size of your files is about 1 MB, which is very common for a mixed set of PPT/DOC/XLS files, then 30 GB per hour naturally works out to only about 30,000 files per hour.
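
The back-of-the-envelope calculation above can be written as a few lines of Python, so you can plug in your own numbers. This is only a rough estimate under the same assumptions as the text (the network is the only bottleneck, 1 GB is counted as 1000 MB); the function names are made up for this example.

```python
def link_ceiling_mbytes_per_second(link_mbits_per_second):
    """Theoretical maximum of a link in MB/s (8 bits per byte, no overhead)."""
    return link_mbits_per_second / 8


def files_per_hour(mbytes_per_second, avg_file_mbytes):
    """Upper bound on files crawled per hour when the network is the bottleneck."""
    mbytes_per_hour = mbytes_per_second * 3600
    return mbytes_per_hour / avg_file_mbytes


if __name__ == "__main__":
    print("100 Mbit/s ceiling:", link_ceiling_mbytes_per_second(100), "MB/s")
    # 8 and 10 MB/s are the realistic throughput figures used in the text.
    for rate in (8, 10):
        gbytes = rate * 3600 / 1000
        files = files_per_hour(rate, avg_file_mbytes=1)
        print(f"{rate} MB/s -> {gbytes:.1f} GB/hour -> {files:,.0f} files/hour")
```

With 1 MB files this gives 28,800 to 36,000 files per hour, which matches the ~30,000 observed.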

So, increasing your network bandwidth is a key to crawl speed.

Sometimes there is nothing wrong with the MOSS box and nothing wrong with your network bandwidth; the target system is simply too slow, for example an old Domino server. In this case, please refer to point 1.

To be continued...