SQL Monitoring and I/O

Hello again. Interest in the existing posts is starting to climb. As always, feel free to post comments or questions on any of the posts and I will either answer your questions directly or add them as topics for the next set of posts.

This week I wanted to call out some information about monitoring SQL from a Search perspective and provide some guidance on configuring the SQL machine. There are many great articles out there that already document much of this information, so I am attempting to consolidate it and provide you with my "reading list." I'll also summarize the key areas you should focus on when working with the SQL box that Search is using.

Primary list of documents to read through, in recommended reading order:

1. Planning and Monitoring SQL Server Storage for SharePoint - This is a good document to start with as it discusses many of the topics at a high level. However, it does not go into much detail about how Search uses SQL.

2. SQL Storage Top 10 Best Practices - A great primer for issues to consider when building out hardware for the SQL box in a Search deployment.

3. Optimizing tempdb Performance - This is very applicable to Search and I strongly recommend you follow the guidance of this article. The tempdb is used by every end-user query executed, and the crawler also makes reasonable use of the tempdb. Making sure the tempdb performs well will directly impact the throughput and latencies of end-user queries. (A sketch of what this guidance can look like in practice follows the reading list below.)

a. Working with tempdb in SQL Server 2005 - A more detailed description of how and when the tempdb is used. Useful as supporting documentation for why you should optimize tempdb performance.

4. SQL Predeployment I/O Best Practices - This is a great article discussing the steps you should take to validate the I/O system of a SQL box prior to going into production.

5. Troubleshooting Performance Problems in SQL Server 2005 - This is a great paper discussing all of the various bottlenecks that SQL can have.
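
As promised above, here is a minimal sketch of what following the tempdb guidance can look like on the SQL box Search is using. The T:\ drive letter, the file sizes, and the assumption of four CPU cores are purely illustrative; size and place the files to match your own spindle layout and processor count.

USE master;
GO
-- Pre-size the default tempdb data and log files so tempdb does not rely on autogrow.
-- (tempdev and templog are the default logical file names in SQL Server 2005.)
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, SIZE = 8GB, FILEGROWTH = 512MB);
ALTER DATABASE tempdb MODIFY FILE (NAME = templog, SIZE = 2GB, FILEGROWTH = 512MB);
-- Add one equally sized data file per additional CPU core (this example assumes 4 cores).
ALTER DATABASE tempdb ADD FILE (NAME = tempdev2, FILENAME = 'T:\TempDB\tempdb2.ndf', SIZE = 8GB, FILEGROWTH = 512MB);
ALTER DATABASE tempdb ADD FILE (NAME = tempdev3, FILENAME = 'T:\TempDB\tempdb3.ndf', SIZE = 8GB, FILEGROWTH = 512MB);
ALTER DATABASE tempdb ADD FILE (NAME = tempdev4, FILENAME = 'T:\TempDB\tempdb4.ndf', SIZE = 8GB, FILEGROWTH = 512MB);
GO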

As hinted at above, Search uses SQL in a very I/O-intensive fashion. It is sensitive to I/O latencies on the tempdb and on the Query and Crawl file groups. The basic recommendation from the SQL team is to keep your latency (Avg. Disk sec/Read and Avg. Disk sec/Write) under 20 ms for acceptable performance. For Search I would strongly recommend the following (a query for spot-checking these numbers from inside SQL Server follows the list):

  • 10ms or less for the tempdb. Both Search and content hosting make heavy use of the tempdb; at this scale it is recommended that you split the content databases away from the SSP/Search databases.
  • 10ms or less for the Query file group
  • 20ms or less for the Crawl file group
  • 20ms or less for the database log files
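
If you want to spot-check these numbers from inside SQL Server rather than through the Avg. Disk sec/Read and Avg. Disk sec/Write counters, one option is the sys.dm_io_virtual_file_stats DMV in SQL Server 2005. Keep in mind that it reports averages accumulated since the last service restart, so treat it as a sanity check rather than a replacement for the performance counters. Something like:

-- Average read/write latency per database file since the last service restart.
-- Compare against the targets above: 10ms for tempdb and the Query file group,
-- 20ms for the Crawl file group and the database log files.
SELECT DB_NAME(fs.database_id) AS database_name,
       mf.name AS logical_file_name,
       fs.num_of_reads,
       fs.num_of_writes,
       CASE WHEN fs.num_of_reads = 0 THEN 0
            ELSE 1.0 * fs.io_stall_read_ms / fs.num_of_reads END AS avg_read_latency_ms,
       CASE WHEN fs.num_of_writes = 0 THEN 0
            ELSE 1.0 * fs.io_stall_write_ms / fs.num_of_writes END AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS fs
JOIN sys.master_files AS mf
  ON mf.database_id = fs.database_id
 AND mf.file_id = fs.file_id
ORDER BY avg_read_latency_ms DESC;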

As you'll note below, SearchBeta is currently unable to do this; we are in the process of obtaining new hardware (more disks) to rectify it. Knowing I had limited hardware going into this project, I allocated more spindles to the tempdb and the Query file group than to the other files. See the SearchBeta Hardware definition for more details.

With our hardware we are close to meeting our crawl freshness goal of less than 24 hours for the high-value repositories. We see 24-hour freshness for the smaller sites and around 50 hours for the bigger ones. Our query latencies tend to be in the 2-second range for 95% of our measured queries.

IOPs, Throughput and Latencies:

Drive            | IOPs Read (max) | IOPs Write (max) | Ratio Read/Write | Throughput Read (bytes) | Throughput Write (bytes) | Latency Read (sec) | Latency Write (sec)
Search DB Logs   | 14.67           | 1,777.29         | 0.01             | 901,126                 | 64,557,167               | 0.3060             | 0.8550
Temp DB          | 1,110.98        | 1,492.01         | 0.74             | 72,808,827              | 97,770,866               | 1.6870             | 3.5660
Query file group | 3,507.26        | 1,631.96         | 2.15             | 148,370,386             | 126,034,214              | 3.4360             | 3.2140
Crawl file group | 3,043.93        | 371.65           | 8.19             | 104,533,884             | 10,261,624               | 15.0840            | 15.8720

For comparison purposes I have included the current IOPs, throughput, and latency numbers that I am getting on SearchBeta. These numbers should be useful as a starting point for configuring your hardware. However, you should make sure that you test your pre-deployment environment to verify that you are within the recommended latency ranges. It is also recommended that you baseline your production system and periodically spot-check to verify that you have not deviated from that baseline. Over time you will see growth in the amount of content you are indexing and in the volume of queries being executed. Having a baseline in place, and a process to verify the live system latencies, will allow you to detect problems sooner.
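
One lightweight way to keep that baseline, if you prefer to stay inside SQL Server, is to snapshot the file-level I/O statistics on a schedule and diff two snapshots to get the IOPs and latency for the interval between them. The table and schema names below are illustrative assumptions; put the table wherever you keep your administrative data.

-- Baseline table for periodic I/O snapshots (names are illustrative).
CREATE TABLE dbo.IoBaseline
(
    capture_time      datetime NOT NULL DEFAULT GETDATE(),
    database_name     sysname  NOT NULL,
    file_id           int      NOT NULL,
    num_of_reads      bigint   NOT NULL,
    num_of_writes     bigint   NOT NULL,
    io_stall_read_ms  bigint   NOT NULL,
    io_stall_write_ms bigint   NOT NULL
);
GO
-- Run on a schedule (for example, from a SQL Agent job) to record a snapshot.
INSERT INTO dbo.IoBaseline
    (database_name, file_id, num_of_reads, num_of_writes, io_stall_read_ms, io_stall_write_ms)
SELECT DB_NAME(database_id), file_id, num_of_reads, num_of_writes, io_stall_read_ms, io_stall_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL);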

I've dedicated most of this post to I/O intentionally, as it will be the key bottleneck that you will want to architect the system around. The white paper Planning and Monitoring SQL Server Storage for SharePoint discusses memory in a little more detail. In general, if you are deploying a larger-scale (more than 10 million indexed documents) MSS deployment, your SQL box should be OK with 16GB of RAM. However, if you are deploying a MOSS environment where you will be hosting People, Usage Analysis, and other MOSS features, you will want to start with 32GB.
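
If you cap SQL Server's memory with the max server memory setting, make sure the cap lines up with whichever of these figures applies to your deployment. A quick example (the 16GB value below is just the MSS-only figure from this paragraph, not a recommendation for every environment):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
-- 16384 MB = 16GB; substitute the value that matches your deployment.
EXEC sp_configure 'max server memory (MB)', 16384;
RECONFIGURE;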

That is all for this post. I know you are all waiting for information on SQL maintenance and on how to create the Crawl and Query file groups. These are the next items on my priority list, and I hope to have the details out soon.

-Thanks

Dan Blood
Senior Test Engineer
Microsoft Corp