Quick note about index backup

As you may already know, the only way to restore an index without having to recrawl is to use the out-of-the-box backup mechanism, either in the UI or via the STSADM catastrophic backup option. Using STSADM also lets you schedule the backup, but I was asked a question the other day that I couldn't immediately answer.

Does STSADM allow you to back up only selected items, such as the index, or do you have to perform a backup of everything?

The concern was that in our current design the collab sites share the same farm with the SSP, so all the collab content would have to be backed up in addition to the SSP when using the out-of-the-box mechanism. Obviously, that wouldn't be good for a large farm, where you depend on some sort of differencing mechanism to ensure you can back up all content within the backup window. Well, I tested it and found that the SharePoint folks were on top of this (as expected, right :-) ) and created an STSADM option that lets you choose which item you want to back up, just like you do in the GUI. The syntax is:

stsadm -o backup -directory <SharePath> -backupmethod <full | differential> -item <object name>

In our case, <object name> is the SSP name, which allows us to back up only the content needed to restore the SSP/index. Of course, you can choose any item you can see in the UI backup option, including individual content databases.
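To get the exact item names to pass to -item, stsadm will show you the same tree of objects you see in the UI. Putting it together looks something like the following; the SSP name "SharedServices1" and the share path are made-up placeholders, and an item name that contains spaces needs to be quoted:

stsadm -o backup -showtree

stsadm -o backup -directory \\BACKUPSERVER\SPBackup -backupmethod full -item "SharedServices1"

And since this is just a command line, scheduling it (as I mentioned above) is simply a matter of dropping the command into a batch file and registering it with the Windows Task Scheduler; for example (task name and batch file path are placeholders too):

schtasks /create /tn "Nightly SSP Backup" /sc daily /tr "C:\Scripts\SSPBackup.cmd"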

A note about capacity and backup performance. The MSIT index contains about 25 million items and the component sizes are as follows:

Search DB: ~370GB

SSP DB: ~65GB

Index (on the file system): ~150GB

When backing up this content (selecting just the SSP in backup) I'm told I need about 700GB of storage, but you'll notice that the cumulative size of this content is only about 585GB. I'm not quite sure why it asks for the additional ~100GB of space, but I do know that the backup only consumes 413GB after it completes. More investigation is needed to understand the difference; however, my buddy Sam Crewdson tells me that DB fragmentation contributes greatly to the overall backup size. When he defragments the DB, the size of the backup files is reduced.
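If you want to see how fragmented your search database actually is, standard SQL Server maintenance tooling is all it takes. Something along these lines works against SQL Server 2005; the server and database names below are placeholders, and the 30% filter is just the usual rule-of-thumb threshold for a rebuild:

sqlcmd -S SQLSERVER01 -d SSP_SearchDB -E -Q "SELECT OBJECT_NAME(object_id) AS TableName, index_id, avg_fragmentation_in_percent, page_count FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') WHERE avg_fragmentation_in_percent > 30 ORDER BY avg_fragmentation_in_percent DESC"

Reorganizing or rebuilding the worst offenders with the normal ALTER INDEX maintenance is what brings the fragmentation (and, per Sam, the backup size) back down.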

Regardless, backup performance seems to be a function of four things: how large the content is, the network limitations, the hardware limitations, and where the backup share lives. For the longest time, MSIT backed up the index to a share on the index server. The problem with this approach is that the search DB (residing in SQL) is usually the largest component and has to be streamed from the SQL server across the network to the index server. That's not cool. Even with Gig/E, it can take quite a while to transfer that much data. A much better approach is to put the share on the SQL server. Now only the index file, which is usually the smaller of the two, has to cross the network. After making the change, MSIT saw a major decrease in backup duration, in the neighborhood of 60%.
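In practice, that change is nothing more than pointing -directory at a UNC share hosted on the SQL box; the server and share names here are made up:

stsadm -o backup -directory \\SQLSERVER01\SPBackup -backupmethod full -item "SharedServices1"

The index files still travel from the index server to the share, but the much larger search DB backup now stays local to the SQL server.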

So how long? Well, in my lab with nothing else going on, I can back up the aforementioned SSP in about 4 hours. That's roughly 100GB/hour, and that's with some pretty awesome hardware. Your mileage will vary, of course.

Mike