INFORMATIONAL: FULL TEXT SEARCH ENGINE & FULL TEXT INDEX POPULATION INTERNALS

Well I’ve looked around in our blog and it pained me to see there were no posts on Full Text Search; an often misunderstood and under-utilized feature that SQL Server provides. Wait no more, here comes the first one (of many I hope) :) . In this post I’ve tried to make things simple & help understand how FTS works in SQL Server 2005.

For you Search enthusiasts and many bewildered DBA’s, you may have wondered as to what exactly is happening inside SQL Server when you try to populate a full text index. Where do you look to know if things are moving along, how to find what the stage of population and how to detect if things are not exactly hunky-dory.

These are some things that you need to know before we get into the finer details:-

1) A full-text index is not a normal index and it is not even remotely similar to a normal index on a table.

2) It is token based and is created & maintained by the MS SQL Full-text Engine (MSFTESQL.exe)

3) It stored in a compressed format and what’s more it’s not even stored inside your database files. The actual FT Index resides on your hard disk.

4) To build a full-text index there are lots of hands involved including the mother-ship (SQL Server), the FT-Engine and a 3rd process called the Filter Daemon (more on this later).

                                                                    clip_image002

Some of the things that are stored in a full text index are:-

a) Keyword – the token generated by the word-breaker.

b) ColID – represents the table + the column that is enabled for FTS.

c) DocID – Row equivalent in a FTIndex.

d) Occurrence – The offset of the keyword within the DocID.

The process of filling up these indexes is called as Population or Crawling. There are 2 ways for you to achieve this – Manual and Automatic (using change tracking mechanism). When population is initiated by the user, the data is pushed to MSFTESQL process by SQLSERVR. Based on the type of data it will break it down, translate, apply grammar rules and index the data. The grammar is decided by a component called the word-breaker which you choose when you setup the full-text index. An interesting thing happens here wherein 1 full text index can contain multiple fragments. At the end of this population, the FT Engine tries to make this into a single master index. This process is called master merge.

                                                          clip_image004

Think back to what I said sometime earlier about the index structure. The DocID and Occurrence are the values that are stored compressed. This is done to save space & IO cost. The master merge process is the one responsible for merging these fragments & Docid’s in them. This makes sense since it’s cheaper to access for a single processing rather than every time as to access DocID/Occ they have to be uncompressed, saved and compressed again.

So if you see something like this in your log, you now understand what has happened.

2008-11-04 12:38:44.47 spid23s Informational: Full-text Full population completed for table or indexed view '[DB].[dbo].[FTTable]' (table or indexed view ID '12345', database ID '5'). Number of documents processed: 7712. Number of documents failed: 0. Number of documents need retry: 0.

2007-10-09 20:12:51.73 spid21s Changing the status to MERGE for full-text catalog "SampleCataloge" (5) in database "SampleDB" (9). This is an informational message only. No user action is required.

When I said log, I meant the FTlog that is created for every catalog. By default this is located in the SQL Server Log Folder, the same place as the Errorlogs. Every population (manual & automatic), change tracking, master merge, failures are all logged in this FTLog.

The naming of these error logs is in the format : SQLFT <dbid> <catID>.LogN

E.g. If DBID is 6 and Catalog ID is 5, the log would be named SQLFT000060005.log

A small note on performance: Depending on the size of the catalog and the number of tables enabled for full-text indexing, the master-merge process can be I/O intensive. Now how do I find out when I am populating what is going on and how to figure out its not stuck anywhere.

1) Look at the Status column and these are some expected values:

select * from sys.dm_fts_index_population

Status 5 : Processing normally

Status 8 : Encountered row-level errors that will be retried

Status 7 : Has stopped processing

2) Look at the Status column and below are some expected values. Also the item count will indicate the number of items that are currently stored in the FTCatalog. Generally when a population is in progress this value is expected to increase.

select fulltextcatalogproperty(‘catalog_name’, ‘PopulateStatus’)

select fulltextcatalogproperty(‘catalog_name’, ‘ItemCount’)

Status 1 : Full population in progress

Status 6 : Incremental population in progress

Status 7 : Building index

Status 4 : Recovering

3) Another way is to track the size of the .CI/.PI files in the catalog folder. This is a manual process and makes sense when you are populating a large table with lot of documents. In this case, the longer the population goes on, these files are expected to grow in size & number.

Hope this post cleared things on the internals of the full text engine.

In the next post I will cover troubleshooting full text population & indexing failures (yes, IFilter issues too J) . Also touch upon full text performance best practices. If time permits I will try to talk about SQL 2000 FTS. Stay tuned for more …..

Sudarshan N
Technical Lead | Microsoft SQL Server