SQL FTS IFilters, its security implications and troubleshooting PDF Filter issues

Today I am going to talk about FTS IFilters, its security considerations and some basics on troubleshooting issues with PDF Filters. Read on and be enlightened…

What is a Word-Breaker?

It is a component within the FTEngine that’s responsible to perform linguistic analysis of the data. In other words, it’s responsible for breaking down words to their simplest form which we call a “token”. This is how we create and populate the FTIndex which are stored in your FTCatalog.

The word-breakers available for SQL Server out of the box are located in the %\MSSQL.X\MSSQL\Binn\ directory. LangWrbk.dll is the English language word breaker.

Eg.1: database can be broken down into “data” and “base” as both are proper words in the English Dictionary.

Eg.2: SQL 2008 has an in-built DMV that you can use to see how a given search string can be broken down.

SELECT* FROM sys.dm_fts_parser('"This is test" AND “This also"',1033,0,0)

Group_id

keyword

occurrence

Special_term

Display_term

Expansion_type

Source_term

1

0x0074006800690073

1

Noise Word

This

0

This is test

1

0x00690073

2

Noise Word

is

0

This is test

1

0x0074006500730074

3

Exact Match

test

0

This is test

2

0x0074006800690073

1

Noise Word

this

0

This also

2

0x0061006C0073006F

2

Noise Word

also

0

This also

 

What is the 45-sec delay all about?
You might have experienced issues with fulltext queries in SQL running slow for the 1st time but subsequent executions are fast. This is a known issue described in this KB article - https://support.microsoft.com/kb/915850

Any binaries which are loaded into the MSFTESQL Service are expected to be signed. This check is done by having the verify_signature option turned ON by default. When this happens AND if the machine does not have internet access, there is no way to verify the signature. Internally there are 3 tries to verify the signature each with a 15sec timeout and that is why you see the 45-sec delay when you run your search query.

If this word-breaker is not used for 5 minutes, it gets unloaded to reduce resource usage. So the next time query runs it will experience a 45-sec time delay.

How does SQL manage the signature?

Any IFilter manufacturer has the option to sign their binaries using a Certificate. This has to be done when developing the binaries. We do the same using VeriSign certificates.

Eg: PDF, MSG formats are some popular Ifilters available on the internet by many 3rd party vendors.

What’s different with SQL 2005?

 SQL Server 2005 hosts its own full text service (MSFTESQL) whereas in SQL 2000, it relied on a shared OS component called Microsoft Search (MSSearch). This can be made use by applications like Exchange, Sharepoint etc to enhance their search capabilities. With a new service Microsoft came up with a multitude of options to control what can work and what cannot within the Full Text Search in SQL Server.

You can see the available options you have under “sp_fulltext_service

What should I do for security of SQL with respect to loading custom Filters?

Opening this server to Internet access is one option you can think about. I would recommend that you turn off verify_signature if you do not want the 45-sec delay. Opening up the server port 80 for internet access presents a far greater security risk than disabling verify_signature option.

If you have security mechanisms in place which guarantee that people cannot install any filters on the server (without proper permissions), there should not be any problem as far as security of MSFTESQL is concerned.

In fact some iFilters which are not signed only work if we turn off this check. There are many customers who are using this method.

You can check the current document types and filters usable by SQL server by running “select * from sys.fulltext_document_types

You can check the current settings of signature load using this query,

SELECT fulltextserviceproperty('verifysignature') -- By default 1SELECT fulltextserviceproperty('loadosresources') -- By default 0

Troubleshooting PDF Filter Issues

image 
While I am on this topic and talking about IFilters, many of you might be using PDF filters to index PDF documents in your database. Adobe recently introduced a 64-bit PDF Filter (more about this here - https://www.adobe.com/support/downloads/detail.jsp?ftpID=4025), and I’ve seen many people having issues using that. If you look at the FTLogs during indexing you might have come across this error:

2009-09-29 06:31:59.47 spid24s Error '0x80004005' occurred during full-text index population for table or indexed view '[ifilter].[dbo].[fts_table]' (table or indexed view ID '2137058649', database ID '5'), full-text key value 0x2E00700064006600. Attempt will be made to reindex it.

2009-09-29 06:31:59.48 spid24s The component 'PDFFilter.dll' reported error while indexing. Component path 'C:\Program Files\Adobe\Adobe PDF iFilter 9 for 64-bit platforms\bin\PDFFilter.dll'.

These errors indicate issues with loading and using the adobe PDF Filter. So make sure your system PATH environment variable has the following location added in it: C:\Program Files\Adobe\Adobe PDF iFilter 9 for 64-bit platforms\bin. Note: this might need a restart

If this does not, then your SQL full text indexing will fail as it cannot find the filter for PDF 64-bit.

Also, there is an IFilter available from FoxiT that is a native 64-bit filter. You might want to test your indexing with that. Read more about it here - https://www.foxitsoftware.com/pdf/ifilter/

If you want to find out how to identify the full-text document/row that failed to index, you can follow these steps to isolate it:

1) Find the full-text key column

USE DBNAME
GO
select name from syscolumns where colid=objectproperty(object_id('TBL_NAME'),'TableFulltextKeyColumn') and id=object_id(' TBL_NAME ')

The full-text key is the column you used when you defined full-text index && it may not necessarily be the primary key of your full-text table.

2) Convert the value to full-text key value

select convert(varbinary(900),keyColumn) as keyColumn, uniqueColumn from TBL_NAME

3) Once you have this you can filter this to find the row that failed by looking at the FTLOG and noting down the failed full-text key value,

E.g.

select convert(varbinary(900),keyColumn) as keyColumn, uniqueColumn from TBL_NAME
where keyColumn = 0x2E00700064006600

Apart from PDF, a place to check the other default and custom IFilters available to SQL server, is to look under the following registry location HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndex ,under "DLLsToRegister"

Some important points to keep in mind

1. Images embedded within PDF documents are not indexed and SQL cannot search for them.

2. Adobe’s iFilter 6.0 is not compatible with Windows Server 2003 64 bit version.

3. Starting with Acrobat and Reader 7.0.5, iFilter functionality is now bundled within the Acrobat and Reader products.

4. Also PDFFilt 64-bit works with both SQL 2005 & SQL 2008. This has been tested!

5. Microsoft Filter pack contains the IFilter for the latest Office 2007 document types like docx, pptx, xlsx etc. You can get it here -

https://www.microsoft.com/downloads/details.aspx?FamilyId=60C92A37-719C-4077-B5C6-CAC34F4227CC&displaylang=en

6. For indexing MSG files, you need to download the IFilter from the Windows Desktop Search add-in. It’s available here - https://www.microsoft.com/downloads/details.aspx?FamilyId=134ECBB0-C162-4D07-BEF3-0B602C4A79DD&displaylang=en or install Office 2007 which comes with MSGFILT.dll

Happy Troubleshooting J

Sudarshan Narasimhan
Technical Lead, Microsoft SQL Server CSS