Creating Custom Dictionaries for special terms to be indexed 'as-is' in SQL Server 2008 Full-Text Indexes

 

Did You Know? SQL Server 2008 provides the ability to create a custom dictionary of special terms to be indexed 'as-is' (e.g., 'AT&T') in the full-text index (FTIndex).

 

Many times you find yourself in a situation where you would like to index a given word or term 'as-is' in your full-text index. However, due to the specific tokenization behavior of a given word breaker (WB), the term gets split into several individual terms. This can cause recall issues at query time.

 

For instance, suppose your data contains the term “ISBN-7”. Internally, most (if not all) word breakers will tokenize this into 'ISBN' and '7' separately, and these two terms are persisted in the FTIndex instead of the single original term “ISBN-7”. Moreover, because '7' is likely a default noise word for most languages (single-digit numbers are typically treated as noise words), the FTIndex ends up indexing only the term 'ISBN'.
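
You can observe this splitting directly with sys.dm_fts_parser, the same DMV used later in this post (a quick sketch, here using the English word breaker, LCID 1033; the exact terms returned depend on the word breaker version and the parser parameters):

select display_term from sys.dm_fts_parser('ISBN-7', 1033, 0, 0)
-- You should see 'ISBN' and '7' come back as separate terms (possibly along with
-- generated alternate forms), rather than the single term 'ISBN-7'.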

Later, at query time, when you search for 'ISBN-7' using any of the available predicates (e.g., CONTAINS) or table-valued functions (e.g., CONTAINSTABLE), the same process occurs and the query is internally translated into a query for 'ISBN' only. You will then receive all rows/documents containing 'ISBN', not only the ones you cared about initially: those containing 'ISBN-7'.
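
To make the recall problem concrete, here is a minimal sketch. It assumes a hypothetical table dbo.Books with a full-text index on its [Title] column; neither the table nor the column comes from the example above.

select [Title]
from dbo.Books
where CONTAINS([Title], '"ISBN-7"')
-- The word breaker splits "ISBN-7" before the search runs, and with '7' treated as a noise word
-- the query effectively searches for 'ISBN' alone, so rows mentioning only 'ISBN' also come back.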

 

Disabling the stoplist for a given FTIndex solves the missing-results part of the problem, but it produces larger FTIndexes (with the corresponding performance implications) and still does not solve the false-positive problem: you will still find documents containing 'ISBN' and '7' somewhere in the corpus of the document, but not necessarily together (next to each other) as you originally wanted.
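
For reference, turning the stoplist off is a single statement in SQL Server 2008 (again a sketch against the hypothetical dbo.Books table):

alter fulltext index on dbo.Books set stoplist off
-- Noise words are no longer filtered for this index: the missing-results symptom goes away,
-- but the index grows and the false positives described above remain.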

This is a common problem when querying for email addresses, for example, where the '@' character serves as a separator. In this case you will find documents containing the exact email address you wanted, but also documents containing both parts of the address separated by some other special character different from '@', leading to false positives again. There are other variants of this behavior that can lead to false positives as well. While this behavior is expected, as you can see, it can lead to undesired results.
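
You can check how your word breaker handles an address with the same DMV (a sketch; someone@example.com is just a placeholder address, and the exact terms returned depend on the word breaker version):

select display_term from sys.dm_fts_parser('someone@example.com', 1033, 0, 0)
-- Expect the address to come back broken at the special characters rather than as one term.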

 

To prevent this, the new word breaker family in SQL Server 2008 introduces the ability to list a set of words/terms that should not be tokenized by a given word breaker, thus forcing them to be indexed as they are. These lists (called custom dictionaries) are simple text files, one associated with each language supported by SQL Server Full-Text Search.

 

Here are the details on how to accomplish this:

 

(Reference: https://technet.microsoft.com/en-us/library/cc263242.aspx)

 

A custom dictionary is a Unicode-encoded file that can be used to specify words that you want the word breaker of the same language to consider as complete words. Custom dictionaries are not provided by default. To modify the word breaker behavior for more than one language, you must create a separate custom dictionary for each language whose word breaker behavior you want to change. You cannot create a custom dictionary for the language-neutral word breaker.

Example created by Venkat, one of our team testers:

Consider the term “34590-97-8”. When this is passed to the German word breaker (for instance), the output consists of several terms:

select [display_term] from sys.dm_fts_parser('34590-97-8', 1031,0,0)

display_term

34590

nn34590

97

nn97

8

nn8

The full output (all columns) can be seen using:

select * from sys.dm_fts_parser('34590-97-8', 1031,0,0)

If you need to keep these numbers together for business reasons (e.g., this is a chemical code, an account number, a phone number, etc.), you then need to create a custom dictionary, in this case for the German language. Here are the steps:

1. Log on to the machine on which SQL Server is installed as a member of the Administrators group.

2. Start Notepad and type the words you want in your custom dictionary. Be sure to avoid invalid entries as described in the reference link included earlier. Remember that each word must be on a separate line, terminated by a carriage return (CR) and line feed (LF). (A sample of the file contents is shown after these steps.)

3. On the File menu, click Save As.

4. In the Save as type list, select All Files.

5. In the Encoding list, select Unicode.

6. In the File name box, type the file name in the following format: CustomNNNN.lex, where NNNN is the hexadecimal language code of the language for which you are creating the custom dictionary (note that this is a hexadecimal code, not a decimal LCID such as 1033 for English). See Table 1 in the document referenced earlier in this post for the list of valid file names for supported languages and dialects.

7. In the Save in list, navigate to the folder that contains the word breakers. By default, this is the SQL Server Binn directory, for example C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\Binn.

8. Click Save.

9. Restart the full-text filter daemon hosts by executing “exec sp_fulltext_service 'restart_all_fdhosts'” on the SQL Server instance.
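
As mentioned in step 2, the dictionary file itself is trivial: just the terms you want kept intact, one per line, saved as Unicode. For this example the file could contain the following (34590-97-8 is the term from the German example; the other two lines are simply the terms used earlier in this post, shown only to illustrate the format):

34590-97-8
AT&T
ISBN-7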

With the custom dictionary in place, the word breaker now treats the term as a single word:

select [display_term] from sys.dm_fts_parser('34590-97-8', 1031,0,0)

display_term

34590-97-8

Currently, these text files support only a plain list of words; more complex expressions, such as regular expressions or wildcard patterns (e.g., ISBN-*), are not supported.

Important Note: This will only work with the new word breakers shipped with SQL Server 2008, not with the older word breakers left unchanged in the 2008 release. These new word breakers also ship with the operating system on Vista+ operating systems; their DLLs have names such as NaturalLanguage6.dll, NLS*.dll, etc. A notable exception is English: the English word breaker shipped with SQL Server 2008 is identical to the 2005 version, so it does not belong to the new category of word breakers capable of supporting a custom dictionary. To use a custom dictionary for English in SQL Server 2008, you need to change the registry registration so that the English word breaker present in your OS (only if it is Vista+) is the one used by your SQL Server instance from that moment on. Please follow this article on how to change this registration for a given word breaker or IFilter:

https://msdn.microsoft.com/en-us/library/ms345188.aspx

The above article shows how to do this for Danish, Polish, etc. To do it for the new English word breaker, you need the corresponding values for the new English word breaker and stemmer. The word breaker files to copy to your SQL Server Binn directory are listed below; on Vista+ operating systems they can be found in your windows\system32 folder.

1. NlsData0009.dll

2. NlsLexicons0009.dll

3. NlsGrammars0009.dll

Other registry values needed:

TsaurusFile (String value): tsEnu.xml

Locale (DWORD value): 00000409

WBreakerClass (String value): {9DAA54E8-CD95-4107-8E7F-BA3F24732D95}

StemmerClass (String value): {61A48126-EF74-4d4a-9DDA-43FD542CAD1E}
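
After copying the DLLs and setting the registry values above, one way to pick up and sanity-check the change is sketched below (these are standard SQL Server 2008 procedures; the exact rows returned depend on what is registered on your instance):

exec sp_fulltext_service 'update_languages'            -- refresh the list of registered full-text languages
exec sp_fulltext_service 'restart_all_fdhosts'         -- restart the filter daemon hosts so the new components load
exec sp_help_fulltext_system_components 'wordbreaker'  -- list the word breaker CLSIDs now registered per language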

 

 

We hope this information is useful to you. We plan to include it in the official Books Online (BOL) documentation in upcoming SQL Server 2008 service packs, as well as in our next major release.

 

Regards!