Stemming Functionality in Full Text Search using CONTAINS

Many a times you may have encountered issues related to Full Text Search functionality in SQL Server wherein you won’t be able to search words with different verb forms. These different verb forms are also called as Stemmers. For a given language, a stemmer generates inflectional forms of a particular word based on the rules of that language. Stemmers are language specific. When you use Full Text Search, most of the times you would use CONTAINS predicate to search for the words. Contains is a predicate used to search columns containing character-based data types for precise or fuzzy (less precise) matches to single words and phrases, the proximity of words within a certain distance of one another, or weighted matches. Now, the problem is whenever you try to do a search for a particular word SQL should have resolved the different forms of that word but practically which is not the case. You would think that there is a problem in the Full Text Engine and try to rebuild the Full Text catalog but still encounter the same issue. However, everything works well when you try using LIKE operator. So what is the problem? You would think as if this is an issue with your SQL Engine or Full Text catalog or the language that you are using for the full text indices but still you have no luck!

I also encountered similar kind of issues, did some research work and got explanations related to this problem.

The query I was using was like the following:

SELECT * FROM TableName WHERE CONTAINS (ColumnName, ' "word" ') GO

Scenario:

In the table I had word "CHAINS" and whereas I was searching for word "CHAIN"! So, whenever I was searching for CHAIN it is not giving me expected result. I recreated the problem on sample database. I created a FT Index on Production.Product table on Name column.

Ran following command:

SELECT Name FROM Production.Product WHERE CONTAINS(Name, ' "Chain*" '); GO

Results:

Name

--------------------------------------------------

Chains

(1 row(s) affected)

--This result was expected. Remember here I am using a "simple_term" along with "*" to define a phrase to for search condition.

NOTE: We are getting the expected result because "*" specifies that all columns in the table registered for full-text searching should be used to search for the given contains search condition and <simple_term> specifies a match for an exact word or a phrase.

Then I ran following command:

SELECT Name FROM Production.Product WHERE CONTAINS(Name, ' "Chain" ') -------- Removing * GO

Results:

Name

--------------------------------------------------

(0 row(s) affected)

Why this result?? Are you really expecting an output after running the above statement??

So, the point is why FT engine is not using Stemming at all if we don't use "*" with simple_term in CONTAINS predicate. So, the explanation of this behaviour is that if you want to use CONTAINS predicate and want to make use of language dependent stemmer, you should use a parameter called INFLECTIONAL in the CONTAINS predicate. INFLECTIONAL would cause Full Text Engine to use language specific stemmer for the specified simple terms. Stemmer behaviour is defined based on stemming rules of each specific language. The neutral language does not have an associated stemmer. The column language of the column(s) being queried is used to refer to the desired stemmer. If language_term is specified, the stemmer corresponding to that language is used.

Now, if I modify my query like the following:

SELECT Name FROM Production.Product WHERE CONTAINS(Name, 'FORMSOF (INFLECTIONAL, Chain)'); GO

Results:

Name
--------------------------------------------------
Chains

(1 row(s) affected)

This is irrespective of whether I use asterisk "*" or simple_term. This is the result that I want!! But this result will not distinguish the difference between the use of INFLECTIONAL parameter and asterisk "*" with simple_term. So, you would obviously think that why should I use INFLECTIONAL when I am getting the same result using "*"? To verify my above theory, I tried one more test. I added 3 more rows in the table where Name= Ran\Run\Running. The intention behind this test is to find whether INFLECTIONAL is able to search the verb forms of the word as well or not and whether we get the same results if we use asterisk "*" with simple_term.

SELECT Name FROM Production.Product WHERE CONTAINS(Name, '"Run*"'); GO

Results:

Name

--------------------------------------------------

Run

Running

(2 row(s) affected)

--This is an expected result because we don't expect Full Text to give "Ran" as one of the outputs.

Then used following:

SELECT Name FROM Production.Product WHERE CONTAINS(Name, 'FORMSOF (INFLECTIONAL, Run)'); GO

Name

--------------------------------------------------

Ran

Run

Running

(3 row(s) affected)

WOW!! I got the verb forms for the word Run!!

So, here comes the difference between "*" with simple_term and INFLECTIONAL. When we use INFLECTIONAL we are we are actually searching the Verb and Noun forms of the word we are searching and hence utilizing the correct functionality of the Stemmer. Whereas when we use a given <simple_term> within a "*" will not match both nouns and verbs.

So, here INFLECTIONAL comes to your rescue :)

Another piece of information that you may be interested in would be related to usage of FREETEXT predicate.

FREETEXT is a predicate used to search columns containing character-based data types for values that match the meaning and not the exact wording of the words in the search condition. When FREETEXT is used, the full-text query engine internally performs the following actions on the freetext_string, assigns each term a weight, and then finds the matches"

1. Separates the string into individual words based on word boundaries (word-breaking).

2. Generates inflectional forms of the words (stemming).

3. Identifies a list of expansions or replacements for the terms based on matches in the thesaurus.

So, FREETEXT predicate will give you the same stemming functionality as that of INFLECTIONAL in CONTAINS predicate. If you use the above example using FREETEXT you will see same output as that of CONTAINS:

SELECT Name FROM Production.Product WHERE FREETEXT (Name, 'Run');
GO

Results:

Name

--------------------------------------------------

Ran

Run

Running

(3 row(s) affected)

Now, you may argue why one should use INFLECTIONAL in CONTAINS predicate as against FREETEXT?? So, the answer to this is: Full-text queries using FREETEXT are less precise than those full-text queries using CONTAINS. The SQL Server full-text search engine identifies important words and phrases. No special meaning is given to any of the reserved keywords or wildcard characters that typically have meaning when specified in the <contains_search_condition> parameter of the CONTAINS predicate.

Also, remember that stemming is fully dependent on language you have set on the column on which you have created the full text index [Language for Word Breaker]. This means if you have a Korean as a Language for Word Breaker for a column, the stemming would depend on Korean language and not on any other language.

For example I changed the Language for Word Breaker to Korean from English in the above example and the result of the query is:

SELECT Name FROM Production.Product WHERE CONTAINS(Name, 'FORMSOF (INFLECTIONAL, Run)'); GO

Result:

Name

--------------------------------------------------

Run

(1 row(s) affected)

Important point to notice here is to wait until the Full Population of the FT Index is completed after you make changes to the properties index. Else, you might see the same result that you may be expecting.

PS: When you use Neutral Language word-breaker as opposed to English (UK/US), then it will give you a different result set. You will see different items showing up in the FT output as Neutral does not understand Grammar – so there is no cause for alarm, its expected behavior.

Another thing that could affect the output is whether the catalog is accent-sensitive or insensitive. Class example would be: cafe vs. café

select fulltextcatalogproperty('Catalog_Name','AccentSensitivity')

I hope this would give some relief to you if you were stuck on a situation similar to this.

Sumit Sarabai
SE, Microsoft Sql Server Reviewed by Mukesh Nanda
TL,Microsoft Sql Server

&

Sudarshan Narshiman
TL,Microsoft Sql server

 

Did this Blog help to resolve your issue.(opinion)