Part 1 - SQL Full-Text Search and International Characters: Interesting Scenarios, Solutions and a Deep-Dive Using DMVs

As part of my work I come across varied scenarios that customers encounter with SQL Server. I recently had the opportunity to work on some interesting scenarios involving SQL full-text search and international (non-English) characters. To most people, the concept of word-breaking, especially with non-English characters, is still a mystery. In this blog series, I will cover 3 scenarios where I was able to use some of the cool new features in SQL Server 2008/R2 to troubleshoot a user scenario. You might have run into something similar, so I thought I'd post it here to clear up any questions you have and hopefully impart some wisdom along the way. :)

So here goes …

Scenario1 – Chinese words

Mr Customer: Hey, I have this problem with Chinese words not getting properly broken down in SQL Server 2008.
Me: Hmm, that sounds interesting. Chinese, did you say?
Mr Customer: Yes. Here is what I see happening:

1. The Chinese word 网络 is being treated by the full-text indexer as a single word. When I perform a full-text search for the characters '网' and '络' individually, it does not return the above entry, but searching for '网络' together does.

 

Here is the XML where this is stored,
<LocalizedText xml:lang="zh-cn">网络</LocalizedText>

 

2. However, this word 子网 seems to be treated by the full-text indexer as two separate words. So when I do a search for either of the following characters individually, I get the row returned:

子
网

Here is the XML where this is stored,
<LocalizedText xml:lang="zh-cn">子网</LocalizedText>

I went one step further and consulted a Chinese-speaking friend of mine; he says this is a bug and is of the opinion that the word 子网 should also be treated as a single word, not two separate words.

Me: Still more interesting. I did some poking around and here are the steps I took to figure it out.

I am a stranger to Chinese, so I translated these Chinese words into English with the help of Bing Translator to understand what they mean.

网络 → Network
子网 → Subnet

Okay, now I’m a little better off knowing what these words mean. To test whether the word-breaking and indexing are happening correctly, I set up a repro.

-- Step 1. Table creation. The column must be Unicode, so I am using NVARCHAR.
CREATE TABLE [dbo].[chinese](
       [id] [int] NOT NULL,
       [data] [nvarchar](50) NULL,
CONSTRAINT [PK_chinese] PRIMARY KEY CLUSTERED
(
       [id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE FULLTEXT CATALOG [chinese]
IN PATH N'D:\'
WITH ACCENT_SENSITIVITY = ON
AUTHORIZATION [dbo]
GO

CREATE FULLTEXT INDEX ON [dbo].[chinese](
[data] LANGUAGE [Traditional Chinese])
KEY INDEX [PK_chinese] ON [chinese]
WITH CHANGE_TRACKING AUTO
GO

-- Step 2. Insert 2 rows. You must use the N prefix (Unicode string literal) when inserting the data.
insert into dbo.chinese (id, data) values (1, N'网络')
insert into dbo.chinese (id, data) values (2, N'子网')

 

-- Step 3. Run full-text queries using the CONTAINS predicate

-- 1st query: search for the word 网 (Net)
select * from chinese where contains (data, '"网"')
-- 2 rows returned

-- 2nd query: search for the word 络 (Work)
select * from chinese where contains (data, '"络"')
-- 1 row returned

-- 3rd query: search for the word 子 (Sub)
select * from chinese where contains (data, '"子"')
-- 1 row returned
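For completeness, you can also run the two-character phrases themselves through CONTAINS. This is just a quick sketch against the same repro table; based on what the customer reported and the word-breaking shown above, you would expect each phrase to match only the row it came from.

-- Sketch: phrase searches for the full two-character words
select * from chinese where contains (data, N'"网络"')   -- you would expect row 1 only
select * from chinese where contains (data, N'"子网"')   -- you would expect row 2 only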

So it appears that the Chinese word-breaker is indeed working properly. I was testing this on SQL Server 2008 SP1, and my word-breaking language was Traditional Chinese.

Next, I tested this using Simplified Chinese as the word-breaking language, since zh-cn was specified in the XML, and bingo, the word-breaking issue does indeed reproduce.
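If you want to repeat that step on the repro above, one simple way to switch the word-breaking language is to drop and recreate the full-text index with the Simplified Chinese language. A minimal sketch (change tracking will repopulate the index automatically):

-- Sketch: switch the repro to the Simplified Chinese word-breaker by
-- dropping and recreating the full-text index with the new language.
DROP FULLTEXT INDEX ON [dbo].[chinese]
GO
CREATE FULLTEXT INDEX ON [dbo].[chinese](
[data] LANGUAGE [Simplified Chinese])
KEY INDEX [PK_chinese] ON [chinese]
WITH CHANGE_TRACKING AUTO
GO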

Okay, what next? Fear not, SQL Server 2008 has very good DMV support for full-text search! To isolate this further, I checked the full-text index to see how these characters were being stored.

Simplified Chinese:

SELECT * FROM sys.dm_fts_index_keywords(db_id('PUBS'), object_id('dbo.Chinese'))

keyword       display_term    column_id    document_count
-----------   -------------   ----------   --------------
0x5B50        子              2            1
0x7F51        网              2            1
0x7F517EDC    网络            2            1
0xFF          END OF FILE     2            2
 

Traditional Chinese:

SELECT * FROM sys.dm_fts_index_keywords(db_id('PUBS'), object_id('dbo.Chinese'))

keyword       display_term    column_id    document_count
-----------   -------------   ----------   --------------
0x5B50        子              2            1
0x7EDC        络              2            1
0x7F51        网              2            2
0xFF          END OF FILE     2            2
 

If you look closely at the above two outputs, the word-breaking is definitely different in Simplified vs. Traditional. Notice how, under Simplified Chinese, 网络 is kept as a single token in display_term while 子网 is broken into the two separate tokens 子 and 网; under Traditional Chinese, both words are broken into their individual characters.
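If you want to tie each token back to the row it came from, there is a sister DMV, sys.dm_fts_index_keywords_by_document, which returns the keywords per indexed document. A quick sketch, assuming the repro above (the full-text key is the integer [id] column, so document_id should map straight back to it):

-- Sketch: map each indexed token back to the row that produced it.
SELECT k.display_term, k.document_id, c.data, k.occurrence_count
FROM sys.dm_fts_index_keywords_by_document(db_id('PUBS'), object_id('dbo.Chinese')) AS k
INNER JOIN dbo.chinese AS c ON c.id = k.document_id
ORDER BY k.display_term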

At first glance this looks like a word-breaker issue in ChsBrkr.dll, the Simplified Chinese word-breaker, since it breaks these words differently from the Traditional Chinese one. It is not a bug: this is the way the words are supposed to be broken (by design!). Why do I say this?
Because for the word 子网 (Subnet), it does make sense that it can be broken down into 子 (Sub) and 网 (Net), each of which carries meaning on its own, while 网络 (Network) is kept as a single word.

Word-Breaking
Keep in mind, word-breaking is very much a language-specific behavior. Each language has its own grammar, punctuation and so on, and these are key factors in word-breaking. The difference in behavior observed here is due to the difference in grammar between Traditional Chinese and Simplified Chinese. After all, SQL Server uses the same natural-language word-breakers that are used by Windows and Office.
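By the way, on SQL Server 2008 and later you don't even need a table to see how a word-breaker tokenizes a string: the sys.dm_fts_parser DMV (sysadmin only) shows the word-breaker output directly. A quick sketch comparing Simplified Chinese (LCID 2052) and Traditional Chinese (LCID 1028) for our two words:

-- Sketch: feed the same phrase to both Chinese word-breakers and compare the tokens.
-- Parameters: full-text query string, LCID, stoplist_id, accent_sensitivity (0s used here).
SELECT display_term FROM sys.dm_fts_parser(N'"网络 子网"', 2052, 0, 0)  -- Simplified Chinese
SELECT display_term FROM sys.dm_fts_parser(N'"网络 子网"', 1028, 0, 0)  -- Traditional Chinese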

While I’m on the topic of word-breaking, here is a sample query that you can use to identify the word-breaking language for each full-text indexed column in each table of a particular database. (You can use sp_msforeachdb to get this working for all databases in one go; see the sketch after the query.)

SELECT tbl.object_id AS [ObjectID],
       tbl.name      AS [TableName],
       col.name      AS [ColumnName],
       sl.name       AS [WordBreaker_Language],
       sl.lcid       AS [LCID]
FROM sys.tables AS tbl
INNER JOIN sys.fulltext_indexes       AS fti  ON fti.object_id  = tbl.object_id
INNER JOIN sys.fulltext_index_columns AS icol ON icol.object_id = fti.object_id
INNER JOIN sys.columns                AS col  ON col.object_id  = icol.object_id
                                             AND col.column_id  = icol.column_id
INNER JOIN sys.fulltext_languages     AS sl   ON sl.lcid        = icol.language_id
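And since I mentioned sp_msforeachdb, here is a rough sketch of how you could wrap the same query to run it against every database in one go (sp_MSforeachdb is an undocumented but commonly used procedure; the '?' placeholder is replaced with each database name):

-- Sketch: run the word-breaker language query in every database on the instance.
EXEC sp_MSforeachdb N'USE [?];
SELECT DB_NAME()  AS [DatabaseName],
       tbl.name   AS [TableName],
       col.name   AS [ColumnName],
       sl.name    AS [WordBreaker_Language],
       sl.lcid    AS [LCID]
FROM sys.tables AS tbl
INNER JOIN sys.fulltext_indexes       AS fti  ON fti.object_id  = tbl.object_id
INNER JOIN sys.fulltext_index_columns AS icol ON icol.object_id = fti.object_id
INNER JOIN sys.columns                AS col  ON col.object_id  = icol.object_id
                                             AND col.column_id  = icol.column_id
INNER JOIN sys.fulltext_languages     AS sl   ON sl.lcid        = icol.language_id;'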

To conclude - Using the word-breaking DMVs, we were able to get a much clearer picture of "why" the full-text engine was behaving the way it was, and we were able to see even the smallest differences in word-breaking when the word-breaker language changed between Simplified and Traditional.

In my next post I will talk about a similar user scenario that I came across, only that one involved Germanic characters, as in German/Swedish/Norwegian. Stay tuned for more....