Improve User Experience in Enterprise Search Step By Step - Part IV - Relevancy Tuning by WordBreaker

We have been talking about XSL/XML for a long time. Now let's give relevancy a shot.

Hey, relevancy?

Yes, relevancy.

Relevancy is the most important thing for a search engine. In most cases it matters more than the number of pages crawled and more than how often the results are updated, because users always look at the top results.

Relevancy is a very complex problem. It is affected by many factors, and it behaves quite differently from one language to another. In this article, we will take English and Chinese as examples. (I know a little German and Japanese as well but ...)

In this series we will go through the wordbreaker, weighting, and other useful stuff. Because I'm at a karaoke party right now, I cannot describe everything in detail. I assume you already know how to deal with Best Bets and the "Did you mean" feature. If you want more information, please read Luca Bandinelli's multilingual whitepaper.

Wordbreaker

The wordbreaker is the first thing to look at if you hate a search engine. Although word breaking in languages written in the Latin alphabet (English, Spanish, French, German, Dutch...) is much easier than in other languages (Chinese, Japanese, Korean, Arabic...), it's still a tedious thing to deal with.

MOSS/MSS comes with many wordbreakers, but sometimes you may not be very satisfied with them. Is there any 3rd party wordbreaker you can use? Yes, some Microsoft partners have quite a long history of delivering word breaking technology for production use. For example, Hylanda is a leading Chinese word breaking technology company. They built a pretty good Chinese wordbreaker for SQL Server 2000. Since we didn't change the wordbreaker interface even in MOSS/MSS, it can be used directly here.

To change a wordbreaker, you need to do the following things.

a. Register the wordbreaker.

This depends on the wordbreaker's installation manual :). But generally speaking, it should be something like this:

regsvr32 YourWordBreaker.dll

b. Get the GUID string of your new wordbreaker. We will need it in the next step.

Search the registry for your wordbreaker DLL's name, and you will find a key located under the CLSID branch. For example:

HKEY_LOCAL_MACHINE\SOFTWARE\Classes\CLSID\{4474fffd-da87-4116-9be9-874939d2bd04}

Copy this GUID string for later use.
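If you would rather not search by hand, here is a rough Python sketch of the same lookup. It assumes the wordbreaker registers itself as an in-process COM server, so that its DLL path sits in the default value of an InprocServer32 subkey under the CLSID. That is the usual COM layout, but check it against your own wordbreaker. The function names are my own, and YourWordBreaker.dll is the same placeholder as above.

```python
import sys
from typing import Iterable, Iterator, List, Tuple


def iter_inproc_servers() -> Iterator[Tuple[str, str]]:
    """Yield (clsid, dll_path) pairs from HKLM\\SOFTWARE\\Classes\\CLSID.

    Windows only. Assumes the DLL path is the default value of the
    InprocServer32 subkey (the usual COM layout).
    """
    import winreg  # imported here so the rest of the file loads anywhere

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, r"SOFTWARE\Classes\CLSID") as root:
        index = 0
        while True:
            try:
                clsid = winreg.EnumKey(root, index)
            except OSError:
                break  # no more subkeys
            index += 1
            try:
                dll_path = winreg.QueryValue(root, clsid + r"\InprocServer32")
            except OSError:
                continue  # this CLSID has no InprocServer32 subkey
            yield clsid, dll_path


def find_clsids(entries: Iterable[Tuple[str, str]], dll_name: str) -> List[str]:
    """Return every CLSID whose registered path mentions dll_name (case-insensitive)."""
    needle = dll_name.lower()
    return [clsid for clsid, path in entries if needle in path.lower()]


if __name__ == "__main__" and sys.platform == "win32":
    # Placeholder DLL name; replace it with your wordbreaker's file name.
    for clsid in find_clsids(iter_inproc_servers(), "YourWordBreaker.dll"):
        print(clsid)
```

A DLL can register more than one class, so you may get several GUIDs back. The installation manual should tell you which one is the wordbreaker.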

c. Navigate to the registry branch of your language. Replace the values there with those of your wordbreaker.

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup\ContentIndexCommon\LanguageResources\Default\YourLanguage

WBDLLPathOverride is the path of your wordbreaker DLL. In my case, my wordbreaker is located at C:\Hylanda\HlChsBrKr.dll

WBreakerClass is the GUID string you just got.
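If you have to repeat this on several servers, you could generate a .reg file instead of editing by hand. The following is only a sketch: it assumes both values are plain string (REG_SZ) values, which you should confirm in regedit first, and build_reg_file and reg_escape are names I made up. Export the original key as a backup before importing anything.

```python
BASE_KEY = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search"
    r"\Setup\ContentIndexCommon\LanguageResources\Default"
)


def reg_escape(value: str) -> str:
    """Escape backslashes and double quotes as the .reg string format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_reg_file(language: str, dll_path: str, clsid: str) -> str:
    """Return .reg file text that sets the two wordbreaker override values.

    Assumption: both values are REG_SZ strings.
    """
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{BASE_KEY}\\{language}]",
        f'"WBDLLPathOverride"="{reg_escape(dll_path)}"',
        f'"WBreakerClass"="{reg_escape(clsid)}"',
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # "YourLanguage" is the placeholder from the key path above.
    print(build_reg_file(
        "YourLanguage",
        r"C:\Hylanda\HlChsBrKr.dll",
        "{4474fffd-da87-4116-9be9-874939d2bd04}",
    ))
```

Save the output to a file with a .reg extension and review it before you import it.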


Don't forget to restart your search service with net stop osearch followed by net start osearch. Then do a FULL crawl of all the content sources. If the wordbreaker that did the crawl does not match the newly installed wordbreaker, you will get bad search results, because the query is broken at query time by the new wordbreaker and its terms will not line up with the terms already in the index.
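To see why the full crawl matters, here is a toy Python sketch. The two "wordbreakers" are made-up stand-ins (one splits into single characters, the other into character pairs) and have nothing to do with the real MOSS or Hylanda wordbreakers. The only point is that the index and the query must be broken the same way.

```python
from typing import Callable, Dict, List, Set

Breaker = Callable[[str], List[str]]


def unigram_breaker(text: str) -> List[str]:
    """Toy 'old' wordbreaker: every non-space character is a term."""
    return [ch for ch in text if not ch.isspace()]


def bigram_breaker(text: str) -> List[str]:
    """Toy 'new' wordbreaker: every pair of adjacent characters is a term."""
    chars = [ch for ch in text if not ch.isspace()]
    pairs = ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]
    return pairs or chars  # a single character stays as it is


def build_index(docs: Dict[int, str], breaker: Breaker) -> Dict[str, Set[int]]:
    """Crawl time: break each document and record which documents hold each term."""
    index: Dict[str, Set[int]] = {}
    for doc_id, text in docs.items():
        for term in breaker(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str, breaker: Breaker) -> Set[int]:
    """Query time: break the query and return documents containing every term."""
    terms = breaker(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {1: "企业搜索", 2: "搜索引擎"}

old_index = build_index(docs, unigram_breaker)       # crawled with the old breaker
print(search(old_index, "搜索", unigram_breaker))     # old breaker on both sides
print(search(old_index, "搜索", bigram_breaker))      # mismatch: nothing is found

new_index = build_index(docs, bigram_breaker)        # after a full re-crawl
print(search(new_index, "搜索", bigram_breaker))      # both documents match again
```

In the mismatched case the query term simply does not exist in the old index, so the search comes back empty even though both documents contain the text.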

Sorry, I can't post images of the search results because they are still in testing. But I can tell you the improvement is HUGE.