Word breakers and morphological analysis

When people think of the Natural Language Group at Microsoft, most associate our technology with the spelling and grammar checkers in Word. However, because spelling and grammar checking require the ability to tokenize and analyze input text, our team also builds tokenizers, which we refer to as “word breakers.” The first step in analyzing input text is to determine where one word ends and the next begins. In reality, the unit identified is not always a word but rather a linguistically meaningful unit. For example, an email address such as “joe@blah.com” could be treated as three words with some separating punctuation; as a meaningful unit, however, the address matters as a whole. In languages written without spaces, such as Japanese, word breaking is more complex because there are no spaces to mark word boundaries; instead, the context of character sequences must be analyzed to identify them.
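To make the email example concrete, here is a minimal sketch of a word breaker for space-delimited text. It is not how our word breakers are implemented; the pattern and the name break_words are assumptions made up for illustration, and the email pattern is deliberately simplistic.

```python
import re

# Hypothetical pattern: try an email-like unit first, then fall back to
# ordinary words, then to single punctuation marks.
TOKEN_PATTERN = re.compile(
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"  # email address kept as one unit
    r"|\w+(?:'\w+)?"                 # ordinary word, optional apostrophe part
    r"|[^\w\s]"                      # any other single non-space symbol
)


def break_words(text):
    """Return the meaningful units found in text, in order."""
    return TOKEN_PATTERN.findall(text)


def naive_break(text):
    """For comparison: split on words and punctuation only."""
    return re.findall(r"\w+|[^\w\s]", text)
```

With this sketch, naive_break("joe@blah.com") yields the three words plus the separating punctuation, while break_words("joe@blah.com") yields the address as a single unit. A pattern like this relies on spaces and punctuation, so it would not help with Japanese, where boundaries have to come from analyzing the character sequences themselves.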

At any rate, these word breakers are very important to Search, which needs to know which units or terms to index. On top of word breaking, Search can benefit from a further process: morphological analysis, which associates a particular form of a word with its related forms. For example, if a user searches for the phrase “coffee table”, there may be relevant pages that mention only “coffee tables”. It is morphological analysis that allows the word “table” to be associated with “tables”.
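As a rough illustration of why this matters for indexing, the sketch below builds a tiny index keyed on normalized terms. Everything in it is hypothetical: the normalize function is a crude plural-stripping stand-in for real morphological analysis, and splitting on whitespace stands in for a real word breaker.

```python
from collections import defaultdict


def normalize(term):
    """Crude stand-in for morphological analysis: strip a plural -s."""
    term = term.lower()
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term


def build_index(pages):
    """Map each normalized term to the ids of the pages that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.split():
            index[normalize(term)].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages that contain every normalized query term."""
    results = None
    for term in query.split():
        hits = index.get(normalize(term), set())
        results = hits if results is None else results & hits
    return results or set()


pages = {
    1: "oak coffee tables on sale",
    2: "coffee beans",
    3: "a glass coffee table",
}
index = build_index(pages)
matches = search(index, "coffee table")
```

Here matches contains pages 1 and 3: because both the indexed text and the query go through the same normalization, the page that mentions only “coffee tables” is still found.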

Morphological analysis typically refers to one of two processes. In the first, sometimes referred to as “lemmatization” or “stemming”, an input form is associated with a base form; for example, the input word “playing” is associated with the base form “play”. In the second, an input form is associated with its inflections; for example, if the input form is “play”, then the additional forms “plays”, “played”, and “playing” would be associated. In languages other than English, associating inflections can generate hundreds of forms, or even more (e.g., in Turkish a noun can have thousands of related inflections!).
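The two directions can be sketched with a toy lookup table. The lexicon and the function names lemmatize and inflect are assumptions for illustration only; a real analyzer applies language-specific rules and a far larger lexicon rather than a hand-written dictionary.

```python
# Hypothetical miniature lexicon: base form -> its inflected forms.
INFLECTIONS = {
    "play": ["plays", "played", "playing"],
    "table": ["tables"],
}

# Reverse map derived from the lexicon: any known form -> its base form.
BASE_FORMS = {
    form: base for base, forms in INFLECTIONS.items() for form in forms
}
BASE_FORMS.update({base: base for base in INFLECTIONS})


def lemmatize(form):
    """First process: map an input form to its base form."""
    form = form.lower()
    return BASE_FORMS.get(form, form)  # unknown words are returned unchanged


def inflect(base):
    """Second process: map a base form to itself plus its inflections."""
    base = base.lower()
    return [base] + INFLECTIONS.get(base, [])
```

In this sketch, lemmatize("playing") gives "play", and inflect("play") gives "play", "plays", "played", and "playing". For a language like Turkish, a table of this kind would be impractical given the number of forms per noun, which is one reason to generate the forms by rule instead.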

Just some food for thought about some of the things we get to think about.

-- Jay Waltmunson (Program Manager)