I recently received this question about Fuzzy Lookup behavior (paraphrased):
We are seeing something very strange within the Fuzzy Lookup component. When you have different ref tables, both containing the same value, but one having a lot more data, you get different similarity scores. So if the name “Jo Bloggs” comes in and you compare it with a ref table that only has “Jo Bloggs”, and a ref table that has two million other names as well, the scores will be different.
I’m not an expert on the Fuzzy components, but this didn’t sound all that odd to me. Because the component looks at the entire reference set, it makes sense that the contents of that set could affect the final similarity score. I asked our expert in Microsoft Research, Kris Ganjam, for a definitive answer, and this is what he said:
Fuzzy Lookup gives weight to each word based upon how frequently that word occurs in the reference table. Frequent words are given lower weight. This allows "Microsoft Corporation" to be close to "Microsoft Corp" and far from "Boeing Corporation". It uses Inverse Document Frequency (IDF) weighting which is standard in information retrieval and which lies at the heart of most search engines.
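To make the effect concrete, here is a small Python sketch of IDF-weighted matching. It is not the Fuzzy Lookup implementation, which is not described here in detail. The smoothed IDF formula, the weighted Jaccard score, the treatment of unseen words, and the names `build_idf` and `similarity` are all assumptions chosen for illustration, as are the "Acme" and "Contoso" rows.

```python
import math
from collections import Counter


def build_idf(reference_rows):
    """Return a function mapping a token to a smoothed IDF weight.

    Illustrative only: the smoothing formula is an assumption, not the
    one Fuzzy Lookup uses.
    """
    n = len(reference_rows)
    doc_freq = Counter()
    for row in reference_rows:
        # Count each token once per row (document frequency).
        doc_freq.update(set(row.lower().split()))

    def weight(token):
        # Tokens absent from the reference table are treated as if they
        # occurred once (an assumption made for this sketch).
        return math.log(1 + n / max(doc_freq[token], 1))

    return weight


def similarity(a, b, weight):
    """IDF-weighted Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = sum(weight(t) for t in ta | tb)
    shared = sum(weight(t) for t in ta & tb)
    return shared / union if union else 0.0


small_ref = ["Microsoft Corporation"]
large_ref = small_ref + [
    "Boeing Corporation",
    "Acme Corporation",
    "Contoso Corporation",
]

w_small = build_idf(small_ref)
w_large = build_idf(large_ref)

# Same input, same candidate row, different reference tables.
score_small = similarity("Microsoft Corp", "Microsoft Corporation", w_small)
score_large = similarity("Microsoft Corp", "Microsoft Corporation", w_large)

# A pair that shares only the frequent word "Corporation".
score_far = similarity("Boeing Corporation", "Microsoft Corporation", w_large)

print(score_small, score_large, score_far)
```

In this toy model, "Corporation" appears in every row of the larger table, so it carries less weight than "Microsoft". The same pair of strings therefore scores differently against the two tables, and sharing the rare word counts for more than sharing the common one.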