From the start of this semester I’ve been also working on what I call a “Name Normalization” Algorithm (or more generally “String Normalization”). What this is supposed to do is take an input of some number of strings and determine which ones are “the same”. This sounds like a simple problem, but it really isn’t. Take a look at Google’s Britney Search Archive and you can quickly realize that this problem is huge. The idea of the algorithm is to be able to take in a set of strings and pick out the ones that are the same. So, for example, if your name is Rich, but I search for Richie I get all results for Rich as well. Granted this is one of the easier cases to handle. There are some screwballs in there… for example, November and December only vary by 3 characters. So it’s not so easy to determine that these two strings aren’t in the same “class”.
In any case, I just finished about 1 hour ago the “Alpha Version 1.0” release. I tried it against the input from the Google Britney Search Archive, and I was able to get 588/592 matches without any problems. Actually those 4 that I missed were: breathny spears, betny spears, brethany spears, brethine spears. If you ask me, these are horrbily POOR spelling of Britney. Anyway, there is one parameter that I can tweak to include these 4 in my positive results too, so I think it’s a pretty good Alpha release. Oh, I forgot to add that on average, it takes 3 to 5 milli-seconds to do a comparison. Not too shabby.
Please note that this was in no way, shape, or form re-engineered. This was entirely derived from various readings online, and some theories/concepts that I was thinking about. I justed used the Google data as a baseline check.
To return to the point… why did I call this “String Normalization”. Well, actually the fundamental idea of this is normalizing functions on a graph. Here, you want to go through a whole bunch of strings (ie: functions) and normalize them to one common string (ie: one common axis spacing and min/max, typically [0,1]).