Another project…

From the start of this semester I've been also working on what I call a "Name Normalization" Algorithm (or more generally "String Normalization"). What this is supposed to do is take an input of some number of strings and determine which ones are "the same". This sounds like a simple problem, but it really isn't. Take a look at Google's Britney Search Archive and you can quickly realize that this problem is huge. The idea of the algorithm is to be able to take in a set of strings and pick out the ones that are the same. So, for example, if your name is Rich, but I search for Richie I get all results for Rich as well. Granted this is one of the easier cases to handle. There are some screwballs in there... for example, November and December only vary by 3 characters. So it's not so easy to determine that these two strings aren't in the same "class".

In any case, I just finished about 1 hour ago the "Alpha Version 1.0" release. I tried it against the input from the Google Britney Search Archive, and I was able to get 588/592 matches without any problems. Actually those 4 that I missed were: breathny spears, betny spears, brethany spears, brethine spears. If you ask me, these are horrbily POOR spelling of Britney. Anyway, there is one parameter that I can tweak to include these 4 in my positive results too, so I think it's a pretty good Alpha release. Oh, I forgot to add that on average, it takes 3 to 5 milli-seconds to do a comparison. Not too shabby.

Please note that this was in no way, shape, or form re-engineered. This was entirely derived from various readings online, and some theories/concepts that I was thinking about. I justed used the Google data as a baseline check.

To return to the point... why did I call this "String Normalization". Well, actually the fundamental idea of this is normalizing functions on a graph. Here, you want to go through a whole bunch of strings (ie: functions) and normalize them to one common string (ie: one common axis spacing and min/max, typically [0,1]).

Comments (7)
  1. Andrej Kyselica says:

    Any chance you’ll make that code available? (Or is it already out there and I’m just missing it.)

    -Andrej Kyselica

  2. I haven’t posted source just yet since i also need to submit this for class. I’ll consider it once classes are done this term (which is in about a week).

  3. Jon says:

    Awesome, this has tons of potential. Are you considering using this in your photo album for name searches?

  4. Manfre says:

    Happy Birthday!

Comments are closed.

Skip to main content