Starting last Friday, I've been working on a StringNormalizationList data structure that encapsulates the StringNormalization class I was working on before. Essentially, you keep adding strings to it, and then you must Clean the structure. The cleaning process builds a “set of sets”: a list of all the sets of words that the algorithm deems similar (or rather, the same). This data structure now lets me go through a long list of strings and “bin” them into the appropriate buckets. Definitely pretty useful.
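To make the add-then-Clean flow concrete, here is a rough Python sketch. It is not the actual implementation: the similarity test (difflib's `SequenceMatcher` with a 0.8 cutoff), the `threshold` parameter, and the method names `add` and `clean` are all assumptions standing in for whatever StringNormalization really does.

```python
from difflib import SequenceMatcher


class StringNormalizationList:
    """Hypothetical sketch: collect strings, then bin similar ones together."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed similarity cutoff, not from the post
        self.items = []             # raw strings, in insertion order
        self.bins = []              # the "set of sets", built by clean()

    def add(self, s):
        self.items.append(s)

    def _similar(self, a, b):
        # Stand-in for the real StringNormalization comparison.
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        return ratio >= self.threshold

    def clean(self):
        # Put each string into the first bin holding a similar string,
        # or start a new bin if none matches.
        self.bins = []
        for s in self.items:
            for b in self.bins:
                if any(self._similar(s, t) for t in b):
                    b.add(s)
                    break
            else:
                self.bins.append({s})
        return self.bins


snl = StringNormalizationList()
for name in ["Microsoft Corp", "microsoft corp.", "Apple Inc"]:
    snl.add(name)
bins = snl.clean()  # two buckets: the Microsoft variants and Apple
```

Note that this naive version compares each new string against every string already binned, so it is quadratic in the worst case, which is exactly why a smarter lookup would help.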
There are some optimizations that I’d like to make. In particular, I’d like to find a hash function that maps similar strings to similar hash keys, so that likely matches can be found without comparing every pair. I’ll pose the question here… Does anyone know of such a hash function?
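For what it's worth, one family of techniques that matches this description is locality-sensitive hashing. Below is a minimal sketch of one such scheme, SimHash, over character trigrams. It is only an illustration of the idea and not something StringNormalizationList uses: the trigram size, the 64-bit width, and the use of MD5 as the per-trigram hash are all arbitrary assumptions.

```python
import hashlib


def simhash(text, bits=64, n=3):
    """Return a fingerprint; similar strings tend to differ in few bits."""
    text = text.lower()
    # Character n-grams (a string shorter than n yields one short gram).
    grams = [text[i:i + n] for i in range(max(1, len(text) - n + 1))]
    counts = [0] * bits
    for g in grams:
        digest = hashlib.md5(g.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each gram votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when its votes are net positive.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


d = hamming(simhash("Microsoft Corp"), simhash("Microsoft Corp."))
```

"Similar keys" here means a small Hamming distance between fingerprints rather than numerically close values, so bucketing would still need something like grouping on bands of bits.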