A friend of mine asked me about common algorithms for determining string similarity. One of the most well-known string similarity algorithms is the Levenshtein edit distance algorithm, possibly because it’s frequently used in computer science algorithm courses as a classic example of dynamic programming. Levenshtein edit distance computes the number of insertions, deletions, or substitutions it would take to go from one string to another. For example, the distance between “dog” and “dogs” is 1, because we can insert an ‘s’ at the end of “dog” to get to “dogs”. The distance between “puppy” and “lucky” is 3, because we can swap the ‘l’ for the first ‘p’, the ‘c’ for the second ‘p’, and the ‘k’ for the third ‘p’. Etc. More information on the Levenshtein edit distance is available in wikipedia.

I wrote up an implementation of the algorithm in C# for comparing strings, but then I realized that with generics it would be very easy to extend it to support any IEnumerable<T> of equatable types. This makes it easy to determine, for example, the edit distance between two arrays of integers. For anyone interested, here’s the resulting code:

/// <SUMMARY>Computes the Levenshtein Edit Distance between two enumerables.</SUMMARY>

/// <TYPEPARAM name=”T”>The type of the items in the enumerables.</TYPEPARAM>

/// <PARAM name=”x”>The first enumerable.</PARAM>

/// <PARAM name=”y”>The second enumerable.</PARAM>

/// <RETURNS>The edit distance.</RETURNS>

public static int EditDistance<T>(IEnumerable<T> x, IEnumerable<T> y)

where T : IEquatable<T>

{

// Validate parameters

if (x == null) throw new ArgumentNullException(“x”);

if (y == null) throw new ArgumentNullException(“y”);// Convert the parameters into IList

instances

// in order to obtain indexing capabilities

IList<T> first = x as IList<T> ?? new List<T>(x);

IList<T> second = y as IList<T> ?? new List<T>(y);// Get the length of both. If either is 0, return

// the length of the other, since that number of insertions

// would be required.

int n = first.Count, m = second.Count;

if (n == 0) return m;

if (m == 0) return n;// Rather than maintain an entire matrix (which would require O(n*m) space),

// just store the current row and the next row, each of which has a length m+1,

// so just O(m) space. Initialize the current row.

int curRow = 0, nextRow = 1;

int[][] rows = new int[][] { new int[m + 1], new int[m + 1] };

for (int j = 0; j <= m; ++j) rows[curRow][j] = j;// For each virtual row (since we only have physical storage for two)

for (int i = 1; i <= n; ++i)

{

// Fill in the values in the row

rows[nextRow][0] = i;

for (int j = 1; j <= m; ++j)

{

int dist1 = rows[curRow][j] + 1;

int dist2 = rows[nextRow][j – 1] + 1;

int dist3 = rows[curRow][j – 1] +

(first[i – 1].Equals(second[j – 1]) ? 0 : 1);rows[nextRow][j] = Math.Min(dist1, Math.Min(dist2, dist3));

}// Swap the current and next rows

if (curRow == 0)

{

curRow = 1;

nextRow = 0;

}

else

{

curRow = 0;

nextRow = 1;

}

}// Return the computed edit distance

return rows[curRow][m];

}

In addition to generics, this implementation also showcases a new feature of C# 2.0, the null coalescing operator (??). The ?? operator returns the left-hand operand if it is not null, otherwise it returns the right-hand operator. I use it in the above code in order to get IList<T> representations of the supplied IEnumerable<T> instances passed as parameters. For computing the edit distance, I needed the capability to index into each of the lists being compared, but I also wanted to support any enumerable type, such as System.String (which implements IEnumerable<char>). I can easily create two new List<T> instances from the supplied parameters, but that could be an expensive operation if the lists provided to the method are long, as building a List<T> from an IEnumerable<T> requires copying all of the data from the latter into the former (if the IEnumerable<T> is an ICollection<T>, this might result in a straightforward array copy, but if it doesn’t, it results in enumerating through every element in the IEnumerable<T> and Add’ing each individual item to the new List<T>). Instead, I can save a lot of time and effort if the IEnumerable<T> instance already implements IList<T>. If it does, I can simply cast the IEnumerable<T> to IList<T>, and if it doesn’t, I can take the expensive path of creating the new List<T>. The ?? operator makes this easy. Rather than having to write code like:

IList<T> first;or:

if (x is IList<T>)

{

first = (IList<T>)x;

}

else

{

first = new List<T>(x);

}

IList<T> first = (x is IList<T>) ? (IList<T>)x : new List<T>(x);or:

IList<T> first = x as IList<T>;I can simply write:

if (first == null) first = new List<T>(x);

IList<T> first = x as IList<T> ?? new List<T>(x);Nice and neat.

Cool, never knew about the ?? operator ðŸ™‚ Thanks.

thanks a much for that one man ðŸ™‚ i removed templating in my case, which was super easy, so you really made my day ðŸ˜‰

The Levenshtein algorithm is very precise, but it’s very slow. It has O(n*m) complexity. In a duplicate finding situation, where you have to compare at least a few thousand entries with themselves, you get to have N^2*n^2 operations where N is the number of entries and n the average word length.

But humans recognize duplicates in a glance, and humans are slow, so there must be a faster, less precise algorithm for finding string distance. I have tried to implement one and I’ve created the Sift2 algorithm: http://siderite.blogspot.com/2007/01/super-fast-string-distance-algorithm.html . In my opinion it works very much like Levenshtein, at least where words are concerned, and it’s really fast. Complexity O(n*constant) where default constant is 5.

I’ve received remarcably little input about it, though.

Greate stuff – I was able to get this converted to VB.NET quite easily.

What’s your license on this, Stephen? Can I snag it for personal use, if I give you credit?

Excuse me, an example of how to use this method to compare two strings will be very useful, tnx

Awesome! This is what I've been looking for. Nice start on add distance to my application metrics.

Strings can be compare as

string first = "welcome";

string second = "weldome";

and by calling method like

int distance = EditDistance<Char>(first.ToCharArray(), second.ToCharArray());

Hi Stephen,

Many thanks for this excellent implementation! I'm using it to good effect in the "Fuzzy Search" feature of Entrian Source Search ( entrian.com/source-search ).

— Richie Hindle

Very good implementation. The only thing I've changed is the swapping of current and next row to be like:

// Swap the current and next rows: 0->1 , 1->0 .

curRow = ++curRow % 2;

nextRow= ++nextRow % 2;

// Swap the current and next rows

nextRow = curRow;

curRow = 1 – curRow;

ðŸ™‚