IEnumerable Distinct through a Func extension method


In a recent posting, available here, I tried to explain why the IEnumerable Distinct() method employs an IEqualityComparer implementation. In this posting I will provide a simple extension to IEnumerable for the Distinct() method that allows one to use a Func to determine a property on which the Distinct() method can operate.

The extension method approach provides a much simpler process for when distinct operations are needed on single property values.

Using the Product type outlined in the previous posting, this will allow the following Distinct() operation:

IEnumerable<Product> distinctProducts =
products.Distinct<Product, string>(prod => prod.UniqueIdentifier);


The UniqueIdentifier property is a read-only value that determines the product uniqueness:
public string UniqueIdentifier
{
get
{
if (this.ProductVersion == null)
{
return this.Reference;
}
else
{
return string.Format("{0}:{1}-{2:00}-{3:00}",
this.Reference,
this.ProductVersion.VersionName,
this.ProductVersion.VersionNumber,
this.ProductVersion.SubVersionNumber);
}
}
}

 
The complete implementation for this Distinct() extension method is as follows:
public static class EnumerableExtensions
{
public static IEnumerable<TSource> Distinct<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> getKey)
{
Dictionary<TKey, TSource> dictionary = new Dictionary<TKey, TSource>();

foreach (TSource item in source)
{
TKey key = getKey(item);
if (!dictionary.ContainsKey(key))
{
dictionary.Add(key, item);
}
}

return dictionary.Select(item => item.Value);
}
}

The Func in the Distinct() method calls provide a mapping from the source type, TSource, to the property type, TKey, on which the distinct comparisons will be performed.

In this implementation the determination of distinctness is made through the use of the Dictionary<TKey, TValue> generic class.

For each item in the Enumerable source a dictionary lookup is performed using the property value calculated in the Func expression. If no entry is found then the item is added to the dictionary, otherwise the duplicate item is skipped and the next item processed. Using this methodology, the dictionary will ultimately contain only distinct elements, accessible through the final Select() projection.

So how does the performance of the two options compare? As a comparison I executed 500,000 Distinct() method calls over a list containing 9 items with 3 duplicates. The IEqualityComparer approach took 2 seconds and the Func extension took 10 seconds; showing the hashing method is faster.

However if you just need to perform Distinct operations on single properties for small collections/lists, then the extension method is a much simpler programmatic approach; with reasonable performance.

Hopefully you will find this implementation useful.

Written by Carl Nolan

Comments (2)

  1. John Kraft says:

    Nice post.  Very helpful.  I've been looking at this issue for a while.  I used your solution, but made some minor changes. I don't know if you will find them useful at all.  If not, then just delete this

    1) I used the HashSet and it's built in features for the if statement.  It seems a little more light weight for me.  

    2) I used the yield return statement to return the item.  

    I executed the Dintinct() call, as you did, 500,000 times over your list of 9 items with 3 duplicates, and the total time was 3.9ms.  I must be running the test completely different, because that time is not even in the same ball park as yours.

    My code is as follows…

    public static class EnumerableExtensions

       {

           public static IEnumerable<TSource> Distinct<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> getKey)

           {

               HashSet<TKey> keys = new HashSet<TKey>();

               foreach (TSource item in source)

               {

                   TKey key = getKey(item);

                   if (keys.Add(key))

                   {

                       yield return item;

                   }

               }

           }

       }

  2. Carl Nolan says:

    The times you are seeing are probably due to deferred execution. Do a ToList of the result in your loop and you will see the times rise as it will force the Distinct to execute.

    I did have a yield solution for an IList and i always got about 4ms execution.