LINQ Farm: Using Distinct and Avoiding Lambdas

This is the fourth in a series of articles on LINQ. This article focuses on an important operator from the list of 49 LINQ operators available in the May CTP. This operator, called Distinct(), is different from the other operators we have seen because it is called directly.

This post focuses on five related ideas:

  1. Most query operators such as Select(), Where() and GroupBy() take something called a lambda as a parameter.
  2. Lambdas are difficult to write.
  3. Query expressions were created in large part to allow developers to use LINQ without having to learn the complex syntax associated with lambdas.
  4. A few query operators, such as Distinct(), do not take lambdas as parameters. As a result, they are easy to call.
  5. Query expressions were therefore not created for operators such as Distinct() that do not take lambdas.

This post will explain these ideas in depth.

Overview

In this post, I will continue to avoid tackling lambdas directly. The C# team created a syntax called query expressions, which allows you to write both simple and advanced queries without using lambdas.

Many discussions of LINQ that I have seen on the web start by focusing on lambdas. I believe this approach will make LINQ hard for many developers to understand. As a result, I'm telling you about everything except lambdas.

You do not need to understand lambdas in order to harness the power of LINQ. Let's start with what is easy to understand, and then later move on to more difficult subjects.

It is exciting to finally see a way to query data that is built directly into a computer language. When you fire up the compiler and start using the technology directly, you will see that it is fun to play with the language, and fun to see what it yields.

Unless you really love complex syntax, don't get hung up with lambdas and expression trees. Just have fun with LINQ. It is a great way to write queries.

Groups of Operators

As you recall from the previous post, the code in our sample programs relies on an in-memory "table", which is really just a collection or list of objects of type Operator:

    class Operator
    {
        public int OperatorID;
        public string OperatorName;
        public string OperatorType;
    }

There are 49 different operators in the May CTP. Therefore, the OperatorList "table" has 49 rows in it. Here is a look at 12 sample rows which will give you a sense of the structure of this "table":

    OperatorID  OperatorName  OperatorType
    1           Where         Restriction
    2           Select        Projection
    3           SelectMany    Projection
    4           Take          Partitioning
    5           Skip          Partitioning
    6           Range         Generation
    7           Repeat        Generation
    8           Empty         Generation
    9           Distinct      Set
    10          Union         Set
    11          Intersect     Set
    12          Except        Set

The OperatorType on the right illustrates a scheme for grouping the various types of operators. Trying to keep track of 49 different operators is a chore, so the team grouped related sets of operators together into 14 different categories, or operator types. In the table above, you can see that Range, Repeat, and Empty are part of the Generation category. In this post, we will frequently focus on the Distinct operator, which is part of the Set category.

Exploring the OperatorTypes

In Listing One, you can see a few of the different ways of exploring the OperatorTypes from our table. This post and the next will focus on the methods found in this listing.

Listing One: A few of the many different ways to begin exploring the OperatorTypes

 

    1:  using System;
    2:  using System.Collections.Generic;
    3:  using System.Text;
    4:  using System.Query;
    5:   
    6:  namespace QueryLister
    7:  {
    8:      class GroupsAndSets
    9:      {
   10:          private List<Operator> operatorList;
   11:          private System.Collections.IList list;
   12:   
   13:          public GroupsAndSets(System.Collections.IList list)
   14:          {
   15:              operatorList = Operators.GetOperatorList();
   16:              this.list = list;
   17:          }
   18:   
   19:          private void Display(IEnumerable<string> s)
   20:          {
   21:              foreach (var value in s)
   22:              {
   23:                  list.Add(value);
   24:              }
   25:          }
   26:   
   27:          public void DistinctWithLambda()
   28:          {               
   29:              var s = operatorList.Select(p => p.OperatorType).Distinct();
   30:   
   31:              Display(s);
   32:          }
   33:   
   34:          public void SimpleDistinct()
   35:          {
   36:              var s = (from p in operatorList                     
   37:                       select p.OperatorType).Distinct();
   38:   
   39:              Display(s);
   40:          }
   41:   
   42:          public void DistinctOrdered()
   43:          {
   44:              var s = (from p in operatorList
   45:                       orderby p.OperatorType
   46:                       select p.OperatorType).Distinct();
   47:   
   48:              Display(s);
   49:          }
   50:      }
   51:  }

I will focus on four methods. The first three appear in Listing One; the fourth is shown later in this post:

  • SimpleDistinct
  • DistinctOrdered
  • DistinctWithLambda
  • NotWhatYouExpectDistinct

In future posts, I will discuss the GroupBy operator. I will also "normalize" our data by placing the OperatorTypes in their own "table."

The Distinct Operator

The SimpleDistinct method demonstrates the easiest way to get a look at the unique entries in the OperatorType field of our "table." The output from the SimpleDistinct method is shown in Figure One.

Figure One: The SimpleDistinct method shows all the unique instances of the OperatorType field from the OperatorList "table". The Distinct operator itself is part of the Set group.

The code on lines 36 and 37 asks the compiler to "find all the unique instances of the OperatorType from the OperatorList 'table.'" Or, you could say "From the OperatorList select all the OperatorTypes that are distinct." 

If you go look again at lines 36 and 37, you will see that it is almost easier to read the code directly than it is to try to turn it into a proper English sentence. This simple syntax is the beauty of query expressions, and one of the key features of the LINQ technology.

Why is the Distinct Operator Called Directly?

Query operators are method calls. In other words, there are methods in the LINQ API called Select(), GroupBy(), Distinct(), etc. We don't usually call these methods directly because most of them take lambdas as parameters, and many people find that lambdas are hard to understand. To help developers avoid the complex task of writing lambdas, the team invented query expressions, which are "syntactic sugar" that sits on top of lambdas.

Look at line 29. There you can see a call to the Select operator that takes a funny-looking parameter: Select(p => p.OperatorType). The parameter, p => p.OperatorType, is a lambda. You can tell it is a lambda because it includes that funny-looking "goes to" operator: =>. To read it out loud, you would say "p goes to p dot OperatorType."

Query expressions create lambdas behind the scenes, without forcing developers to compose them directly. We write query expressions that use terms like select, where, and group. Here is an example from an earlier post:

    from p in operatorList
    select p;

This is a query expression. There are no direct calls to the Select() operator in this code. Here is another example:

    from p in operatorList
    group p by p.OperatorType
    into MyGroup
    select MyGroup.Key;

Again, there are no direct calls to the GroupBy or Select operators. Behind the scenes these query expressions are converted into calls to Select() and GroupBy(), each of which takes a lambda as a parameter.
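To make that translation concrete, here is a minimal sketch that writes the grouping query both ways. It assumes the released System.Linq namespace (the May CTP used System.Query), and the three-row SampleList, the TranslationDemo class, and its method names are hypothetical stand-ins of mine, not part of the sample program.

```csharp
using System;
using System.Collections.Generic;
using System.Linq; // assumption: released namespace; the May CTP used System.Query

// Same shape as the Operator class used throughout this series.
class Operator
{
    public int OperatorID;
    public string OperatorName;
    public string OperatorType;
}

static class TranslationDemo
{
    // Hypothetical three-row sample; the real "table" has 49 rows.
    public static List<Operator> SampleList() => new List<Operator>
    {
        new Operator { OperatorID = 1, OperatorName = "Where", OperatorType = "Restriction" },
        new Operator { OperatorID = 2, OperatorName = "Select", OperatorType = "Projection" },
        new Operator { OperatorID = 3, OperatorName = "SelectMany", OperatorType = "Projection" },
    };

    // The query expression form: no lambdas in sight.
    public static List<string> KeysViaQueryExpression(List<Operator> operatorList)
    {
        var s = from p in operatorList
                group p by p.OperatorType
                into MyGroup
                select MyGroup.Key;
        return s.ToList();
    }

    // Roughly what the compiler produces: direct operator calls, each taking a lambda.
    public static List<string> KeysViaMethodCalls(List<Operator> operatorList)
    {
        var s = operatorList
            .GroupBy(p => p.OperatorType)
            .Select(MyGroup => MyGroup.Key);
        return s.ToList();
    }

    static void Main()
    {
        var list = SampleList();
        Console.WriteLine(string.Join(", ", KeysViaQueryExpression(list)));
        Console.WriteLine(string.Join(", ", KeysViaMethodCalls(list)));
    }
}
```

Both methods should return the same keys in the same order, which is the whole point: the query expression is just a friendlier spelling of the method calls.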

If you look at the code for the SimpleDistinct method you will see that it is a little bit different from the other query expressions we have looked at so far. In particular, parentheses are used to bind together the query expression proper, and then a method called Distinct() is called on the result of that query expression. In the code below, the query expression itself is the part inside the parentheses, and the direct call to the Distinct() operator follows the closing parenthesis:

    (from p in operatorList
     select p.OperatorType).Distinct();

Look back at the methods from previous posts, or look at the Group and GroupOrdered methods from this post, and you will see that this pattern is atypical. Normally we hide direct calls to query operators behind query expressions.

To understand why this method is different you need to remember that query expressions were created to save you from having to type lambdas. A query expression syntax was created for almost all the cases where you would pass a lambda as a parameter to a query operator.

The Set operators do not take lambdas as a parameter. Distinct is one of the Set operators. The other set operators are Union(), Intersect() and Except().

Since they do not take lambdas as parameters, the C# team decided that the Set operators do not need to hide behind query expressions. In effect, they said, "Oh look. It's easy to call a Set operator like Distinct, so we won't bother to create a query expression for it." In fact, it is easier to just directly call these methods than it would be to invent a query expression syntax for them.
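Here is a short sketch of what calling the four Set operators directly looks like. The two string arrays and the SetOperatorDemo class are hypothetical examples of mine, and I am assuming the released System.Linq namespace instead of the CTP's System.Query.

```csharp
using System;
using System.Linq; // assumption: released namespace; the May CTP used System.Query

static class SetOperatorDemo
{
    // Two small, made-up lists of operator names.
    static readonly string[] first = { "Where", "Select", "Take", "Select" };
    static readonly string[] second = { "Select", "Skip", "Take" };

    // Each Set operator is an ordinary method call with no lambda argument.
    public static string[] Unique() => first.Distinct().ToArray();          // duplicates removed
    public static string[] InBoth() => first.Intersect(second).ToArray();   // names found in both lists
    public static string[] InEither() => first.Union(second).ToArray();     // names found in either list
    public static string[] OnlyInFirst() => first.Except(second).ToArray(); // names missing from second

    static void Main()
    {
        Console.WriteLine(string.Join(", ", Unique()));
        Console.WriteLine(string.Join(", ", InBoth()));
        Console.WriteLine(string.Join(", ", InEither()));
        Console.WriteLine(string.Join(", ", OnlyInFirst()));
    }
}
```

Notice that none of these calls needs a "goes to" operator. The only arguments are the sequences themselves.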

NOTE: Throughout this section, I have emphasized that lambdas are difficult to use. You've probably noticed, however, that the DistinctWithLambda method is not really that much more complex than the SimpleDistinct method. In fact, it takes 69 characters to write the SimpleDistinct method, and 66 characters to write the DistinctWithLambda method. In this particular case, the lambda we are dealing with is simple, and easy to write. However, lambdas can easily become complex and lengthy, while query expressions are typically shorter and easier to read.

Danger: Not What You Would Expect

I'll end this post by examining the NotWhatYouExpectDistinct method:

    public void NotWhatYouExpectDistinct()
    {
        var s = from p in operatorList
                select p.OperatorType.Distinct();

        foreach (var value in s)
        {
            list.Add(value);
        }
    }

This method looks similar to the SimpleDistinct method:

    public void SimpleDistinct()
    {
        var s = (from p in operatorList
                 select p.OperatorType).Distinct();

        Display(s);
    }

The difference between the two methods is that one uses parentheses to set off the query expression, and the other does not. If you don't use the parentheses, then the Distinct() method gets bound to p.OperatorType. Because a string is itself a sequence of characters, Distinct() then works on the characters of each individual OperatorType rather than on the list of OperatorTypes, and you end up with the output in Figure Two. This is probably not what you want. Be sure to use parentheses when working with Distinct()!

Figure Two: Oh my gosh, I forgot the parentheses!
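If you want to see the binding problem in isolation, here is a small sketch. It is not the code behind the figure: the three-element operatorTypes array, the ParenthesesDemo class, and the step that joins the characters back into strings are my own assumptions, added only to make the result easy to print. I am also assuming the released System.Linq namespace.

```csharp
using System;
using System.Collections.Generic;
using System.Linq; // assumption: released namespace; the May CTP used System.Query

static class ParenthesesDemo
{
    // Hypothetical sample values with one duplicate.
    static readonly string[] operatorTypes = { "Set", "Projection", "Set" };

    // No parentheses: Distinct() binds to each string, so it removes
    // repeated characters inside that string and leaves every row in place.
    public static List<string> WithoutParentheses()
    {
        var s = from t in operatorTypes
                select t.Distinct(); // one IEnumerable<char> per row

        // Join the characters back together so each row can be displayed.
        return s.Select(chars => new string(chars.ToArray())).ToList();
    }

    // Parentheses: Distinct() binds to the whole query result,
    // so duplicate rows are removed.
    public static List<string> WithParentheses()
    {
        var s = (from t in operatorTypes
                 select t).Distinct();
        return s.ToList();
    }

    static void Main()
    {
        Console.WriteLine(string.Join(", ", WithoutParentheses()));
        Console.WriteLine(string.Join(", ", WithParentheses()));
    }
}
```

The first method still returns three rows, and "Projection" loses its second "o". The second method returns the two unique values, which is what we were after.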

Alphabetizing the Output from Distinct

The output shown in Figure One is not arranged alphabetically. The DistinctOrdered method eliminates this potential shortcoming by using the OrderBy operator:

    1:  public void DistinctOrdered()
    2:  {
    3:      var s = (from p in operatorList
    4:               orderby p.OperatorType
    5:               select p.OperatorType).Distinct();
    6:   
    7:      Display(s);
    8:  }

Most of the code here is identical to that in SimpleDistinct. On line 4, however, you can see that a new operator has been introduced. This allows us to order the output, as shown in Figure Three.

Figure Three: The unique OperatorTypes arranged alphabetically.

The orderby operator shown on line 4 does exactly what you would expect it to do. It is intuitive and easy to use. It provides exactly the kind of simple syntax you would want to use to order a list like this one.
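One variation you might try is to call Distinct() first and sort afterwards. As far as I know, the documentation for Distinct describes its result as an unordered sequence, so sorting last is the more cautious arrangement. The sketch below is only an illustration under that assumption: the operatorTypes array and the OrderedDistinctDemo class are hypothetical, and it uses the released System.Linq namespace.

```csharp
using System;
using System.Collections.Generic;
using System.Linq; // assumption: released namespace; the May CTP used System.Query

static class OrderedDistinctDemo
{
    // Hypothetical sample values, unsorted and with duplicates.
    static readonly string[] operatorTypes =
        { "Set", "Projection", "Generation", "Set", "Projection" };

    // The pattern from DistinctOrdered: sort inside the query, then call Distinct().
    public static List<string> OrderThenDistinct()
    {
        var s = (from t in operatorTypes
                 orderby t
                 select t).Distinct();
        return s.ToList();
    }

    // The cautious variation: remove duplicates first, then sort what is left.
    public static List<string> DistinctThenOrder()
    {
        var s = from t in operatorTypes.Distinct()
                orderby t
                select t;
        return s.ToList();
    }

    static void Main()
    {
        Console.WriteLine(string.Join(", ", OrderThenDistinct()));
        Console.WriteLine(string.Join(", ", DistinctThenOrder()));
    }
}
```

With this sample data the two methods give the same alphabetized list. The second form simply does not depend on Distinct() preserving the order it was handed.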

Summary

In this post you have been introduced to the Distinct operator. The syntax for using this operator is quite simple.

Much of this post, however, focuses on helping you understand the motivation for query expressions, and why there are times when you need to step past query expressions and call query operators directly. The Distinct operator, for instance, is called directly.

If you understand that query expressions are a means of hiding lambdas, and if you can see that some query operators do not take lambdas and can therefore be called directly, then you have assimilated the main point of this post.

Hopefully, working with these simple query expressions has helped you begin to get a feeling for how LINQ is put together. Despite all the complex talk of lambda expressions and expression trees, LINQ is at heart a simple and easy-to-use syntax. We will continue to explore this fun tool in future posts. After a time, it should become easy for you to use in your own programs.

 


LINQQueryLister02.zip