How Linq to Objects Queries Work

If you have ever tried to step through a Linq to Objects query in the debugger, you may have been mildly surprised at the results.  It may have seemed as if the program had a mind of its own and ran certain expressions when it wanted to and not when it was supposed to.  Be assured, it is doing the right thing and the order of evaluation is giving you a glimpse of what actually happens when a query is evaluated.

Consider this query:

using System;
using System.Linq;

class Program
{
static void Main()
{
var foo = new[] { 4, 2, 1, 3, 0 };

    var q = from x in foo
let square = x * x
where square > x
select new { x, square };

    foreach (var item in q) Console.WriteLine(item);
}
}

Now consider where breakpoints can be set within this program:

Pop Quiz

Which order are the breakpoint spans visited in.  Try to guess the order of evaluation.  Of course you can verify the order in a debugger.

The order may of evaluation may surprise you.  Why does it behave like this?  Well, to really understand we need to translate it.  First, lets translate the query:

var q = foo.Select(x => new { x, square = x * x })
.Where(t0 => t0.square > t0.x)
.Select(t0 => new { t0.x, t0.square });

Now, lets change the extension methods to static method calls.

var q = Enumerable.Select(
Enumerable.Where(
Enumerable.Select(foo, x => new { x, square = x * x }),
t0 => t0.square > t0.x),
t0 => new { t0.x, t0.square });

Now, lets look at how Enumerable.Select and Enumerable.Where are implemented.  We can not include the "using System.Linq" and instead implement Select, Where, and Func ourselves.

using System;
using System.Collections.Generic;

delegate R Func<A,R>(A arg0);

static class Program
{
static void Main()
{
var foo = new [] { 4, 2, 1, 3, 0 };

    var q = Select(
Where(
Select(foo, x => new { x, square = x * x }),
t0 => t0.square > t0.x),
t0 => new { t0.x, t0.square });

    foreach (var item in q) Console.WriteLine(item);
}

  static IEnumerable<U> Select<T,U>(IEnumerable<T> sequence, Func<T, U> selector)
{
foreach (var item in sequence)
yield return selector(item);
}

  static IEnumerable<T> Where<T>(IEnumerable<T> sequence, Func<T, bool> predicate)
{
foreach (var item in sequence)
if (predicate(item))
yield return item;
}
}

Now consider what happens when we run this program which is essentially identical to our original program.  First, foo is assigned a reference to an array of ints.  Next since C# uses pass-by-value parameter passing by default, the arguments to the methods must be evaluated before the functions can be applied to their arguments.  So the innermost Select call is called first which creates a Select Iterator using foo as the source but note that it does not run the iterator.   This behavior is called lazy evaluation.  In some other world, in perhaps a different language, the iterator could immediately evaluate all of its results, cache them, and then return them when demanded; but, this is not what happens with iterators in C#.  When you call a method that is an iterator, the iterator is only constructed but not run.  Remember that the iterator object that was constructed implements IEnumerable and so in this case it can be used as the first parameter to the Where method.  Finally, the result of the Where method is passed to the final Select method.  The whole result is assigned to q.  Note, that q is not actually a realized result set.  In fact, q is object that can compute the result set when required.  Since, q is made up of iterators it will only do the minimum work necessary to return the next element when asked.

So in the foreach statement when the first element of q is requested, the outermost Select iterator begins running but it requires its source to return the first element which in turn runs the Where iterator which needs its source which in turn calls the innermost Select iterator which requires its source to be realized but that source is foo which simply returns 4 as the first element.  The innermost Select iterator then produces a tuple of x and the square of x and returns this result.  The Where iterator then checks the predicate against this first element and it passes so it returns the result.  The outermost Select iterator then produces a projection using this result.  Note that the minimum possible work was done to compute the first result of the query.

This works precisely as before for the second query result but when we ask for the third result things start to get a bit more interesting.  Again the outermost Select iterator calls the Where iterator which calls the innermost Select iterator which asks foo for its next element which is 1.  The innermost Select iterator makes a tuple with 1 and its square.  The Where iterator then checks the condition and it fails so it does not return the item but requests the innermost Select iterator to produce another result which is does and this second result then passes the Where predicate.  The outermost Select iterator then gets 3 and its square which are projected and returned as the third result of q.  It is easy to see that queries essentially become a pipeline where each computation is fed to the next and all are independent of the others as has been previously described.

There are several important implications based on how Linq to Objects queries actually work.

1.  Laziness enables the ability to work with infinite sequences

2.  Operations which require the entire source to be examined change the order of evaluation

1. Laziness and the Infinite

Here is a program which displays the first 10 positive odd numbers.

using System;
using System.Collections.Generic;
using System.Linq;

static class Program
{
static void Main()
{
var odds = from x in NaturalNumbers()
select 2 * x + 1;

    foreach (var item in odds.Take(10)) Console.WriteLine(item);
}

  static IEnumerable<int> NaturalNumbers()
{
for (var n = 0; ; n++)
yield return n;
}
}

Notice that the NaturalNumbers iterator never ends (Yes, it would overflow but lets ignore that for the sake of this discussion).  The query then says for each natural number project twice the number and add one which is the definition of odd.  If this query immediately evaluated the results then this program would not terminate since NaturalNumbers never terminates; however, the program does terminate.  When query is created, it only constructs the information necessary to evaluate the query later.  Then when we foreach over the query, the query returns one result at a time as they are required.  In fact, in this case we also add another computation step onto q in the foreach statement by saying that we will only take the first 10 results so only the first ten natural numbers were even created.  This is beautiful.  Lazy evaluation enables the ability to work with the infinite.  I will have a lot more to say about this topic another time.

2. Forced Evaluation

There are some operations that require whole source examination before a single result can be passed on to the next operation.  One such operation is orderby.  In order to determine the position of any element, the whole sequence must be examined.  Thus, if we add an orderby to our original query then the order of evaluation of the query will dramatically change.

var q = from x in foo
let square = x * x
where square > x
orderby x
select new { x, square };

If you run the query in the debugger now, you will notice that the let, where, and orderby operations are all performed before the first select operation is performed.  Subsequent elements will skip the let, where, and orderby and go straight to the select statement.  This is because the orderby operation forced the evaluation of all of the previous operations.  Subsequent results do not need to evaluate the let, where, or orderby operations because they have already been evaluated.

Other common operations that cause this behavior are join, join...into, and group...by clauses.

Corollary to (1) and (2) - Forced Evaluation and the Infinite

Now consider what would happen if we tried to use orderby with NaturalNumbers.  As you might expect, our program would never terminate.  To use the two together we would need to make sure that we only use a section of NaturalNumbers like NaturalNumbers.Take(10).  This is because even though queries are lazy, if operations are used which force the evaluation of all of the results then the benefits of being lazy are effectively removed.  So we must be careful when using infinite sequences.