Comprehending Comprehensions

Not long ago, I was reading through some articles posted on programming.reddit.com when I came across an article claiming that C# is trying to be a dynamic language.  One user posted a comment that mentioned that C# 3.0 included among other things "embedded SQL".  Unfortunately, it seems that there is still some confusion about what query expressions are.  Let's be clear: query expressions are not embedded SQL.  In fact, they are closer to list comprehensions in languages like Haskell and Python than they are to embedded SQL.

At a high level, query expressions provide nine basic operations (from, join, join...into, group...by, orderby, where, let, select, into).  Each of these basic operations is expressed as a clause.  The clauses are then chained together to provide a declaration of query; however, if you only view a query then you can't be sure what the behavior at runtime will be.  The reason is that queries are translated into a series of method calls but the methods that are invoked are not predefined methods from the compiler's point of view.  The compiler just creates the calls and binds the calls to whatever methods are applicable given the arguments and the receiver of the call.  This is both the source of the beauty and the confusion of query expressions.

Syntactic Sugar

Query expressions are an interesting language feature because they are pure syntactic sugar.  But to really understand queries we need to see what lies beneath.  Let's break through the syntactic sugar.

Consider the following query:

    1:  var q = from x in foo
    2:          where x > 0
    3:          select x + 1;

 This query is transformed into:

    1:  var q = foo.Where(x => x > 0)
    2:             .Select(x => x + 1);

But what "Where" and "Select" methods are being called?  It is impossible to tell from this code snippet; it depends on what the type of "foo" is and which extension methods are in scope.  This is the beauty of query expressions and why we can do neat things like LINQ to Objects, DLINQ, and LINQ to Entities and so on.  Each of these providers can author their own "Where" and "Select" methods which would be called if foo was a DLINQ object or a LINQ to entities object.  If you wanted the queries to behave differently all you need to do is define your own methods such that the compiler will bind to them when you write a query.  The set of methods that needs to be implemented is called the "Query Pattern" and implementers of this pattern are called "Providers".

Transformation Rules

Their are (only ;) eighteen rules that define how queries are transformed.  Here is how the transformations can be performed given that you have a query (each transformation lists the pattern that to match first and then the resultant transformation on the next line):

First remove query continuations.

 1.   If the query has a continuation rewrite it (only rewrite the first one right now)

from x1 in e1 ... into x2 ...

from x2 in (from x1 in e1 ...) ...

Now transform the query into a list of clauses and for each clause if it is explicit then make it implicit like rules 2-4 state:

 2.   Explicit from

from T x1 in e1

from x1 in e1.Cast<T>()

 3.   Explicit join

join T x1 in e1 on k1 equals k2

join x1 in e1.Cast<T>() on k1 equals k2

 4.   Explicit join into

join T x1 in e1 on k1 equals k2 into g

join x1 in e1.Cast<T>() on k1 equals k2 into g

Now we check for the degenerate query and rewrite it in a special way.

 5.   Now if the query is degenerate then do a special transform

from x1 in e1 select x1

e1.Select(x1 => x1)

Finally, now that we have a list of clauses (without explicit types) we simply apply rules 6-18 repeatedly until we don't have a query any more. Think of it as pattern matching. If the list of clauses matches the given pattern then reduce the list given the transformation rule. Rinse and Repeat.

 6.   Two froms followed by a select

from x1 in e1 from x2 in e2 select e3

e1.SelectMany(x1 => e2, (x1, x2) => e3)

 7.   Two froms followed by a non-select

from x1 in e1 from x2 in e2 ...

from * in e1.SelectMany(x1 => e2, (x1, x2) => new { x1, x2 }) ...

 8.   From followed by a join followed by a select

from x1 in e1 join x2 in e2 on k1 equals k2 select e3

e1.Join(e2, x1 => k1, x2 => k2, (x1, x2) => e3)

 9.   From followed by a join followed by a non-select

from x1 in e1 join x2 in e2 on k1 equals k2 ...

from * in e1.Join(e2, x1 => k1, x2 => k2, (x1, x2) => new { x1, x2 }) ...

10.   From followed by a join with into followed by a select

from x1 in e1 join x2 in e2 on k1 equals k2 into g select e3

e1.GroupJoin(e2, x1 => k1, x2 => k2, (x1, g) => e3)

11.  From followed by a join with into followed by a non-select

from x1 in e1 join x2 in e2 on k1 equals k2 into g ...

from * in e1.GroupJoin(e2, x1 => k1, x2 => k2, (x1, g) => new { x1, g }) ...

12.  From followed by orderby (for some number of keys)

from x1 in e1 orderby k1, k2, k3 ...

from x1 in e1.OrderBy(x1 => k1).ThenBy(x1 => k2).ThenBy(x1 => k3) ...

13.  From followed by where

from x1 in e1 where e2 ...

from x1 in e1.Where(x1 => e2) ...

14.  From followed by let

from x1 in e1 let x2 = e2 ...

from * in e1.Select(x1 => new { x1, x2 = e2 }) ...

15.  From followed by non-identity select

from x1 in e1 select e2

e1.Select(x1 => e2)

16.  From followed by identity select

from x1 in e1 select x1

e1

17.  From followed by non-identity group by

from x1 in e1 group e2 by e3

e1.GroupBy(x1 => e3, x1 => e2)

18.  From followed by identity group by

from x1 in e1 group x1 by e2

e1.GroupBy(x1 => e2)

Example Translation

Consider the following query:

    1:  from x in foo
    2:  where x > 0
    3:  from y in bar
    4:  select x + y

To translate this query, we first remove check to see if rule 1 is applicable (no) and then we translate the query into a list of clauses while applying rules 2-4 (which don't apply to this query).  After rules 1-4 we have the following list of clauses.

[from x in foo, where x > 0, from y in bar, select x + y]

Rule 5 does not apply because it is not the degenerate identity select.  Now we will apply rules 6-18 repeatedly until the query is fully reduced.  The rules are always applied to the front of the list and so the first rule that is applicable is rule 13.  Which states:

from x1 in e1 where e2 ...

from x1 in e1.Where(x1 => e2) ...

Applying this rule the list becomes:

[from x in foo.Where(x > 0), from y in bar, select x + y]

Now the next rule that is applicable is rule 6 which states:

from x1 in e1 from x2 in e2 select e3

e1.SelectMany(x1 => e2, (x1, x2) => e3)

The list now becomes:

[foo.Where(x > 0).SelectMany(x => bar, (x, y) => x + y)]

We no longer have a query so we are done and the final rewrite looks like:

    1:  foo.Where(x => x > 0)
    2:     .SelectMany(x => bar, (x, y) => x + y)

Other queries are translated in a similar fashion.  As you translate more of these, you will notice that the rules are in fact very simple and can be easily mechanically applied.  With some practice, you should be able to translate them quickly in your head much like doing math.  It is simply a process of applying the rules in the right order and reaching a fixed point.

You may have noticed that rule 1 actually creates a nested query.  The rewrite rules as I explained them did not remove this nested query.  After rewriting a query, recurse and rewrite queries found in its subexpressions.  And that is it! 

Implications

Interestingly, the C# spec does not specify much more than this about the semantics of query expressions; however, the rewrite rules do imply some significant details that may not be apparent.  Some of these details include scoping rules, error message and debugging behavior, and even hidden new language features like transparent identifiers.

Happy Querying!