So what’s the deal with this whole C# 3.0 / Linq thingy? (Part 2)


In the last post i discussed a little bit of background on why we wanted to introduce Linq, as well as a bit of info on what some basic C# Linq looked like.  In this post i’m going to dive a little deeper into some other interesting things we’re introducing as well.


Here’s the current example we’ve been using to drive the discussion along:

    Customer[] customers = GetCustomers();
    var custs = customers.Where(c => c.City == "Seattle").Select(c => c.Name);

Now, so far that’s a very C#-centric way to do queries over data.  However, it’s still a little bit heavyweight.  What about a more query-like syntax that does the same thing but is far more convenient?  Well, it turns out that we have that as well:

    var q =
        from c in customers
        where c.City == "Seattle"
        select c.Name;

This new query syntax is just syntactic sugar: the compiler uses patterns to transform it into the *exact* same C# query that i listed above.  This is the same way that we handle foreach (specifically by transforming it into a loop with calls to GetEnumerator, MoveNext, Current, and Dispose).
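To see that the two forms really are interchangeable, here is a small self-contained sketch.  It assumes the System.Linq namespace as it eventually shipped rather than the preview bits, and the Customer class and sample data are made up for illustration, since the post never shows them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Customer type; the post does not show the real one.
public class Customer
{
    public string Name { get; set; } = "";
    public string City { get; set; } = "";
}

public static class QuerySyntaxDemo
{
    // Hypothetical stand-in for the post's GetCustomers().
    public static Customer[] GetCustomers() => new[]
    {
        new Customer { Name = "Ann", City = "Seattle" },
        new Customer { Name = "Bob", City = "Portland" },
        new Customer { Name = "Cho", City = "Seattle" },
    };

    // The method-call form from the top of the post.
    public static IEnumerable<string> WithMethods(Customer[] customers) =>
        customers.Where(c => c.City == "Seattle").Select(c => c.Name);

    // The query form; the compiler rewrites it into the calls above.
    public static IEnumerable<string> WithQuerySyntax(Customer[] customers) =>
        from c in customers
        where c.City == "Seattle"
        select c.Name;

    public static void Main()
    {
        var customers = GetCustomers();
        // Compare the two result sequences element by element.
        bool same = WithMethods(customers).SequenceEqual(WithQuerySyntax(customers));
        Console.WriteLine(same); // True
    }
}
```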


Now, when looking at this you’ll almost certainly notice how it looks *almost*, but not quite, like SQL.  And you’ll probably be asking: “can’t you just make it look like SQL if it’s that close?  Is this just MS wanting to be a pain just for the heck of it??”  In this case, the answer is “No”.  One of the problems with the straight SQL-like approach is that we’d have to put the “select” first.  “Ok… what’s wrong with that?” you say.  Well, let’s take a look:

    var q =
        select c<dot>

Now, at this point, you’re constructing the final shape for this query.  You know you want to write “c.Name” and you’d like to use handy features like IntelliSense to help speed up typing it.  But you can’t!  Because you haven’t even stated where your data is coming from, there’s no way for the tools to know what “c” is this early in the expression.  This is because in SQL the scope of a variable actually flows backwards, i.e. you use variables before you’ve even declared them.  However, in C# you can only use something after it’s been declared.  So in order to better fit within this model (which has some very nice benefits), we made it so that “from” has to come first.  Beyond statement completion there are also issues of being able to construct large hierarchical queries in an understandable way.  Having the scope flow from left to right, top to bottom, makes that much simpler and brings a lot of clarity to your expressions.


Now what about projections?  They’re incredibly common operations in SQL.  You’re always doing things like “select a, b, c” and in essence projecting out the information you care about into these columns.  So how would we go about doing this sort of thing in C# 3.0?  Well, you could do this:

    var q =
        from c in customers
        where c.City == "Seattle"
        select new NameAndAge(c.Name, c.Age);

but that’s a real pain.  Any time i want to project any information out, i need to generate a new type and fill all its gunk in.  That means writing the class somewhere.  Creating a constructor for it.  Creating fields and properties.  Implementing .Equals and .GetHashCode.  etc. etc.  yech.  Far too much work, error-prone, and it causes API clutter.  So what can we do to alleviate that?  Well, in C# 3.0 a new feature called “Anonymous Types” comes to the rescue.  We can now write the following:

    var q =
        from c in customers
        where c.City == "Seattle"
        select new { c.Name, c.Age };

What this is doing is projecting the customer out into a new structural type with two properties “Name” and “Age”, both of which are strongly typed and which have been assigned the values of their corresponding properties in “c”.  What’s the type of “q” at this point?  Well, it’s an IEnumerable<???> where ??? is some anonymous type with those two properties on it.  BTW, it should now seem somewhat more obvious why the “var” keyword was added to the language.  In this case you cannot actually write down the type of “q”, but you need some way to declare it.  “var” comes to the rescue here.


So i could now write:

    foreach (var c in q) {
        Console.WriteLine(c.Name);
    }

and that would compile and run just fine.


Now “wait a minute!” you’re saying.  “Is this some sort of late-binding thang where we’re using reflection to pull out this data?”  No sir-ee.  In fact, if you were to try and write:

    foreach (var c in q) {
        Console.WriteLine(c.Company);
    }

then you would get a compiler error immediately.  Why?  Well, the compiler knows that the anonymous type which you’ve instantiated only has two members on it (Name and Age), and it’s able to flow that information into the type signature of ‘q’.  Then when foreach’ing over ‘q’, it knows that the type of ‘c’ is the same structural anonymous type we created earlier in the ‘select’.  So it will know that it has no “Company” property and appropriately inform you that your code is bogus.  All the strong, static typing of C# is there.  You are just allowed to eschew writing the type now and instead allow inference to take care of all of it for you.  Users of languages like OCaml will find this immediately familiar and comfortable.
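Here is a minimal sketch of that behaviour, written against C# as it eventually shipped.  The Customer class, its Age property, and the sample data are assumptions made for illustration.  The second method shows that the compiler also generates Equals and GetHashCode for the anonymous type, which is exactly the gunk the hand-written NameAndAge class would have needed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Customer type; the post does not show the real one.
public class Customer
{
    public string Name { get; set; } = "";
    public string City { get; set; } = "";
    public int Age { get; set; }
}

public static class AnonymousTypeDemo
{
    public static List<string> DescribeSeattleCustomers(IEnumerable<Customer> customers)
    {
        var q =
            from c in customers
            where c.City == "Seattle"
            select new { c.Name, c.Age };

        var lines = new List<string>();
        foreach (var c in q)
        {
            // The compiler knows c.Name is a string and c.Age is an int.
            lines.Add(c.Name + " is " + c.Age);
            // lines.Add(c.Company); // compile-time error: the type has no Company member
        }
        return lines;
    }

    // Two anonymous objects with the same property names, types, and order
    // share one compiler-generated type with value-based Equals/GetHashCode.
    public static bool SameShapeComparesEqual()
    {
        var a = new { Name = "Ann", Age = 30 };
        var b = new { Name = "Ann", Age = 30 };
        return a.Equals(b);
    }

    public static void Main()
    {
        var customers = new[]
        {
            new Customer { Name = "Ann", City = "Seattle", Age = 30 },
            new Customer { Name = "Bob", City = "Portland", Age = 41 },
        };
        foreach (var line in DescribeSeattleCustomers(customers))
            Console.WriteLine(line);                  // Ann is 30
        Console.WriteLine(SameShapeComparesEqual());  // True
    }
}
```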


Now, one thing that’s quite common in the object world is the usage and manipulation of hierarchical data, i.e. objects formed by collections of other objects formed by collections of… you get the idea.  Now, say you wanted to query your customers to get not only the customer name, but information about the orders they’ve been creating.  You could write the following very SQL-esque query:

    var q =
        from c in customers
        where c.City == "Seattle"
        from o in c.Orders
        where o.Cost > 1000
        select new { c.Name, o.Cost, o.Date };

We’ve now joined the customer with their own orders.  This would get the job done, but maybe it’s not really returning the information in the structure you want.  For one thing, the data isn’t grouped by customer.  So for every order made by the same customer you’re going to get a new element.  So let’s take it a little further:

    var q =
        from c in customers
        where c.City == "Seattle"
        select new {
            c.Name,
            Orders =
                from o in c.Orders
                where o.Cost > 1000
                select new { o.Cost, o.Date }
        };

Voila.  We’ve now created a hierarchical result.  Now, per customer you’ll only get one item returned.  And that item will have information about all the different orders they’ve made that fit your criteria.  Now you can trivially create queries that get you the results you want in the exact shape you want.
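The difference in shape between the two queries is easy to check on in-memory data.  The sketch below is only an illustration: the Customer and Order classes and the sample values are invented, and it runs over plain objects rather than a database.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types; the post does not show the real ones.
public class Order
{
    public decimal Cost { get; set; }
    public DateTime Date { get; set; }
}

public class Customer
{
    public string Name { get; set; } = "";
    public string City { get; set; } = "";
    public List<Order> Orders { get; set; } = new List<Order>();
}

public static class ShapeDemo
{
    static Order O(decimal cost) => new Order { Cost = cost, Date = new DateTime(2005, 9, 1) };

    public static List<Customer> SampleCustomers() => new List<Customer>
    {
        new Customer { Name = "Ann", City = "Seattle",  Orders = { O(1500), O(2500), O(3000), O(200) } },
        new Customer { Name = "Bob", City = "Portland", Orders = { O(5000) } },
        new Customer { Name = "Cho", City = "Seattle",  Orders = { O(1200) } },
        new Customer { Name = "Dee", City = "Seattle",  Orders = { O(50) } },
    };

    // Flat form: one element per (customer, matching order) pair.
    public static int FlatCount(IEnumerable<Customer> customers)
    {
        var q =
            from c in customers
            where c.City == "Seattle"
            from o in c.Orders
            where o.Cost > 1000
            select new { c.Name, o.Cost, o.Date };
        return q.Count();
    }

    // Hierarchical form: one element per customer, each carrying its own orders.
    public static List<string> NestedSummary(IEnumerable<Customer> customers)
    {
        var q =
            from c in customers
            where c.City == "Seattle"
            select new {
                c.Name,
                Orders =
                    from o in c.Orders
                    where o.Cost > 1000
                    select new { o.Cost, o.Date }
            };
        return q.Select(x => x.Name + ": " + x.Orders.Count()).ToList();
    }

    public static void Main()
    {
        var customers = SampleCustomers();
        Console.WriteLine(FlatCount(customers));      // 4
        foreach (var line in NestedSummary(customers))
            Console.WriteLine(line);                  // Ann: 3, Cho: 1, Dee: 0
    }
}
```

Note that in the hierarchical form a Seattle customer with no qualifying orders (Dee here) still shows up, just with an empty Orders sequence, whereas the flat form drops that customer entirely.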


Next up!  Drill downs into many of the specific new features that we’re bringing to the table.


But first: a teaser!  Say you have the following code:

    var customers = GetCustomersFromDataBase();
    var q =
        from c in customers
        where c.City == "Seattle"
        select new {
            c.Name,
            Orders =
                from o in c.Orders
                where o.Cost > 1000
                select new { o.Cost, o.Date }
        };

    foreach (var c in q) {
        // Do something with c
    }


Did you know that you will be able to write that code in C# 3.0 and DLinq will make sure that that query executes on the DB using SQL?  It will then only suck down the results that matched the query, and only when you foreach over them.  That’s right.  That entire “from …” expression will execute server side.  And it didn’t even need to be in “from” form.  If you’d written a query in method form, like customers.Where(c => c.City == "Seattle").Select(c => c.Name), then the same would be true.  How’s that for cool?  Stay tuned and a later post will tell you how that all works!
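The “only when you foreach over them” part can already be seen with ordinary in-memory queries.  The sketch below does not show DLinq or any SQL translation; it only demonstrates that building a query runs nothing and that the filter is evaluated during enumeration.  The class and method names are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredDemo
{
    // Counts how many times the filter has actually run.
    public static int PredicateCalls;

    public static IEnumerable<string> SeattleOnly(IEnumerable<string> cities)
    {
        return cities.Where(city =>
        {
            PredicateCalls++;
            return city == "Seattle";
        });
    }

    public static void Main()
    {
        var cities = new[] { "Seattle", "Portland", "Seattle" };

        var q = SeattleOnly(cities);
        Console.WriteLine(PredicateCalls); // 0: building the query ran nothing

        foreach (var city in q) { }
        Console.WriteLine(PredicateCalls); // 3: the filter ran once per element
    }
}
```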


Comments (31)

  1. Frans Bouma says:

    All nice and dandy, but you most of the time will have to define types anyway, for the simple reason that the place where you READ data isn’t always (most of the time) the place where you consume data. This means that where you consume data you call a method which retrieves the data for you (example: gui calls BL method).

    Though in the BL method, you can use ‘var’ all you want, but you can’t return it from the method, the method has to have a strong type. Which means that I have to define the type anyway.

    I’m developing O/R mappers for more than 3 years now and I know the necessity of ‘var’ seems to come from the fact that a dynamic list has to be stored in a statically defined type, and ‘var’ ‘solves’ that’. But only partly as I described above.

    I more and more get the feeling this ‘var’ construct is only useful when you directly read the data from the DB in your GUI tier for example. I don’t know if Microsoft actually investigated use-cases for this, but I know from my long experience with O/R mapper code and the people who use these things, they absolutely want to separate the data usage tier from the data producing tier, i.e.: they want to avoid at all costs that a GUI developer is able to make shortcuts and read/write to the db directly, thus bypassing BL code.

    We can discuss this deeper at the summit in a few weeks but I’d like to point out that it looks great at first, this ‘var’, but the more I think about it, it starts to look like a nice thing to demo and to use in small petprojects but is completely useless in multi-tier applications.

  2. Ayende says:

    I absolutely agree about var not being nearly as useful as it can be.

    Beyond that, about DLinq, are the queries being remoted & executed there, or is an SQL statement prepared from the expression tree and sent to the database?

    My understanding was that it is the latter case, not the first one, but your post seems to say otherwise.

  3. senkwe says:

    In your last two code samples, I take it there is a comma missing between c.Name and Orders?

  4. Douglas McClean says:

    Cyrus,

    I like this LINQ stuff a lot except for two things:

    1) DLINQ is great, but it is unfortunate that the object-database mapping metainformation has to live on attributes of the objects. This is unfortunate because I might use the same business type to map to multiple data stores (e.g. to a database for persistent storage and to XML for messaging, or even to a different database on the server vs. the smart client).

    2) It seems that LINQ query engines have to compile the queries to the underlying query language every time they are executed, which can’t be a cheap operation. This is because the query is bound with what would ordinarily be considered query parameters very early on. Thus if I declare a query to find all of a customer’s orders based on the customer ID, it seems like DLINQ will have to recompile that query to SQL at each execution (because it will be a different query, because the customer ID will be different in the general case). It seems that introducing something like QueryParameter<T> would allow the query to be defined with a "blank" to be filled in later, precompiled (by calling some method that instructs the query to figure out its SQL, and keep that in a cache, and maybe even cache a DynamicMethod for mapping types in and out), and then used multiple times later with different values. It seems like (not having measured) that this would be likely to be a significant performance win in many cases. It also seems that it might allow for earlier (compile time or static analysis time) checking of queries for satisfiability against the database metainformation.

    Keep up the good work!

  5. Cleve Littlefield says:

    I agree with the last comment about caching the query, for two other reasons.

    By parameterizing the query you can also get a cached execution plan and also be more secure because the dynamic data is passed as parameters.

    I hope that string concatenation is not the method used to build the final query. The dynamic pieces of the where need to be parameters and the whole query executed using sp_executesql for security.

    For a hierarchical query, how do we tune in case the SQL it generates is not optimal? We need a QA like tool that shows the final query output, shows queryplans, and lets us optimize.

  6. CyrusN says:

    Ayende: "Beyond that, about DLinq, are the queries being remoted & executed there, or is an SQL statement prepared from the expression tree and sent to the database?

    My understanding was that it is the latter case, not the first one, but your post seems to say otherwise."

    What’s the difference? It’s an implementation detail that will be determined based on which vendor you choose. Some implementations might generate SQL and execute it on the server. Some might remote the entire expression object over to your server and have it execute in say a CLR you have on that side. Both are possible.

  7. Firedancer says:

    It really looks complicated to me. Given a choice, I would prefer not to learn a third language to integrate two languages. SQL can really be complex. Also, how do we represent OUTER JOINS or CROSSJOINS in LINQ?

    Looking at this, I will prefer the traditional SQL in Stored Procedures and access it via ADO.NET. Much easier to work on. Also, I think OR Mappers are easier to understand and more usable. Not sure about others.

  8. damien morton says:

    Heres a question about DLinq:

    given a query, such as:

    var q =
        from c in customers
        where c.City == "Seattle"
        select new NameAndAge(c.Name, c.Age);

    Now, normally this is converted into an expression using extension methods from the System.Query namespace.

    What would happen if a completely different implementation of those extension methods was imported instead of System.Query. Is DLinq pluggable to that extent?

  9. damien morton says:

    a question on anonymous types…

    I kinda like this new notation, but only as a way of bringing together various query results. Clearly, these anonymous types arent usefull beyond the scope in which they are created, except as a data-holding class that can be reflected over.

    It would be usefull to be able to constrain these anonymous types to be immutable and/or structs.

    var cust = [Immutable.Class] new { Name="foo", Address="bar"}

    var cust = [Immutable.Struct] new { Name="foo", Address="bar"}

    or something like that.

  10. damien morton says:

    One of the more interesting capabilities of SQL is the ability to consisely join two tables:

    select * from foo join bar on foo.a = bar.a

    The result of this join will include all the properties of foo and bar. If a new property is added to foo or bar, queries such as the one above dont need to be changed; the added properties propagate naturally through to the result.

    With LINQ, everything is strongly typed, and as far as I can see, all projections need to be fully specified. If a new property is added to a table, the programmer has to find all references to that table and update any projections in any queries it might participate in.

    Could be nasty.

  11. Jared says:

    "This new query syntax is in fact just syntactic sugar that uses patterns to transform itself into the *exact* same C# query that i listed above."

    Will that pattern-transforming capability be exposed to the programmer, or is that an implementation detail? If you are here revealing that, in C#, metaprogramming will finally be possible, this particular SQL example is only a tiny glimpse of the many new possibilities.

  12. Sean Chase says:

    Cyrus, this is very cool stuff for sure! Honestly I share the same concern as Frans Bouma who has a really great O/R mapper product (LLBLGen Pro) that I’ve been using over the last several months for a client’s project. He has nailed an important issue: that most of the time there is either a logical or physical boundary between the data reader and data consumer. For example, imagine a common assembly to hold classes like Customer, Order and the UI calls a BL layer which both reference common.dll to use those entities. LLBLGen Pro uses an adapter to do this. I would absolutely love to see DLINQ provide this kind of model. Of course there is a market for the "unlayered" model that’s being demo’d today just like there is a market for people using the SqlDataSource control in ASP.NET with literal T-SQL values embedded right in the page…but that kind of approach is almost useless to me in most cases. Probably a better value for mort. 🙂

    Great post!

  13. kfarmer says:

    damien: "With LINQ, everything is strongly typed, and as far as I can see, all projections need to be fully specified. If a new property is added to a table, the programmer has to find all references to that table and update any projections in any queries it might participate in.

    Could be nasty.

    "

    IMHO, shouldn’t we consider this to be a view into the database? Just because a column is added to a table doesn’t mean we need to add it to the query. It’s up to the background objects to determine whether or not a column is required, and up to the developer to specify the ones he’s actually interested in dealing with.

  14. damien morton says:

    kfarmer: I take your point, but this is an example of a philosophical mismatch between a LINQ query and an SQL query.

    As a C# developer, I am comfortable with explicitly declaring all the properties of interest, however, SQL allows for those properties to be dealt with in bulk. Tutorial-D (Date’s alternative to SQL) allows more sophisticated operations on those bulk properties, such as S{ALL BUT CITY, NAME}.

    Operations for dealing with properties in bulk constitute a kind of dictionary algebra, with support for renaming, merging, and so forth. A general solution in this domain is messy, however you cut it.

    Im not sure this is a problem, but if it is, its a messy one.

  15. damien morton says:

    Jared:

    "Will that pattern-transforming capability be exposed to the programmer…?"

    Yes it will – if you assign a lambda expression to a variable of type Expression<T> you can then manipulate the expression tree to your hearts content.

    Not sure how select expressions are handled.
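    The Expression<T> remark above can be tried out with a small sketch.  It uses the System.Linq.Expressions namespace as it eventually shipped, which is an assumption relative to the preview being discussed, and the names ExpressionDemo, IsExpensive, and Describe are made up for the example.

```csharp
using System;
using System.Linq.Expressions;

public static class ExpressionDemo
{
    // The same lambda text, but stored as a data structure instead of compiled code.
    public static readonly Expression<Func<int, bool>> IsExpensive = cost => cost > 1000;

    // Walks the tree: the body is a binary node whose right operand is a constant.
    public static string Describe(Expression<Func<int, bool>> e)
    {
        var body = (BinaryExpression)e.Body;
        var right = (ConstantExpression)body.Right;
        return body.NodeType + " " + right.Value;
    }

    public static void Main()
    {
        Console.WriteLine(Describe(IsExpensive));   // GreaterThan 1000

        // The tree can still be turned back into an ordinary delegate.
        Func<int, bool> compiled = IsExpensive.Compile();
        Console.WriteLine(compiled(1500));          // True
    }
}
```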

  16. duncan says:

    In your code snippet:

    var q =
        from c in customers
        where c.City == "Seattle"
        select new {
            c.Name,
            Orders =
                from o in c.Orders
                where o.Cost > 1000
                select new { o.Cost, o.Date }
        };

    Why is there no ‘var ‘ needed before ‘Orders = ‘ ?

  17. SlyW says:

    duncan: Because it is inferred from the "select new { o.Cost, o.Date }" portion of the ‘subselect’.

    That is my understanding of it.

  18. kfarmer says:

    damien:

    I think that’s because the philosophy of LINQ isn’t *about* SQL databases: those are a side-attraction.

    I think Microsoft made a mistake in focussing so much on making the SQL folks comfortable that they’ve now got a very hard time ahead in explaining the differences between the two. Part of that mistake was in publishing C-omega: everyone fell in love with it, even though it had some serious troubles in dealing with embedded SQL. Now everybody wants C-omega, despite the lower degree of applicability.

    I don’t have my LINQ machine in front of me, but have you tried just selecting the item? Something like:

    from c in Customers
    where c.Name == "Bob"
    select new { Customer = c, IsBob = true }

  19. damien morton says:

    That query does work, I just tested it, but wont work in a regular database.

    I think that youre right about LINQ not being about SQL databases.

    A great part of our jobs as programmers is writing queries over our data structures, and the LINQ framework does a lot of that work for us.

  20. damien morton says:

    Oh cool!

    You can define your own relational operators, such as Where() specialised on types derived from IEnumerable<T>.

    This is a somewhat inane example, but compare Take() defined for IEnumerable<T> and an equivalent Take() defined for List<T>. All you need to do is ensure your new extension method is available and type-resolution will select the most specific overload.

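    // General version: walks the sequence and stops after 'count' elements.
    public static IEnumerable<T> Take<T>(this IEnumerable<T> source, int count)
    {
        if (count > 0) {
            foreach (T element in source) {
                yield return element;
                if (--count == 0) break;
            }
        }
    }

    // More specific overload for List<T>: indexes directly into the list.
    // The bound on source.Count keeps it from running off the end.
    public static IEnumerable<T> Take<T>(this List<T> source, int count) {
        for (int i = 0; i < count && i < source.Count; i++)
            yield return source[i];
    }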

  21. kfarmer says:

    I think you can also do something like:

    Expression<Func<double, double, double>> Between(this double x, double low, double high) = (double low, double high) => (low < x && x < high);

    var foo = Customers.Where(c => c.Age.Between(18, 24.5));

    Lambda’s are fun creatures to play with.

    Again, my machine’s at home…

  22. CyrusN says:

    Firedancer: "It really looks complicated to me. Given a choice, I would prefer not to learn a third language to integrate two languages. SQL can really be complex. Also, how do we represent OUTER JOINS or CROSSJOINS in LINQ? "

    That’s why we gave you a choice. You don’t have to use this if you don’t want to.

    "Looking at this, I will prefer the traditional SQL in Stored Procedures and access it via ADO.NET. Much easier to work on. Also, I think OR Mappers are easier to understand and more usable. Not sure about others. "

    Again, that’s why there’s a choice.

  23. CyrusN says:

    Damien: "Heres a question about DLinq:

    given a query, such as:

    var q =
        from c in customers
        where c.City == "Seattle"
        select new NameAndAge(c.Name, c.Age);

    Now, normally this is converted into an expression using extension methods from the System.Query namespace.

    What would happen if a completely different implementation of those extension methods was imported instead of System.Query. Is DLinq pluggable to that extent? "

    The "from" syntax has *nothing* to do with System.Sequence. It is just syntactic sugar that converts it to a pattern. i.e. .Where, .Select.

    So if you’re using some other types that define their own Where/Select methods, then you’ll be fine. The "from" comprehension will just end up binding to those.
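    The binding rule described above can be checked with a type that has nothing to do with IEnumerable<T>.  In the sketch below, Traced<T> is a made-up container with its own Where and Select methods, and System.Linq is deliberately not imported, so the only methods the query can bind to are the ones on the type itself.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container that does not implement IEnumerable<T>.
public class Traced<T>
{
    public readonly List<T> Items;
    public readonly List<string> Trace;   // records which operators were called

    public Traced(List<T> items, List<string> trace)
    {
        Items = items;
        Trace = trace;
    }

    public Traced<T> Where(Func<T, bool> predicate)
    {
        Trace.Add("Where");
        return new Traced<T>(Items.FindAll(x => predicate(x)), Trace);
    }

    public Traced<R> Select<R>(Func<T, R> selector)
    {
        Trace.Add("Select");
        return new Traced<R>(Items.ConvertAll<R>(x => selector(x)), Trace);
    }
}

public static class PatternDemo
{
    public static Traced<int> SeattleNameLengths(Traced<string> cities)
    {
        // Rewritten by the compiler to cities.Where(...).Select(...),
        // which binds to the instance methods on Traced<T>.
        return from c in cities
               where c == "Seattle"
               select c.Length;
    }

    public static void Main()
    {
        var trace = new List<string>();
        var cities = new Traced<string>(
            new List<string> { "Seattle", "Portland", "Seattle" }, trace);

        var q = SeattleNameLengths(cities);
        Console.WriteLine(string.Join(",", trace));    // Where,Select
        Console.WriteLine(string.Join(",", q.Items));  // 7,7
    }
}
```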

  24. CyrusN says:

    Damien: "a question on anonymous types…

    I kinda like this new notation, but only as a way of bringing together various query results. Clearly, these anonymous types arent usefull beyond the scope in which they are created, except as a data-holding class that can be reflected over.

    It would be usefull to be able to constrain these anonymous types to be immutable and/or structs.

    "

    Right now we’re just showing a preview of what we’re working on. There are currently limitations in place in that preview (like not being able to pass "var" out of a method). We’re actively investigating different approaches to this problem (such as being able to define anonymous types easily so that you can pass them around).

  25. Ruben says:

    Is it just me, or would the XQuery-like syntax

    var q = for c in customers

    ……. where c.City == "Seattle"

    ……. select c.Name;

    not be more C#-ish? I could even go for

    var q = for c in customers

    ……. where c.City == "Seattle"

    ……. return c.Name;

    But that would probably be a little confusing. Replacing from with for saves you a keyword, and could silence all the complaints about wanting SELECT FROM instead of FROM SELECT.

  26. CyrusN says:

    Ruben:

    "Is it just me, or would the XQuery-like syntax

    var q = for c in customers

    ……. where c.City == "Seattle"

    ……. select c.Name;

    not be more C#-ish? I could even go for"

    That would be fine.

    "var q = for c in customers

    ……. where c.City == "Seattle"

    ……. return c.Name;"

    I like that less. The return would be extremely confusing.

  27. kfarmer says:

    How about

    "for c in customers

    where c.City == "Seattle"

    yield c.Name;"

    Yield fits in with the generator idea that this is making use of.

    One might also use foreach instead of for, if an unambiguous syntax can be had in that case.

  28. Mark Blomsma says:

    The scope of ‘var’ obviously needs to be greater than just within a method.

    Some queries will get to be very long. How can you refactor this if your scope is limited?