Building a LINQ IQueryable Provider – Part XII


This is the twelfth in a series of posts on how to build a LINQ IQueryable provider. If you have not read the previous posts you probably were born yesterday. How could you possibly make sense of this post without any context at all?  At least make an attempt. Sometimes I don’t know why I bother.


Complete list of posts in the Building an IQueryable Provider series 


<Insert standard disclaimer for why there have been no updates on this for ‘like’ forever>


It’s been so long since I last posted we must be up to Web 5.0 by now. I suspect there are not any actual web developers left. Its all just twitter application programming and cybernetic mind melds now. Does anybody still write code? In a text based language? Without an electric shunt wired directly into your cerebral cortex?  Nobody?


Sigh. I’m probably just talking to an empty internet; everyone’s moved on to Planck Space. But since I’ve got nothing better to do, I might as well get on with it.


IQueryable Toolkit


One of the first things you’ll notice when you take a look at the source is that I changed it quite a bit.  I moved code around, changed names gratuitously, added & removed classes and broke a lot of continuity with the prior versions. One of the biggest changes is that the code is no longer just a sample.  All those internal classes are public, the project builds as a DLL, the tests are hosted separately and the namespace is no longer ‘Sample.’  Why, oh why would I do this?  Simple.


I originally started the project to give developers a structured hands-on walk through of the construction of an IQueryable provider; hoping to inspire many of you to build your own. (Many of you have.) The source code was made available so you could learn from it and as a cheap way to get you to debug it for me. (Many of you haven’t.) And so you could take from it what you would, copy a class here, steal a method there, etc, to make your job easier to get your own LINQ to XYZ up off the  ground quicker.


Yet, I received so many requests to re-use the entire code base wholesale, that I now realize what you really want is a toolkit not just some sample code. As a toolkit it is still sample code. The sources are still there and you can take from it what you will.  Or you can build it into a shiny DLL and then build your code on top of it.  All the pieces I’ve built so far are publicly re-usable. Many of the classes have been enhanced for extensibility, so you can mix & match and override to your hearts content to build the system you want out of the pieces I already have.


How it breaks down


At the top there is a namespace ‘IQ’.  I thought that it was cleaver & consise. Maybe someone out there will come up with a better one, something catchy, like a cold, and I’ll feel compelled to catch it, fall under-the-weather for a few weeks and emerge blissfully sedated enough to go with it. Maybe. Under the ‘IQ’ namespace exists all the bits and pieces that are generally useful for any IQueryable impementation.  This is where you’ll find the basic ExpressionVisitor class, the generic Query and base QueryProvider class.  You’ll also get these:



ExpressionComparer – compares two expression trees for equality. immensely useful.


ExpressionWriter – Get a better translation of that narly expression tree into a  C#-ish syntax that you can actually read.  Helps debugging a lot.


Grouping – It’s that dead simple implementation of the LINQ IGrouping interface. 


CompoundKey – A class that helps you represent compound key values.


ScopedDictionary – ‘cuz I needed it and now you can have it too.  Works sort of like a Dictionary but with nested scopes.


TypeHelper – Used to be called TypeSystem (how pretentious of me).  Helps you know a few things about types.


DeferredList – An implementation of IList that enables deferred loading.  (Bells are probably ringing in your head about now.)


Nested inside this namespace is another one, called simply ‘Data’  (I know, original, huh). In here you’ll find all those other classes that made up most of my previous posts like the provider itself, DbQueryProvider, that works on top of any DbConnection.  To really understand what has gone on in my head you need to get into this class and look around.


Logically, it still does the same thing.  It primarily just implements the IQueryable.Execute method, converting query expressions into SQL commands, executing them using the DbConnection and translating the results into objects. That’s all still there, its just been re-arranged a bit and made a lot more extensible.


How it is extended is the interesting bit that flavors everything else to come. The DbQueryProvider’s brains are now fully pluggable. I took a look at what was going on during execution and broke it down into logically atomic steps. The prior version used to first translate the query, build a generic reader and then execute and return.  That’s still sort-of what happens, but there’s an even better break-down.


Queries are translated, but the translation goes through three distinct phases: 



1) mapping is applied


2) policy is enforced


3) the target query language has the last say. 


Each of these steps is pluggable.  How?  There are three new classes to get acquainted to.



QueryMapping – Defines information & rules to map an object model onto a database model


QueryPolicy – Defines information & rules to determine inclusion of related data & execution plans


QueryLanguage – Defines information & rules to adhere to a target language, including converting the query into text


Every DbQueryProvider is now supplied these three at runtime, each can be overriden by you to take control at different points in then translation and execution process. The mapping, policy & language can each override a portion of the query translation pipeline, injecting its own rules as rewrites of the query expression using the common DbExpression primitives provided.


In addition, the QueryLanguage controls the final translation of the tree to SQL text (or whatever target language is being used), and the QueryPolicy is invoked to build the execution plan.  Note, the execution plan is not the simple object-reader of yore.  It is now an entire runtime generated program for completely executing the database query & constructing objects out of the results.  The QueryPolicy is allowed to do whatever final rewrite is necessary to turn the translated query expression into an executable piece of code.


Of course, by default, these three mostly do what was being done before, and if you care to look you’ll see that the QueryLanguage’s Format method (for turning queries into text) simply defers to the TSqlFormatter class (which used to be called QueryFormatter.)  You can overidde this method and have it call your own formatter.  You can even override the TSqlFormatter and make a few minor changes if your back end SQL is largely the same.


If you want to implement a different strategy to retrieving hierarchical results, you can inject your own logic both during query translation and you can take over the entire process of building the execution plan.  The default BuildExecutionPlan on the QueryPolicy class defers to the ExecutionBuilder (used to be ProjectionBuilder in its former life), but you can change that and do it your own way.  Of course, all the source is available, so you can build your version based off of the one that exists in the toolkit.


The crazy logic for inferring mapping information by simply using the names of types and members is retained, but walled off in a specific implementation of mapping called the ImplicitMapping.  You can re-use this one if your needs are as meager as my demos.  Or you can build your own.


Of course, now that you’ve had a moment to think about it, given so much flexibility, you’ll inevitably start asking about what database providers are supported.  (Still just TSQL.)  You’ll also want to know what complex mapping models I’ve invented and how they compare to ones used by ORM’s.  (Still just the implicit demo one.)  And then you’ll want to know what I did to improve reading data hierarchies because that N+1 strategy is just a loser.  (Here’s where you finally get something out of me.)


Relationships and Loading


The query translation finally understands relationship properties. The QueryMapping understands the abstraction of mapping properties, these can map to columns or relationships (like associations) or whatever you want.  The mapping decides what’s possible.  What have I implemented?  The basic mapping understands association relationships only; those kinds of relationships that are made via a join matching one or more columns across two tables.


The interesting part (at least to me) is how relationships are dealt with during query translation and later during execution.  Does a one to many relationship property residing in the output projection lead to extra queries at execution time or not?  If so, how many? 


I’ve implemented a few rewriters that deal with relationships.  For example, the SingletonProjectionRewriter finds projections of singleton relationships (one to one or many to one) and folds them into the query itself using a server-side join.  A singleton query doesn’t change the cardinality of the results, so it is generally safe.  Yet, what happens when you refer to the same singleton relationship property more than once? Do you get the same join over and over again? To hold back the flood of redundant joins I had to come up with another rewriter, one that would find the duplicate uses and get them to place nice together, but because this situation can happen more often than just as the side effect of doing the singleton rewrites, I wrote it to work on joins and not just nested projections. The RedundantJoinRemover looks for identical joins expressed in the same query and throws out all the extra ones.  It does not look through into sub-queries, so its best to first removed redundant sub-query layers using the RedundantSubqueryRemover in hopes of forcing as many joins into the same layer.


The ClientJoinedProjectionRewriter attempts to do something very different. It looks to convert nested projections (ones that would become extra queries per object at execution time) into client-side joins. So you’ll get one join per included relationship. This is not the “Big Join” strategy that LINQ to SQL uses. It precisely the strategy that I still consider inferior if I’m asked to ensure correctness of the results. So why did I choose to implement it? I’m not being asked to insure the correctness of results, nor am I being asked to ensure that large result sets can stream back from the server. I assume that the normal degree of ambiguity of results you get any time you attempt to get related results via more than one query is okay in this case too. I also assume that a user will choose to employ a transaction to get better consistency. You can also disable this one easily if you want. You can implement a policy that uses one or more other strategies.


This is what the ClientJoinedProjectionRewriter actually does.  If the SelectExpression of a nested projection is constrained to the outer query via an equality comparison, like this column equals that parameter and so on I figure I can easily turn this into a client-side LINQ to object style join.  So I rewrite the query so it restates all the significant bits of the queries that came before it.  So if you wanted to retrieve customers in London and all their orders you’ll get one query for the customers in London and then another query for all the orders for all the customers in London. The idea is that the second query is actually run first and all the objects are stuffed into a lookup table. When the query for customers executes, it picks up the matching orders right out of the lookup table.  Neat.


Taking it for a Spin


Given a simple query for customers in London:

var query = from c in db.Customers
where c.City == “London”
select c;

And, assuming I have a policy that tells the provider that the Orders property on customers is supposed to be included whenever I ask for customers.  (The TestPolicy used by the supplied test project lets me pick and choose properties by name.)


When I start out, the provider first sees the query as LINQ expression tree:

Query(Test.Customer).Where(c => (c.City = “London”))


After mapping the query expression now looks like this: 

Project(
@”SELECT t0.City, t0.ContactName, t0.Country, t0.CustomerID, t0.Phone
FROM [Customers] AS t0
WHERE (t0.City = ‘London’)”
,
new Customer() {
City = A0.Column(“City”),
ContactName = A0.Column(“ContactName”),
Country = A0.Column(“Country”),
CustomerID = A0.Column(“CustomerID”),
Phone = A0.Column(“Phone”)
},
p => Queryable.AsQueryable(p)
)

It’s got the SelectExpression (formatted by default as TSQL), followed by an expression that constructs a customer out of the query’s result columns, and function that converts the presumed resulting IEnumerable<Customer> into the expected type. 


Next the QueryPolicy is asked to translate, so it applies rules like determining if a relationship property is included in the output or not.  Since I set up the policy to automatically include orders, I get a new expression that looks like this:

Project(
@”SELECT t0.City, t0.ContactName, t0.Country, t0.CustomerID, t0.Phone
FROM [Customers] AS t0
WHERE (t0.City = ‘London’)”
,
new Customer() {
City = A0.Column(“City”),
ContactName = A0.Column(“ContactName”),
Country = A0.Column(“Country”),
CustomerID = A0.Column(“CustomerID”),
Phone = A0.Column(“Phone”),
Orders = Project(
@”SELECT t0.CustomerID, t0.OrderDate, t0.OrderID
FROM [Orders] AS t0
WHERE (t0.CustomerID = t2.CustomerID)”
,
new Order() {
CustomerID = A1.Column(“CustomerID”),
OrderDate = A1.Column(“OrderDate”),
OrderID = A1.Column(“OrderID”)
},
p => Enumerable.ToList(p)
)
},
p => Queryable.AsQueryable(p)
)

Now I’ve got two queries, one nested inside the other.  If nothing happens to this, the inner query will get executed once per row in the outer query results.


Next, the QueryPolicy also applies the the ClientJoinedProjectionRewriter.  It attempts to convert nested projections into a query to fetch all the data (that will be executed once per outer query) with the resulting objects joined on the client machine.

Project(
@”SELECT t0.City, t0.ContactName, t0.Country, t0.CustomerID, t0.Phone
FROM [Customers] AS t0
WHERE (t0.City = ‘London’)”
,
new Customer() {
City = A0.Column(“City”),
ContactName = A0.Column(“ContactName”),
Country = A0.Column(“Country”),
CustomerID = A0.Column(“CustomerID”),
Phone = A0.Column(“Phone”),
Orders = ClientJoin(
OuterKey(A0.Column(“CustomerID”)),
InnerKey(A1.Column(“CustomerID”)),
Project(
@”SELECT t0.CustomerID, t1.Test, t1.CustomerID AS CustomerID1, t1.OrderDate, t1.OrderID
FROM [Customers] AS t0
OUTER APPLY (
SELECT t2.CustomerID, t2.OrderDate, t2.OrderID, 1 AS Test
FROM [Orders] AS t2
WHERE (t2.CustomerID = t0.CustomerID)
) AS t1
WHERE (t0.City = ‘London’)”
,
Outer(
A1.Column(“Test”),
new Order() {
CustomerID = A1.Column(“CustomerID1”),
OrderDate = A1.Column(“OrderDate”),
OrderID = A1.Column(“OrderID”)
}
),
p => Enumerable.ToList(p)
)
)
},
p => Queryable.AsQueryable(p)
)

Now I have an embedded ClientJoin node.  Notice how the inner query has a join so that it will retrieve all the orders for all the customers in London now.  Unfortunately, its an OUTER APPLY join when it does not really need to be.  But don’t fear.  The QueryLanguage will soon fix it.  The combined query also gets a new column ‘Test’. This is used to determine if the join found at least one matching row or not. If no match was found instead of the value 1 the ‘Test’ column will be null. I can use this information to make sure the resulting collection for a non-match is empty.


Next the QueryLanguage get its shot at translating the tree.  When it does, it attempts to get rid of any APPLY node if at all possible; OUTER APPLY’s become LEFT OUTER JOIN’s and CROSS APPLY’s become INNER JOIN’s.

Project(
@”SELECT t0.City, t0.ContactName, t0.Country, t0.CustomerID, t0.Phone
FROM [Customers] AS t0
WHERE (t0.City = ‘London’)”
,
new Customer() {
City = A0.Column(“City”),
ContactName = A0.Column(“ContactName”),
Country = A0.Column(“Country”),
CustomerID = A0.Column(“CustomerID”),
Phone = A0.Column(“Phone”),
Orders = ClientJoin(
OuterKey(A0.Column(“CustomerID”)),
InnerKey(A1.Column(“CustomerID”)),
Project(
@”SELECT t0.CustomerID, t1.Test, t1.CustomerID AS CustomerID1, t1.OrderDate, t1.OrderID
FROM [Customers] AS t0
LEFT OUTER JOIN (
SELECT t2.CustomerID, t2.OrderDate, t2.OrderID, 1 AS Test
FROM [Orders] AS t2
) AS t1
ON (t1.CustomerID = t0.CustomerID)
WHERE (t0.City = ‘London’)”
,
Outer(
A1.Column(“Test”),
new Order() {
CustomerID = A1.Column(“CustomerID1”),
OrderDate = A1.Column(“OrderDate”),
OrderID = A1.Column(“OrderID”)
}
),
p => Enumerable.ToList(p)
)
)
},
p => Queryable.AsQueryable(p)
)

Now the query expression is fully translated we only have left to build the execution plan.  The QueryPolicy is asked to turn the expression into a program that will do the actual execution.

Invoke(
null,
lookup1 => ((IQueryable<Customer>)ExecutionBuilder.Sequence(new Object[] {
ExecutionBuilder.Assign(
lookup1,
Enumerable.ToLookup(
Enumerable.Where(
((DbQueryProvider)Query(Test.Customer).Provider).Execute(
new QueryCommand<KeyValuePair<String,Order>>(
@”SELECT t0.CustomerID, t1.Test, t1.CustomerID AS CustomerID1, t1.OrderDate, t1.OrderID
FROM [Customers] AS t0
LEFT OUTER JOIN (
SELECT t2.CustomerID, t2.OrderDate, t2.OrderID, 1 AS Test
FROM [Orders] AS t2
) AS t1
ON (t1.CustomerID = t0.CustomerID)
WHERE (t0.City = @p0)”
,
new String[] {“p0”},
r1 => new KeyValuePair<String,Order>(
r1.IsDBNull(0)
? null
: ((String)Convert.ChangeType(
r1.GetValue(0),
System.String
)),
r1.IsDBNull(1)
? null
: new Order() {
CustomerID = r1.IsDBNull(2)
? null
: ((String)Convert.ChangeType(
r1.GetValue(2),
System.String
)),
OrderDate = r1.IsDBNull(3)
? new DataTime(“1/1/0001 12:00:00 AM”)
: ((DateTime)Convert.ChangeType(
r1.GetValue(3),
System.DateTime
)),
OrderID = r1.IsDBNull(4)
? 0
: ((Int32)Convert.ChangeType(
r1.GetValue(4),
System.Int32
))
}
)
),
new Object[] {((Object)“London”)}
),
kvp => kvp.Value != null
),
kvp => kvp.Key,
kvp => kvp.Value
)
),
Queryable.AsQueryable(((DbQueryProvider)Query(Test.Customer).Provider).Execute(
new QueryCommand<Customer>(
@”SELECT t0.City, t0.ContactName, t0.Country, t0.CustomerID, t0.Phone
FROM [Customers] AS t0
WHERE (t0.City = @p0)”
,
new String[] {“p0”},
r0 => new Customer() {
City = r0.IsDBNull(0)
? null
: ((String)Convert.ChangeType(
r0.GetValue(0),
System.String
)),
ContactName = r0.IsDBNull(1)
? null
: ((String)Convert.ChangeType(
r0.GetValue(1),
System.String
)),
Country = r0.IsDBNull(2)
? null
: ((String)Convert.ChangeType(
r0.GetValue(2),
System.String
)),
CustomerID = r0.IsDBNull(3)
? null
: ((String)Convert.ChangeType(
r0.GetValue(3),
System.String
)),
Phone = r0.IsDBNull(4)
? null
: ((String)Convert.ChangeType(
r0.GetValue(4),
System.String
)),
Orders = Enumerable.ToList(lookup1.get_Item(r0.IsDBNull(3)
? null
: ((String)Convert.ChangeType(
r0.GetValue(3),
System.String
))))
}
),
new Object[] {((Object)“London”)}
))
}))
)


Now we’re cookin’ with GAS!  This is the exact LINQ expression that can be run to execute the query and produce the results.  Notice how the code directly calls the DbQueryProvider.Execute method that takes the command text, parameters and a function that converts a DbDataReader into a sequence of objects.  That function is described inline with the rest of the LINQ expression.  Yes, you can do that.  That’s how all those other LINQ operators work.


Also notice how I cheat with expression trees.  The process to create the lookup table requires variables, assignments and statement sequences; all things not currently possible with expression trees — or so you thought.  The variable is really a parameter to a lambda expression.  If you directly invoke an inline lambda expression you basically create a variable that is in scope over the body of the lambda.  Next I call a method on ExecutionBuilder called Sequence.  This lets me simulate having statement sequences.  All the expressions are evaluated in order because the results are turned into an argument array.  The Sequence method simply returns the last value in the array.  The last cheat is the assignment.  While there is no assignment operator in the expression tree (at least not yet), it is possible to pass a parameter to a method as a by-ref argument.  Yes, this is ugly and causes side-effects, but that’s exactly what I wanted.  (Until we get support for statements sometime in the future!)  The ExecutionBuilder Assign method take a ref argument and a value and simply assigns the value to the ref argument. 


When the query is finally executed, all the data is retrieved up front in only two queries.

SELECT t0.CustomerID, t1.Test, t1.CustomerID AS CustomerID1, t1.OrderDate, t1.OrderID
FROM [Customers] AS t0
LEFT OUTER JOIN (
SELECT t2.CustomerID, t2.OrderDate, t2.OrderID, 1 AS Test
FROM [Orders] AS t2
) AS t1
ON (t1.CustomerID = t0.CustomerID)
WHERE (t0.City = @p0)
— @p0 = [London]

SELECT t0.City, t0.ContactName, t0.Country, t0.CustomerID, t0.Phone
FROM [Customers] AS t0
WHERE (t0.City = @p0)
— @p0 = [London]


 


Well that’s it then.  Kudos, if you’ve read this far you really are trying to understand how to build your own IQueryable provider.  Either that or you are up late at night with insomnia trying to find something softer than a hammer to put yourself to sleep with.


Have fun with the new code!


 


Note: all sources are made available under the Microsoft Public License (MSPL)

Query12.zip

Comments (17)

  1. This is the twelfth in a series of posts on how to build a LINQ IQueryable provider. If you have not

  2. Pop Catalin says:

    This is more than awesome, I was expecting this since I saw your video about Linq on channel 9. Thank you.

  3. Frans Bouma says:

    Nice to see you blogging again 🙂

    I wondered why you would execute the orders first and the customers second. The thing is that if you execute the customers first, you can create a query for the order like:

    select .. from orders where customerid in (@cid1, @cid2…, @cidn);

    given a threshold: if the # of parents * # of pk fields is below a threshold, you can use that query, otherwise you revert to an exists query (instead of a join). this is more efficient. The side effect is that you have to create hashes yourself and merge the child objects within the parent objects using own hash finders, but that’s not that hard to do (and if you really want, you can use linq to objects joins though I’m not sure about the performance of that related to hash-matching.)

    Haven’t looked at the code yet, did you include tree rewriting for nested groupjoin + defaultifempty + where clauses as well? That’s one tar-pit I still have problems with (it works, almost always, but edge cases still fail).

  4. Andy Pook says:

    Firstly, I really appreciate you putting the huge amount of time and effort into putting this together.

    I had been feeling pretty guilty in not providing any feedback. So here you go…

    I’ve needed to make a change in order to get the tests to run.

    DbQueryProvider was throwing a "There is already an open DataReader associated with this Command which must be closed first." exception. I’m unsure how you were able to avoid this.

    I’ve changed this method to dispose of the cmd and reader. Now all tests pass. Here’s my version (the side effect is that the Project method is no longer used).

    public virtual IEnumerable<T> Execute<T>(QueryCommand<T> query, object[] paramValues)

    {

    using (DbCommand cmd = this.GetCommand(query.CommandText, query.ParameterNames, paramValues))

    {

    this.LogCommand(cmd);

    using (DbDataReader reader = cmd.ExecuteReader())

    {

    while (reader.Read())

    yield return query.Projector(reader);

    }

    }

    }

  5. Chris Cyvas says:

    Matt,

    I’ve really enjoyed this entire series and I’ve learned a lot from it. Thanks so much for putting it together – yes, I was one of those folks who has been using it wholesale in a project of mine – but I think this adds a significant and helpful dimension to the project as a whole. Well done.

    Chris

  6. Chris Cyvas says:

    Ok, now that I’ve downloaded this and looked at a few key areas for me – I just want to say – FREAKIN’ SWEEEET!

    Thanks a ton! This is awesome!

  7. Hi, Matt,

    Welcome back!

    And thanks for packaging your classic IQuerable implentation into the IQToolkit.

    –rj

  8. mattwar says:

    Frans,

    I execute the query for the orders first so that the function that constructs customers can be fully specified.  Unless I invent some other magic the only way to assign a value to a property in an experssion tree is during object construction.  If I were to query for customers first, I’d have to read all the customers in order to figure out the ID’s and then construct the query for orders, and then go back an assign orders to their already loaded customers. It is certainly doable.  It would just take more time.  Maybe I’ll try that next, or maybe one of you will and I’ll add it to the toolkit.

  9. mattwar says:

    Andy,

    Your solution to disposing the DataReader is great.  I’m not sure why you are getting the errors though.  If the DataReader is fully read is diposes of itself automatically.  I also am using the MultipleActiveResultSets property in the connection string for SQL Server and that may mask a problem.  

    However, by putting the yields into the same method body as the call to ExecuteReader you have turned the execute method in a deferred execution method.  If you change the Orders property on Customer to IEnumerable<Order> from List<Order> you’ll actually get a deferred query assigned to the property now.  That may be cool, but it is unintended.  That’s why the DeferredExecute method exists, so the ExecutionBuilder can choose between them based on your policy.  If you switch back to the use of the Project method and simply put you using inside it instead you’ll get to the correct behavior.  

  10. Frans Bouma says:

    Matt,

    The population indeed has to be postponed till all children are populated (recursively), though it would offer a great optimization option, especially in multi-branched graphs with multiple levels, because you can avoid fetching subgraphs/branches if there are no parents.

  11. Andy Pook says:

    I’ve got Northwind configured in a server rather than using a path based connectionstring. I’d missed out the MARS part. Reverting to code and adding MARS "fixed" the problem.

  12. mattwar says:

    Frans,

    In some circumstances you are correct. A better argument, though, is that by using a ‘keyset’ technique (which I’ve labeled the one you describe) you avoid the large joins that occur when you have deep nesting, which causes excessive read locks in the database.  

  13. kfarmer says:

    WQF — Wayward Query Foundation

    🙂

  14. Molden Molsen says:

    What a way to start the day. I haven’t read the earlier posts, This issue is good material to follow-up with

  15. nefimov says:

    Matt, in this toolkit i doesn’t find realization of the GroupJoin operator (left outer join in SQL terms). May be i missed something?

    Now, i’m trying to solve this problem, so if it’ll be interesting for you, i could send my solution.