DLinq (Linq to SQL) Performance (Part 5)

This posting is the last of what I had planned in this series but I think there are likely to be questions, especially when Orcas Beta 2 is more widely available so we're likely to talk about this some more.

First let's talk about the result I got and what it means.  On my particular hardware I was able to achieve 93% of the throughput of the underlying provider. 

 

How did I do this?

[7/17: Note: I just realized this could be read not how I intended.  I mean 'how I did this' from the perspective of how the benchmark is built and how is it that the benchmark could be expected to achieve such a result. I did not write *any* of the Linq code myself, I only gave them some ideas to help improve this result.  Matt didn't ask me to clarify this but he would have been more than justified if he had. :) ] 

Well think about it, even though my test case is designed to magnify Linq overheads (because there is no business logic to do anything with the data and the data is all local and in the cache) what we've done by compiling the query is basically to remove most of the Linq overhead entirely.  It's almost cheating except for you can do it too, and in meaningful, cases.  In fact, arguably in the most common and important cases where you most need it you'll be able to get the best performance.

 

What are the major steps in running a query?

  1. The expression tree that represents the query has to be created (var q = from etc...)
    • You could avoid this even if you don't compile the query by just saving your query somewhere reusable
    • If you weren't using Linq you'd have to just execute a command with parameters
  2. The expression tree has to be converted to SQL
    • The query can have parameters which in turn appear in the generated query
    • The actual parameters are applied when the query is executed
    • We make this a one-time cost by compiling the query once and saving the resulting SQL
  3. The query has to execute
    • The cost of this step is the same with Linq and without since the query looks very similar
    • SQL server can use the same plan for the query since it looks the same each time
    • This cost has been minimized in this test case by running the same query and keeping the database local
  4. The results of the query have to be turned into objects
    • In the May CTP this was done with reflection, however, now this is done by creating a custom method with light-weight code generation that does the object creation and data copying exactly the way you would do it by hand
    • Since the columns that come back from the query are the same regardless of values of any parameters in the query this custom method can be re-used
    • Instead of paying a cost to create the method every time you pay it once
    • You still have to pay the cost of invoking this method via a delegate on each row, if you do it manually that code is inline so there is no function
  5. Whatever you do with your data
    • In this example it's basically nothing (one add operation)

 

So what's left?  

We made #1, #2 and #4 one-time-costs.  #3, and #5 we have to do in both cases no matter what.  So what's the overhead?  Not a whole lot, some checks to make sure we really can use the saved versions of everything and then the cost of calling the delegate.  The reason the cost is as high as 7% is because so little is happening in steps 3 and 5.  In real cases those steps would tend to be the bulk of your cost.

When can you get this benefit?  Any time you are running the same (parameterized) query -- which is very often.  Just as prepared statements and stored procedures are common/popular in regular SQL compiled queries should be popular in Linq to SQL.  The most critical queries of your application can probably be compiled.  Others might not be worth the hassle.

 

What about those insert and update cases?

In the Linq world, the update looks like a select, some data changes, followed by an update.  I used the very same select statement and I arbitrarily updated the first dozen or so rows with a trivial update (I added 1s to a date field). 

             var fq = CompiledQuery.Compile
                (
                    (Northwinds nw) =>
                            from o in nw.Orders
                            select o
                );

            int i = 0;
            for (; i < updateruns; i++)
            {
                using (Northwinds dc = new Northwinds(conn))
                {
                    int j = 0;

                    foreach (Orders o in fq(dc))
                    {
                        if (j++ > updatebatch)
                            continue;

                        o.OrderDate = o.OrderDate.Value.AddSeconds(1);
                    }

                    dc.SubmitChanges();
                }
            }

Note that I reported times for both the compiled case and the non-compiled case.  That's because you can either compile the select part of the update or not.  Depending on the frequency of actual updates you might find it worthwhile or not.  Again this is a very dumb example designed to magnify Linq overheads.

 

Why did I get such outstanding performance?

The alternative code looked like this:

     
    // also runs in a loop updateruns times, not show here

    StringBuilder sb = new StringBuilder();

    while (dr.Read())
    {
        OrderDetail o = new OrderDetail();

        ... populate the fields...

        if (j++ > updatebatch) // updatebatch size was 10 in my test case
            continue;

        o.ShippedDate = o.ShippedDate.Value.AddSeconds(1);

        sb.AppendFormat("update Orders set ShippedDate = '{0}' where OrderID = {1}\r\n", 
               o.ShippedDate.ToString(), 
               o.OrderID);
    }
 




    // execute the query in the stringbuilder

That is all pretty simple stuff.. it's actually cheating a little because Linq to SQL will make sure that the data hasn't changed since it was read and I don't bother with that. However all of this is trumped by the fact that I didn't bother using prepared statements (but I did executed my updates in one batch) and Linq to SQL automatically made a prepared statement for doing the update and as a result SQL was able to process it better.

 

Could you do that yourself?  

Yes.  Would you?  Maybe.  Or you might use a stored proc to do the update for you.  At that point my guess is that you would break even as you'd be back to doing exactly what Linq to SQL does.  Isn't it strange that we're talking about what you have to do to the no-Linq case in order to get the speed you get with Linq by default?  I think that's a good sign.  But see the overall conclusions at the end.

 

What about the Insert case?

The insert test case gets its performance boost basically the same way except it's an batch of insert statements rather than updates and of course there is no select. 

     using (Northwinds dc = new Northwinds(conn))
    {
        for (int j = 0; j < insertbatch; j++) // 10 items
        {
            Categories cat = new Categories();
            cat.CategoryName = "dummy_category" + j.ToString();
            cat.Description = "Description... Description... ' Description...Description...Description...Description...";

            dc.Categories.Add(cat);
        }

        dc.SubmitChanges();
    }

The batch insertion code in my test case (without Linq) looks nearly identical to the update case.

         sb.AppendFormat("insert into Categories (CategoryName, Description) values('{0}', '{1}')\r\n",
            "dummy_category" + j.ToString(),
            "Description... Description... Description...Description...Description...Description...");

So you can expect the Linq version will perform better for the same (artificial) reason as in the update.

 

What's my final word?

On selects you can pretty much make the Linq overhead vanish if you have a repeatable query pattern, which is a pretty common thing.  That's great news.

On inserts and updates, my test cases weren't especially great and the main thing they illustrate is that good connection management and prepared statements dwarf the other costs in simple insert/update cases.  The good news there is that Linq gives you both for free.

Despite the fact that I'm pretty handy with SQL it took me a LOT longer to write the no-Linq version of these tests and I'd much rather maintain those than the reverse.

None of the times I reported have anything to do with what actual applications with normal latencies and data logic will experience.  In those cases the results show that you're likely to see little to no difference between using linq and not using linq (if you compile etc.) which is also a great result.

Overall I'm very pleased.  I hope you will be too.