Introduction and Disclaimer
Regular readers of my blog are already familiar with my goal to provide brief and useful information that is approximately correct and that illustrates some key truths. Most of the time my articles are not authoritative and that is especially true in this case. I am certainly not an Official Microsoft Authority on databases and data systems, I just have a good bit of experience in that area, and I wanted to convey some things I learned that I thought were important, and that I’ve never seen assembled as a whole before, so I’ve written this article. This article uses Linq to SQL for its examples but I think it is actually more broadly applicable, with due caution.
Performance in Many Tier Systems
Again if you read my blog you’ll know that I always talk about the importance of measuring performance in context. This is especially important in systems with multiple tiers because making a choice in one tier can profoundly impact the consequences in other tiers. For instance, a client-side cache might take a lot of load off of the middle tier. Sounds great right? Oops, you of course remembered that if you cached the contents on the client then they are going to be at least a little of out of date right? You’re ok with that? Or did you add a scheme to periodically recheck validity. Oops, now you have more traffic to the middle teir again. No, wait, you have a nice periodic discard policy where you only keep local data for a known interval? Ah yes, but now your cache contents aren’t necessarily self-consistent because they were fetched with different queries at different times and unified to create the cache. Does it ever end?
There is a dance here, and it is a complicated one. Only by understanding how all the dancers play together in your system can you truly create systems that have solid correctness characteristics and good performance. And it’s a bad idea to look at the performance of any one of the dancers independently.
I break it down into a fairly small list of considerations, and these are of course deeply entangled.
- Unit of Work
You might wonder why I’m not mentioning things like network, schema, and so forth… It’s because I think about them in the context of the bigger phenomena. And maybe you’re wondering what this could possibly have to do with Linq to SQL but don’t worry we’re going to get there by the time we talk about unit of work.
Even when we’re not talking about databases I often say “Locality is everything” and it’s possible that it is even more true in the database world (don’t nitpick me on “degrees of truth”, it’s only a figure of speech ^_^). When you’re talking about a data system, bad locality generally translates directly to more disk operations which in turn translate into torpedoed performance.
What do we do about it? Two big things: Schema and Indexes. Maybe that’s really two sides of the same thing but let me separate those concerns for a moment.
We create schema in a manner that is consistent with the data we intend to represent and so that logically related concerns will tend to be physically together. If we normalize well we also tend to find that a single logical fact tends to be represented in a single physical location. All of this bodes well for locality.
But now we’re faced with a problem. One organization of the data is frequently not enough to support all the operations that we intend to perform on the data. If we want to look up only by (e.g.) ID, one ordering is fine but the moment that we want to look up also by (e.g.) surname then we need to organize the data in more than one way. That is where indexes come in to play.
Trust me, I’m going somewhere with all of this.
To get multiple orderings of the same data you could create multiple tables with the same data arranged differently. That would work but it would be highly inconvenient as you would always have to be choosing the flavor of the data to access when you queried the data and you would have to update multiple copies of the data whenever you made changes.
Indexes let you automate this. Indexes are your way of telling a database that you want multiple copies of your data, ordered differently, and any time you do an operation that updates the main copy you want all of the alternate copies also updated automatically in the same transaction.
This is actually a profound statement. Remember we added indexes because there were some questions we could not answer without looking at a lot of data (bad locality) and the price we pay is that now when we make changes to one spot in the data we actually have to propagate that change to many places atomically (reduced locality).
In the final analysis an index isn’t much more than a second, or third, etc. copy of the table with some of the columns reordered and some removed entirely.
Locality means we get just the data we need, that it’s organized in such a way that we do not have to sift though vast volumes of data to get to what we want, and that we can make surgical changes to the data in support of the things our system needs to do as it runs. Good locality translates directly into good performance.
Isolation is a notion that is cloaked in mystery, perhaps unnecessarily. I’ve met many developers that know what a transaction is but far fewer that know about isolation levels and fewer still that understand how these things are entangled. In short, isolation is about giving every program using a data system the illusion that it is the only user. Levels of isolation, roughly, describe how imperfect that illusion is going to be.
OK if you’re still with me at this point you must be wondering what any of this could possibly have to do with performance and even more wondering why on earth I am now talking about Isolation, a concept that is understood even less well than Locality.
I’m glad you asked
The first thing you should know is that the better your data locality the easier it is to create the illusion of isolation. The simplest case is two clients looking at two completely unrelated sections of a database – they have no overlap in their operations whatsoever and so isolation is easy. Now of course the more data you access the greater the likelihood that you will overlap with someone else and some isolation technique is going to be necessary to preserve the illusion. If your data has great locality then you’ll be able to make nice tight queries minimizing your isolation needs.
The second thing you should know is that maintaining isolation has a cost, like everything else. Depending on the technique used it can have both direct costs and indirect costs. But that’s really abstract so let’s be more concrete to illustrate these costs with a specific example.
Let’s suppose I have a simple database with just one table in it. It’s a set of account numbers, names, and balances. Now just one table isn’t enough to really model a bank with necessary auditing and so forth but this example is already complicated enough to show some isolation concerns and how they turn into performance problems. Furthermore let’s say the isolation required is one of the most basic choices “READ_COMMITTED” meaning I’m never allowed to observe someone else’s data if they have not yet committed their transaction – so no intermediate results.
Now let’s suppose I’m a big customer of this bank and I have 100 accounts, conveniently numbered 1000, to 1099. Some of the accounts are in my name and some of them are in my wife’s name. I want to know the total amount of money in the accounts in my name. In SQL it would look like this
where account.id > = 1000 and account.id <= 1099 and account.name = ‘Rico’
The situation doesn’t get much easier than that. We could assume there is a nice index on the account table by ID so that all those accounts are (nearly) contiguous and with one scan of just my slice of the data we can get the answer. That’s about the best locality we could hope for.
Now you can imagine that you are the little database engine, you go and read account #1000, check the name, find that it’s Rico, and add the balance. Move on to #1001, check the name, add the balance. Chug chug chug.
Now let me complicate things just a little bit. While this is happening, and we’re on account #1050 (chug chug) another user comes along and deposits money into one of my accounts. Let’s say its account #1060. Is this a problem? Well no it isn’t. Since I haven’t yet read the contents of account #1060 when I get there I will find the new balance and all is well. Or is it?
Here’s a subtle point: the total that I get isn’t guaranteed to be the total that I would have gotten if the whole query had run at the instant that that it started (you could imagine such a system but that’s unusual) the only guarantee is that we will get some total that represents the sum of the balances as they could have existed at some moment where all the data was committed. So here we’ve created valid total, an especially interesting one because it shows the balance at the instant the query finishes.
Great, so we can write into areas we have not yet read without any trouble.
In fact we could even do something a little more complicated. Suppose we did a transfer of $100 from one of my accounts, #1060, to another account, #1070. We’re still in good shape because neither of those two accounts has yet been read and so the sum will be computed correctly.
Hey, this isolation thing doesn’t seem very hard so far! I wanted to subtract money from #1060 and put in #1070 and everything is great. But what if I had wanted to put the money in account #1030?
Now I have a problem, #1030 has already been read. If I allow the money to be moved then the sum calculation will be off by $100 because the $100 is gone from #1060 which I have not yet read and it was missing from #1030 which I have read. Ooops.
What do we do about this? Well there are many approaches, I will pick one for this example, the key thing to remember is that all the approaches have costs.
You might think that what the database must do when the query begins is lock everything that is going to be read and prevent anyone/everyone from writing to those things. That could work, if only it could predict, accurately, what it is that will be read. Or it could lock more than will be read, hopefully not much more, that could work too. But keep in mind these things:
#1 predicting the future is hard (e.g. in this example, which rows belong to me and which belong to my wife?)
#2 we want to lock as little as possible so that as much can continue running as possible (e.g. all updates to my wife’s accounts would be fine since they do not affect the sum)
So we might end up at a different scheme – rather than guess the future, lets lock the past. When the transaction wrote to #1060, subtracting $100 that was just fine, so far. The contents of #1060 are still dirty so if the reader arrives there it must wait for the transaction to finish. Meanwhile, if the writer chooses to move the $100 to account #1070 all is well. #1070 has not been read yet, it isn’t locked, the writer moves the cash, commits the transaction and if the reader had been waiting for #1060 to finalize it will do so and the read operation can proceed. So perhaps the summing was paused for a moment but otherwise everything went well.
Notice that we already have taken a bunch of performance hits. We had to mark rows that were “dirty” and pending a transaction with some kind of write lock. We had to have readers checking the write locks and waiting (i.e. taking longer) if they encountered a lock. And we had to clean out the write locks whenever a transaction committed. But that was the easy case.
What if, after subtracting $100 from account #1060 we then attempted to add the $100 to account #1030. Lucky for us new “lock the past” isolation scheme says that row #1030, since it has already been read, now has a read-lock. That means we can’t write to it. But wait. #1060 has a write lock. So when the reader eventually gets to #1060 it will stop. It will not give up its read lock on #1030 so the write operation cannot complete, meanwhile the read operation on #1060 cannot complete and so the read long on #1030 well never be released.
You’ll notice I’ve managed to create a deadlock with just one writer in the system merely because of the need for isolation.
Now you can imagine other isolation methods that would not get a deadlock in this case but they have other costs and we’re not studying them. The point here is that there is a direct isolation cost, in any scheme, and now we’re about to see an example of an indirect cost.
Since the system is deadlocked, the database must now choose one of those two transactions (the reader or the writer) and abort it.
If it chooses the reader, then the write will be allowed to complete, the reader must retry the operation and then get the correct and accurate new sum based on the completed write transaction.
If it chooses the writer, then the reader may complete but first we must use the transaction log to restore the original contents account #1060 (i.e. the transaction does not commit, it aborts) and we get the correct sum. Meanwhile the writer must retry its operation.
Can you see indirect costs? Probably the most significant is that either the read or the write must be retried. Now these were both tiny cases but you can imagine if many accounts were implicated and if there was complicated math, sorting, etc. that redoing that read query could be costly and likewise the write query with many indices and if it spanned many tables could be rather complex, to say nothing of the business logic that might be required to redo the math. In this case it’s a simple add $100, subtract $100 but in a complex system it might be a tricky reservation computation, mileage credit and commission distribution – with auditing. Anything you have to redo is just wasted work.
Importantly, the above is a normal and natural part of using a database and nothing has gone “wrong” here and we can reach this vital conclusion. The more we read or write in one unit, the greater the amortized cost of isolation. The more we attempt to write in one unit the greater the chance that the entire write will have to be aborted (and then redone from scratch).
This leads us right into the next topic…
Unit of Work
More than anything else how much work you do in one chunk drives everything else. Sure you need good locality – via a proper schema and indexes – and yes you need to choose the right isolation technique, where you have options, but you can destroy a system by trying to do too much at once – and people do.
If you’ve been waiting for me to make the connection to Linq here it is: every data layer has to make choices about how much to read and when. It may seem like a good idea to do large amounts of pre-reading, and in fact when measured in isolation it may seem like you get better performance when doing so, but, like the prisoners’ dilemma, when composed with the other operations that are happening on the server you may find that making the best looking choice independently results in a poor situation for the system as a whole.
Should you make the choice that is best for one part of the system at the expense of the system as a whole? Probably not.
Linq to SQL is faced with very real choices around how much to read in one chunk and how much isolation to guarantee. Creating a solution where we appear to offer excellent but non-composable performance would simply be leading people down a path to disaster – and you know I always advocate the Pit of Success. The obvious way should work out well.
Once you’ve thought about unit of work you soon realize that you cannot afford to submit large transactions to your database – they are too likely to not commit successfully. So what do you do?
A classic thing to do is to break up those transactions into smaller pieces, but doing so creates a new set of intermediate stored results which must be valid. It’s a designed “workflow” for your uber-transaction where for instance every-line item in an order, or every 10 or something like that, is independently written and the order is flagged as “in flight.” In that world “in flight” orders are a normal healthy thing and your system has to be able to handle them appropriately – basically some intermediate states have become visible.
It sounds bad, but it’s essentially inescapable. The alternative is to *try* to commit gianormous transactions that really have no hope of *actually* committing with any kind of realiability. You may think that you’re going to get great performance by submitting all that work in one nice big chunk but you may find that all those savings are lost to the cost of all the extra retries you have to do to actually succeed.
And it gets worse.
If you have a client side component, like you do with Linq, then the data you have saved on the client might become “wrong” if something happens on the server that you don’t know about. You’ll need some kind of locking mechanism like Optimistic (what Linq uses) or Pessimistic. What this says is that before you write data back to the database you first re-verify that the data now in the database is still what it was before you made the update – if anything has changed out from under you then you throw an exception.
What does that mean? Well if you are writing a large amount of data and it has to be atomic that’s a bad thing because even if there wasn’t a database deadlock you still might find that some part of what you wrote has been altered. If you require that it all be as-it-was you may find you can never write anything, or that you often have to try 2, 3, 4, 5, 10, 50… who knows how many times to actually get the stuff to write.
What do you do about this?
We’re right back to the unit of work discussion. If you break those writes down into smaller chunks and make it so that you can write back in pieces – including some markers to show what is in flight and what is not – then you can “partly succeed” in your writes, even “mostly succeed” and when things go wrong because of a conflict you only need resolve those few records that actually did conflict and write them back to the database in your retry operation.
It is not wise to expect to successfully write thousands and thousands of rows in one operation and actually succeed on a production database under load.
Yes breaking those operations down will result in more round trips to the server and it may seem that such a thing performs more poorly (it will if measured alone) but those operations are much more composable with other things the server is going to be doing.
Your overall performance could be, almost certainly will be, A Lot Better (TM).
You’ll of course have to measure, in context.
One Last Warning
If you consider what I said, about the natural occurrence of failures in a database, then you’ll soon realize that it is *normal*, using Linq parlance, for db.SubmitChanges() to throw an exception from time to time. If you are trying to write a robust application with high reliability you need to think about that.
In addition to obvious things like, “the network went down”, “the database went down”, there are less obvious things like, “there was a deadlock”, “there was an optimistic lock conflict” that can and do happen. Those latter two things should be appropriately retried because *nothing bad has happened*. The strategy you choose, especially for cases where the optimistic lock failed, can have a profound impact on your performance and certainly you can’t just let those exceptions flow to the user. I think I can safely say that my mom doesn’t want to hear about how table X on connection A deadlocked with table Y on connection B.
If you’ve been reading carefully then you’ll see that it’s also “normal” for a foreach operation over a Linq query to fail from time to time – you need a retry strategy for those too to be fully robust.
Don’t get down on Linq though, those problems exist with all data solutions, the productivity benefits you get from Linq will go a long way to helping you to add the robustness you need in the areas you need it.
Don’t read “too much” at once. Don’t write “too much” at once. Handle deadlocks, they’re normal. Handle optimistic lock failures, they’re also normal. You should land in the Pit of Success.