Performance and Design Guidelines for Data Access Layers

Many problems you will face are actually the building data access layer, sometimes thinly disguised, sometimes in your face; it’s one of the broad patterns that you see in computer science – as the cliché says: it keeps rearing its ugly head.

Despite this, the same sorts of mistakes tend to be made in the design of such systems so I’d like to offer a bit of hard-won advice on how to approach a data access problem. Mostly this is going to be in the form of patterns/anti-patterns but nonetheless I hope it will be useful.

As always, in the interest of not writing a book, this advice is only approximately correct.

The main thing that you should remember is that access to the data will take two general shapes.  In database parlance you might say some of the work will be OLTP-ish (online transaction processing) and some of the work will be OLAP-ish (online analytical processing).   But simply, there’s how you update pieces of your data and how your read chunks of it.  And they have different needs.

At present it seems to me that people feel a strong temptation to put an OO interface around the data and expose that to customers.  This can be ok as part of the solution if you avoid some pitfalls, so I suggest you follow this advice:

1. Consider the unit of work carefully

There are likely to be several typical types of updates.  Make sure that you fetch enough data so that the typical cases do one batch of reads for necessary data, modify the data locally, and then write back that data in a batch.  If you read too much data you incur needless transfer costs, if you read too little data then you make too many round trips to do the whole job.

You may have noticed that I began with a model where you fetch some data, change it locally, and write it back.  This is a fairly obvious thing to do given that you are going to want to do the write-back in probably a single transaction but it’s important to do this even if you aren’t working in a transacted system.  Consider an alternative:  if you were to provide some kind of proxy to the data to each client and then RPC each property change back to the server you are in a world of hurt.  Now the number of round trips is very high and furthermore it’s impossible to write correct code because two people could be changing the very same object at the same time in partial/uncoordinated ways. 

This may seem like a silly thing to do but if the authoritative store isn’t a database it’s all too common for people to forget that the database rules exist for a reason and they probably apply to any kind of store at all.  Even if you’re using (e.g.) the registry or some other repository you still want to think about unit-of-work and make it so the each normal kind of update is a single operation.

Whatever you do don’t create an API where each field read/write is remoted to get the value.  Besides the performance disaster this creates it’s impossible to understand what will happen if you several people are doing something like Provider.SomeValue += 1;

2. Consider your locking strategy

Implicit in the discussion above is some notion of accepting or rejecting writes because the data has changed out from under you.  This is a normal situation and making it clear that it can and does happen and should be handled makes everyone’s life simpler.  This is another reason why an API like Provider.SomeValue = 1 to do the writes is a disaster.  How does it report failure?  And if it failed, how much failed?

You can choose an optimistic locking strategy or something else but you’ll need one.  A sure sign that you have it right is that the failure mode is obvious, and the recovery is equally obvious. 

I once had a conversation with Jim Gray where I told him how ironic it was to me that the only reason transactions could ever succeed at all in a hot system was that they had the option of failing.  Delicious irony that.

Remember, even data from a proxy isn’t really live.  It’s an illusion.  The moment you say “var x = provider.X;”  your ‘x’ is already potentially stale by the time it’s assigned.  Potentially stale data is the norm, it’s just a question of how stale and how do you recover. That means some kind of isolation and locking choice is mandatory.

3. Don’t forget the queries

Even if you did everything else completely correctly you’ve still only built half the system if all you can do is read and modify entities.  The OLAP part of your system will want to do bulk reads like “find me all the photos for this week”.  When doing these types of accesses it is vital to take advantage of their read aspect.  Do not create transactable objects just bring back the necessary data in a raw form.  Simple arrays of primitives are the best choice; they have the smallest overheads.  Do not require multiple round-trips to get commonly required data or the fixed cost of the round trip will end up dwarfing the actual work you're trying to do.

These queries are supposed to represent a snapshot in time according to whatever isolation model your data has (which comes back to the requirements created by your use cases and your unit of work).  If you force people to use your object interface to read raw data you will suffer horrible performance and you will likely have logically inconsistent views of your data.  Don’t do that.

One of the reasons that systems like Linq to SQL were as successful as they were (from various perspectives I think) is that they obeyed these general guidelines:

  • you can get a small amount of data or a large amount of data
  • you can get objects or just data
  • you can write back data chunks in units of your choice
  • the failure mode for read/writes is clear, easy to deal with, and in-your-face (yes, reads can fail, too)

Other data layers, while less general no doubt, would do well to follow the same set of rules.