Some thoughts/advice about databases and caching

As usual, kindly allow me to speak in rough terms so that I might be brief, keeping in mind that there are exceptions to what I write below.

I think the most important thing to remember about serving data from a database is that, as a practical matter, you can’t realistically serve data that you know to be absolutely current. Even if you somehow got the data from the database and magically formatted it in zero time, there are enough delays along the way that it might have changed by the time your end user sees it and tries to do something with it.

Your user is going to be seeing stale data; the only question is how stale.

Not only is the data going to be stale, it’s also likely to lack self-consistency.  For instance, unless you go to very great pains (and cost), when your user asks for “page 2” of a dataset via the “give me page 2” button, most algorithms for computing “page 2” can return duplicate items, or miss items entirely, if something that would have been on “page 1” is added or removed in the meantime.
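
To make the “page 2” problem concrete, here is a minimal sketch (using Python and an in-memory sqlite3 table, purely for illustration) contrasting OFFSET paging, which recomputes the window each time and so can duplicate or skip items when “page 1” changes, with keyset paging, which anchors on the last key the user saw:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO items (name) VALUES (?)",
                     [(f"item-{i}",) for i in range(1, 11)])
    PAGE_SIZE = 3

    # OFFSET paging: "page 2" is whatever rows 4..6 happen to be right now,
    # so an insert or delete on page 1 shifts items into or out of page 2.
    page2_offset = conn.execute(
        "SELECT id, name FROM items ORDER BY id LIMIT ? OFFSET ?",
        (PAGE_SIZE, PAGE_SIZE)).fetchall()

    # Keyset paging: "page 2" is the rows after the last id the user saw,
    # which keeps the page boundary stable even as page 1 churns.
    last_seen_id = 3  # the last id shown on "page 1"
    page2_keyset = conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, PAGE_SIZE)).fetchall()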

Yet despite these theoretical and practical limitations, real people interact with real databases every day and have satisfactory experiences.  This comes about because, in practice, good database analysts think carefully about the sorts of queries that will be made against their database, the elementary business operations that will change its contents, and the nature of the user interface that will be doing the asking.

Now at last we come to the performance part of this discussion  :)

While you’re thinking about the sort of experience you want your user to have, it’s vital to consider how much staleness and inconsistency you can tolerate, and how best to leverage that tolerance in your design to get the best overall experience (including performance) for your customers.

I’m sure many people are screaming about trading off correctness for performance at this point, but I’ll remind you that delivering 100% correct data to your customers is impossible as a practical matter – nobody could afford such a system.  So from there good engineering forces you to ask, “Since perfection is far too expensive, what guarantees should I be making?”

At the core of the database, the kinds of consistency guarantees you want to make will translate into factors like the size of your transactions and the breadth of allowed queries.  Having made those choices, you are then in a position to select a suitable isolation level to facilitate them.
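
For example, assuming a PostgreSQL database and the psycopg2 driver (the connection string and table names here are hypothetical), the isolation level can be chosen per session to match the guarantee each operation actually needs:

    import psycopg2

    conn = psycopg2.connect("dbname=shop")  # hypothetical database

    # Display-only queries can usually live with READ COMMITTED...
    conn.set_session(isolation_level="READ COMMITTED")
    with conn, conn.cursor() as cur:
        cur.execute("SELECT name, price FROM products WHERE id = %s", (42,))
        row = cur.fetchone()

    # ...while a multi-statement business operation that must see one
    # consistent snapshot pays for REPEATABLE READ instead.
    conn.set_session(isolation_level="REPEATABLE READ")
    with conn, conn.cursor() as cur:
        cur.execute("SELECT status FROM orders WHERE id = %s", (7,))
        cur.execute("SELECT SUM(quantity) FROM order_lines WHERE order_id = %s", (7,))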

The choices you make in the next layer, your business objects, tend to mirror the choices made at the database level.  At this level you might concern yourself with caching to avoid a lot of round-trips to the database; avoiding round-trips is what caching is all about.
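
Here is a minimal read-through cache sketch of the idea, with a plain dict standing in for whatever cache store you actually use and a hypothetical fetch_from_db callable doing the round-trip:

    _order_cache = {}

    def get_order(order_id, fetch_from_db):
        order = _order_cache.get(order_id)
        if order is None:                   # miss: one database round-trip
            order = fetch_from_db(order_id)
            _order_cache[order_id] = order
        return order                        # hit: zero round-trips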

Some things to consider:

Every caching strategy is based on not asking the database a question it otherwise would have had to answer, which ultimately increases the staleness of the data you serve.  Is a time-based staleness guarantee suitable for your application?  This approach is actually a lot more versatile than it might seem.
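
A small sketch of such a time-based guarantee, with illustrative names: every entry remembers when it was loaded, and the contract is only “at most ttl_seconds stale,” nothing stronger:

    import time

    class TtlCache:
        def __init__(self, ttl_seconds):
            self.ttl = ttl_seconds
            self.entries = {}  # key -> (loaded_at, value)

        def get(self, key, load):
            hit = self.entries.get(key)
            if hit is not None and time.monotonic() - hit[0] < self.ttl:
                return hit[1]              # possibly stale, but within contract
            value = load(key)              # too old or absent: refetch
            self.entries[key] = (time.monotonic(), value)
            return value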

What cache-hit ratio do you need for your cache to be a net win?  Remember that SQL itself caches the underlying data, which partly competes with any caching you might do yourself.  Given the mix of queries you expect to arrive at your business layer, what hit rate would you need to actually see a benefit?
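
A back-of-the-envelope model makes this concrete (the costs below are purely illustrative, not measurements):

    # If a cache probe costs c_cache, a database round-trip costs c_db, and
    # a miss pays for both, the cache wins only when the hit rate h satisfies
    #     h * c_cache + (1 - h) * (c_cache + c_db) < c_db
    # which simplifies to h > c_cache / c_db.
    def break_even_hit_rate(c_cache, c_db):
        return c_cache / c_db

    print(break_even_hit_rate(0.1, 2.0))  # 0.05: a slow query needs only >5% hits
    print(break_even_hit_rate(0.1, 0.3))  # ~0.33: a query SQL already answers from
                                          # memory needs >33% hits to be worth caching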

Caching is no substitute for a well-considered indexing strategy on the server itself.  Enormous caches should be the province of the database; their presence elsewhere often indicates that not enough attention has been paid to the schema and indexes.
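
As a small illustration (using sqlite3; the exact plan text varies by engine and version), a covering index turns a full scan into a seek, which no application-side cache improves on for fresh data:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders "
                 "(id INTEGER PRIMARY KEY, customer_id INT, total REAL)")

    # Without an index the plan is a full table scan...
    print(conn.execute("EXPLAIN QUERY PLAN "
                       "SELECT total FROM orders WHERE customer_id = 7").fetchone())

    # ...with a covering index it becomes an index seek.
    conn.execute("CREATE INDEX ix_orders_customer ON orders(customer_id, total)")
    print(conn.execute("EXPLAIN QUERY PLAN "
                       "SELECT total FROM orders WHERE customer_id = 7").fetchone())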

Is it fairly easy to detect when your cache is “too stale,” or does that require a complex computation?  If it requires complex calculations you might want to try something simpler, or you might want to reverse the logic so that the server notifies you of potential changes – but beware, complicating the back end can add friction to the whole system, making it very difficult to do large batches of updates, for instance.  Complex notification systems are frequently not worth the effort or the expense, and of course they do not necessarily provide a 100% up-to-date guarantee either.
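
One cheap pattern, sketched here with a hypothetical row_version column that writers bump on every change: revalidate a cached entry with one tiny indexed query instead of refetching the whole object:

    def still_fresh(conn, cached):
        # One small indexed read answers "too stale?" without rebuilding
        # the cached object; writers bump row_version on every change.
        row = conn.execute("SELECT row_version FROM orders WHERE id = ?",
                           (cached["id"],)).fetchone()
        return row is not None and row[0] == cached["version"]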

One caching policy is not likely to be suited to all your application’s needs.  For instance, you can probably tolerate an astonishingly large staleness window on data you use for things like converting a “State ID” to a “State Name,” but I could hardly recommend the same for information associated with an “Order ID.”
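
Reusing the TtlCache sketch from above, that might look as simple as one policy per class of data (the TTL numbers are illustrative, not recommendations):

    state_names  = TtlCache(ttl_seconds=24 * 60 * 60)  # reference data: hours of staleness is fine
    order_status = TtlCache(ttl_seconds=2)             # live business data: seconds, if cached at all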

A cache that holds “partly cooked” data that can be assembled or otherwise quickly processed into “fully cooked” results is often preferable.  The partly cooked results can be used in a greater variety of queries and often take less space, increasing the effective size of the cache.  In web servers this means caching fragments of pages, or caching the raw data from which the HTML can be quickly formatted, in preference to caching entire HTML pages.
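
Sketching that with the TtlCache from above (table, column, and function names are illustrative): cache the compact raw rows and do the cheap final formatting per request, so the same cached data can feed a full page, a fragment, or an API response:

    product_rows = TtlCache(ttl_seconds=60)  # compact raw rows, not finished HTML

    def render_product_table(conn, category):
        rows = product_rows.get(category, lambda c: conn.execute(
            "SELECT name, price FROM products WHERE category = ?",
            (c,)).fetchall())
        cells = "".join(f"<tr><td>{name}</td><td>{price:.2f}</td></tr>"
                        for name, price in rows)
        return "<table>" + cells + "</table>"  # final formatting is the cheap part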

If a cache will be a key feature of your business layer, be sure to optimize the underlying schema so that it is suitable for the cache-populating queries rather than for accessing singleton objects.  This is akin to designing DRAMs so that they are best at filling the CPU's cache lines rather than at byte-wise access.
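
A sketch of what that means for the query shape (names again illustrative): populate the cache with one range query per “cache line” of related objects, rather than one singleton lookup at a time; an index on (category, id) would make this the cheap access path:

    def populate_category(conn, cache, category):
        rows = conn.execute(
            "SELECT id, name, price FROM products WHERE category = ? ORDER BY id",
            (category,)).fetchall()
        for pid, name, price in rows:  # one round-trip fills many cache entries
            cache[pid] = (name, price)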

Summary

Select a simple caching mechanism that produces a predictable and satisfactory customer experience, does not add much in the way of complexity, and co-operates well with the underlying schema.

Hold the mayo :)