ObjectSpaces: The Devil is in the Demand

On object persistence I’m object persnickety.  I’ve looked at this problem space quite a bit over the last five years.  I’ve built systems and frameworks galore that provide a multitude of different styles of object-relational mapping with a variety of different object representations.  I’ve coined the phrase ‘object span’ to describe the ability to front-load a graph-based query with enough information to cherry-pick the best bits of data, so data retrieval is optimized for the application.  I’ve even written about that here.  Yet I’ve never felt one hundred percent right about any of it.  Not the span exactly, but the whole concept of systems designed to retrieve segments of a data graph into the client space, with interceptors (or swizzlers) that fault in more segments upon demand.  I’ve come to the conclusion that demand loading is just a terribly bad idea.
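
To make the span idea concrete, here is a rough sketch of what I mean.  The API below is made up for illustration (it is not ObjectSpaces’ actual query syntax): the caller declares up front which parts of the graph the query should bring back, so nothing needs to be faulted in later.

    // Hypothetical span-style query: the shape of the fetch is declared up front,
    // so the whole graph comes back in one round trip.
    ObjectSpan span = new ObjectSpan(typeof(Customer));
    span.Include("Orders");             // pull each customer's orders eagerly
    span.Include("Orders.LineItems");   // ...and the line items for those orders

    IList<Customer> customers = session.Query<Customer>("Region = 'WA'", span);

    // Contrast with demand loading: this same property walk would otherwise
    // trigger hidden database queries at whatever moment the code happened to run.
    decimal firstPrice = customers[0].Orders[0].LineItems[0].Price;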

 

Now the way I look at it is that data stored on the server is of course the data of record.  Any fragment of that data that you pull down to another machine or another process space is by its very essence just a copy of the original, and it’s a stale copy to boot.  Unless you enforce pessimistic concurrency by locking out anyone else whenever you even glance at a record, you are always dealing with stale data.  If the fragment of data you bring back is actually composed of records from a variety of tables, through relationships and collections available on your objects, then you have an even worse problem.  You might as well lock the whole database and serialize any and all interaction with it.  That is, if you want to keep up the charade that your client copy of the data is somehow accurate.

 

That’s what you’d have to do, or a close approximation, if you wanted any reliability when it comes to faulting in data that was not retrieved with your original query.  But this is exactly what we keep on thinking we can do when we design our objects with built-in behaviors that try to govern constraints over the data and its relationships.  Certainly, with perfect knowledge, this is absolutely the right thing to do.  When designing objects for your application that only exist in memory, this is possible; but for objects that exist logically in the database and only transiently in memory, any such rules baked into the object can never honestly be upheld.  You just don’t have all the data.  You don’t know the current state.  You don’t know that a collection of related objects has changed membership.  How can you control that membership in code on the client, unless you’ve frozen out every other possible interaction?
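
As a made-up example of why such a rule can’t honestly be upheld on the client: suppose a business rule says a project may have at most five members, and the object tries to enforce it against whatever copy of the membership it happens to hold.

    using System;
    using System.Collections.Generic;

    public class Project
    {
        // Only the membership as of the original query, not the current state.
        private readonly List<string> members = new List<string>();

        public void AddMember(string name)
        {
            // The rule: no more than five members per project.  This check can pass
            // against the stale local collection while another client has already
            // pushed the real count past the limit on the server.
            if (members.Count >= 5)
                throw new InvalidOperationException("Project is full.");

            members.Add(name);
        }
    }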

 

Sure, you can wrap your client code in a transaction and rely on optimistic concurrency to throw back anything that fundamentally violates these constraints.  But those violations are evaluated on the server, not in your code.  Your code will try to judge these things long before submitting the lot to the back end.  The back end can only catch violations that you missed; it can’t help you undo the decisions your code made by enforcing constraints against stale information.
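
For what it’s worth, that server-side evaluation usually boils down to something like a version check at update time.  A sketch, assuming a RowVersion timestamp column; the table and variable names here are just illustrative:

    using System.Data;
    using System.Data.SqlClient;

    const string sql =
        @"UPDATE Orders
          SET Status = @newStatus
          WHERE OrderId = @orderId AND RowVersion = @originalRowVersion";

    using (SqlCommand cmd = new SqlCommand(sql, connection))
    {
        cmd.Parameters.AddWithValue("@newStatus", newStatus);
        cmd.Parameters.AddWithValue("@orderId", orderId);
        cmd.Parameters.AddWithValue("@originalRowVersion", originalRowVersion);

        // Zero rows affected means someone changed the row after you read it.
        // The server can reject the update, but it cannot repair decisions your
        // client code already made against the stale copy.
        if (cmd.ExecuteNonQuery() == 0)
            throw new DBConcurrencyException("Row was modified by another user.");
    }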

 

It seems self-evident, therefore, that you can only perform constraint checking over data you know you have, all of it retrieved in a consistent state at the same time.  This might make you think you can add constraints to property accessors to make certain you don’t modify a field in a way that would violate some simple constraint check.  This makes sense for data-type constraints, like the range of values legal for a particular integer.  But it breaks down if the constraint evaluates against other data as well, even if that data was originally all consistent.  You can’t just restrict piecewise modifications to the data.  You’ve got to leave the programmer some wiggle room to work the data before submitting it as valid.  You see this all the time with poorly written data-entry systems: the ones that throw a fit if you enter an invalid value and won’t let you move on until you fix it.  But the value was only invalid given the state of another field, one you can’t get to yet.  These kinds of things drive me batty.  The developer has to be able to choose when to apply the constraints to check for validity.  It should not be automatic.
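
Here’s roughly what I mean, with made-up names: the data-type constraint lives in the setter, but the cross-field rule waits for an explicit validation step that the developer invokes when the data is actually ready.

    using System;

    public class Reservation
    {
        private int partySize;

        public int PartySize
        {
            get { return partySize; }
            set
            {
                // A pure data-type constraint is fine to enforce immediately.
                if (value < 1 || value > 100)
                    throw new ArgumentOutOfRangeException("value");
                partySize = value;
            }
        }

        public DateTime Start;
        public DateTime End;

        // A cross-field rule (End must follow Start) is only checked when the
        // caller says the object is ready, not on every assignment, which would
        // forbid perfectly reasonable intermediate states while editing.
        public void Validate()
        {
            if (End <= Start)
                throw new InvalidOperationException("End must be after Start.");
        }
    }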

 

What I’m getting at is that the only place with enough perfect knowledge to enforce constraints beyond simple data-type constraints is the place that has all the information about the data.  In a client-server system, that place is the server.  This even pertains to code running on the server, even in the same process as the server, if the data it works on is a stale copy of the official data.  Therefore there is no use in pretending that your client objects are something more than they are.  They are a working copy of information, and you can make no inferences about its validity until the whole wad is packaged up and sent back to the server.

 

Still, you might think it reasonable that your client objects employ demand-loading features just so you don’t have to bake into the system what data is retrieved all at once.  I agree that baking it in is a terribly bad thing.  Applications tend to have a variety of data usage patterns, so optimizing for one generally makes all the others unbearable.  But by demand loading you are implying that data fetched now is intrinsically the same as, and as good as, data retrieved in the original request.  Yet, even though you could have inferred some sort of consistency in the data when it was retrieved all at once, you can no longer do this with data that is retrieved at just any old when.  Demand loading perpetuates the myth that the client objects somehow represent accurate proxies into the server data.
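
Put in code terms (hypothetical API again): a demand-loaded collection resolves whenever it happens to be touched, so the graph quietly ends up mixing data from two different moments.

    // Hypothetical lazy-loading ORM: the customer's scalar fields are read at time T0...
    Customer customer = session.Get<Customer>(42);

    // ...minutes pass; other users commit changes...

    // ...and the first touch of Orders issues a brand new query at time T1.
    // Nothing in the object model tells you the graph now straddles two
    // points in time.
    foreach (Order order in customer.Orders)
        Console.WriteLine(order.Total);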

 

If this doesn’t make the hair stand up on the back of your neck, think about all the potential security problems that go with objects that inherently carry around context allowing them to reach back into the database to get more data.  If you ever intended to hand off any of these objects to some piece of untrusted code, you can just forget it.  You’d be handing off your credentials along with them.
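
In sketch form, the worry is that a demand-loading object has to carry something like this around inside it (the types and the loader here are hypothetical):

    using System.Collections.Generic;
    using System.Data;

    public class Customer
    {
        private readonly IDbConnection connection;   // the context travels with the object
        private List<Order> orders;                  // not loaded yet

        public Customer(IDbConnection connection)
        {
            this.connection = connection;
        }

        public IList<Order> Orders
        {
            get
            {
                // Any code holding a reference to this object, trusted or not,
                // can make the database do work under your identity simply by
                // touching this property.
                if (orders == null)
                    orders = OrderLoader.Load(connection);
                return orders;
            }
        }
    }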

 

So in reality, you don’t want to make any claims about the integrity of your client objects.  You really just want these objects to represent the result of your query and nothing more.  You still want to represent relationships to other objects as properties and collections, but you don’t want these in-memory references to imply anything that might be construed as consistency with the back end.  You want this data to simply be the data you asked for from the server at the time you asked for it.  Beyond that, since you can’t accurately apply constraints on the client, the user of these objects should be able to modify them to their heart’s content, changing values, adding and removing objects from collections, and none of it should mean anything until you choose to submit some of those changes to the back end.

 

So to summarize:

1) span good, demand load bad.

2) client objects are just data, not behavior.

 

Of course, this is not the ideal that ObjectSpaces is offering.  You can do it, however, by just ignoring the demand-loading data types and instead sticking to simple generics for collections (List<T>) and plain object references for 1-to-1 relationships.
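
In practice that just means shaping the client classes as plain data, something along these lines (a sketch of the idea, not actual ObjectSpaces code):

    using System.Collections.Generic;

    // Plain data classes: no lazy proxies, no constraint enforcement, no way to
    // reach back into the database.  They hold exactly what the query returned.
    public class Customer
    {
        public int CustomerId;
        public string Name;
        public List<Order> Orders = new List<Order>();   // filled (or not) by the query
    }

    public class Order
    {
        public int OrderId;
        public decimal Total;
        public Customer Customer;                         // plain 1-to-1 reference
    }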

 

Let me know what you think.

 

Matt