ObjectSpaces: The Devil is in the Demand


On object persistence I’m object persnickety. I’ve looked at this problem space quite a bit over the last five years. I’ve built systems and frameworks galore that provide a multitude of styles of object-relational mapping over a variety of object representations. I’ve coined the phrase ‘object span’ to describe the ability to front-load a graph-based query with enough information to cherry-pick the best bits of data, so retrieval is optimized for the application. I’ve even written about that here. Yet I’ve never felt one hundred percent right about any of it. Not the span exactly, but the whole concept of systems designed to retrieve segments of a data graph into the client space, with interceptors (or swizzlers) that fault in more segments on demand. I’ve come to the conclusion that demand loading is just a terribly bad idea.

Now the way I look at it is that data stored in the server is of course the data of record. Any fragment of that data that you pull down to another machine or another process space is by its very essence just a copy of the original, and a stale copy to boot. Unless you enforce pessimistic concurrency by locking out everyone else whenever you even glance at a record, you are always dealing with stale data. If the fragment you bring back is actually composed of records from a variety of tables, stitched together through the relationships and collections on your objects, then you have an even worse problem. You might as well lock the whole database and serialize any and all interaction with it. That is, if you want to keep up the charade that your client copy of the data is somehow accurate.

That’s what you’d have to do, or a close approximation, if you wanted any reliability when it comes to faulting in data that was not retrieved with your original query. But this is exactly what we keep on thinking we can do when we design our objects with built-in behaviors that try to enforce constraints over the data and its relationships. Certainly, with perfect knowledge, that is absolutely the right thing to do. For objects that only ever exist in memory it is even possible. But for objects that exist logically in the database and only transiently in memory, any such rules baked into the object can never honestly be upheld. You just don’t have all the data. You don’t know the current state. You don’t know that a collection of related objects has changed membership. How can you control that membership in code on the client, unless you’ve frozen out every other possible interaction?

Sure, you can wrap your client code in a transaction and rely on optimistic concurrency to throw back anything that fundamentally violates these constraints. But those violations are evaluated on the server, not in your code. Your code will try to judge these things long before submitting the lot to the back end. The back end can only catch constraints that you missed; it can’t help you undo constraints that you enforced incorrectly with stale information.
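
To make that concrete, here is a minimal sketch of an optimistic check at submit time. The table, the rowversion-style Version column, and the helper are my own illustration, not anything ObjectSpaces prescribes; the point is that the server is the arbiter, and a stale copy shows up as zero matched rows.

```csharp
using System.Data.SqlClient;

public static class OptimisticUpdate
{
    // Returns false if the row changed since we read it: zero rows match
    // the original version. The server caught the staleness, not the client.
    public static bool TrySubmit(SqlConnection conn, int id, string newName, byte[] originalVersion)
    {
        using (var cmd = new SqlCommand(
            @"UPDATE Customer SET Name = @name
              WHERE Id = @id AND Version = @version", conn))
        {
            cmd.Parameters.AddWithValue("@name", newName);
            cmd.Parameters.AddWithValue("@id", id);
            cmd.Parameters.AddWithValue("@version", originalVersion);
            return cmd.ExecuteNonQuery() == 1; // 0 => stale copy; re-fetch and retry
        }
    }
}
```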

It seems self-evident, therefore, that you can only perform constraint checking over data you know you have, retrieved together in a single consistent state. This might make you think you can add constraints to property accessors that make certain you don’t modify a field in a way that would violate some simple check. That makes sense for data-type constraints, like the range of values legal for a particular integer. But it breaks down if the constraint evaluates against other data as well, even if that data was originally all consistent. You can’t just restrict piecewise modifications to the data. You’ve got to leave the programmer some wiggle room to work the data before submitting it as valid. You see this all the time with poorly written data-entry systems: the ones that throw a fit if you enter an invalid value and won’t let you progress until you fix it, when the value was only invalid given the state of another field, which you can’t get to yet. These kinds of things drive me batty. The developer has to be able to choose when to apply the constraints to check for validity. It should not be automatic.
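
Here’s a small hypothetical example of why setter-level checks break down for cross-field constraints; the class and the rule are invented for illustration:

```csharp
using System;

public class Reservation
{
    // No cross-field checks in the setters; editing may need to pass
    // through states where the pair is temporarily inconsistent.
    public DateTime CheckIn { get; set; }
    public DateTime CheckOut { get; set; }

    // The caller decides when the object must be consistent.
    public void Validate()
    {
        if (CheckOut <= CheckIn)
            throw new InvalidOperationException("CheckOut must follow CheckIn.");
    }
}

public static class Demo
{
    public static void Main()
    {
        var r = new Reservation
        {
            CheckIn = new DateTime(2004, 5, 1),
            CheckOut = new DateTime(2004, 5, 3)
        };

        // Move the whole reservation a month out. After the first assignment,
        // CheckIn is later than CheckOut; a setter-enforced constraint would
        // reject a perfectly reasonable edit sequence here.
        r.CheckIn = new DateTime(2004, 6, 1);
        r.CheckOut = new DateTime(2004, 6, 4);

        r.Validate(); // apply the constraint only when the programmer says so
    }
}
```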

What I’m getting at is that the only place with perfect enough knowledge to enforce constraints beyond simple data-type constraints is the place that has all the information about the data. In a client-server system, that place is the server. This even pertains to code running on the server, even in the same process as the server, if the data it holds is a stale copy of the official data. Therefore there is no use in pretending that your client objects are something more than they are. They are a working copy of information, about whose validity you can infer nothing until the whole wad is packaged up and sent back to the server.

Still, you might think it reasonable that your client objects employ demand-loading features just so you don’t have to bake into the system what data is retrieved all at once. I agree that baking it in is a terribly bad thing; applications tend to have a variety of data usage patterns, so optimizing for one generally makes all the others unbearable. But by demand loading you are implying that data fetched now is intrinsically the same as, and as good as, data retrieved in the original request. Yet, even though you could have inferred some sort of consistency in the data when it was retrieved all at once, you can no longer do this with data that is retrieved at just any-old-when. Demand loading perpetuates the myth that the client objects somehow represent accurate proxies into the server data.

If this doesn’t make the hair stand up on the back of your neck, think about all the potential security problems that go with objects that inherently carry around the context that allows them to reach back into the database for more data. If you ever intended to hand off any of these objects to some piece of untrusted code, you can just forget it. You’d be handing off your credentials along with them.

So in reality, you don’t want to make any claims about the integrity of your client objects. You really just want these objects to represent the result of your query and nothing more. You still want to represent relationships to other objects as properties and collections, but you don’t want these in-memory references to imply anything that might be construed as consistency with the back end. You want these objects to simply represent the data you asked for from the server at the time you asked for it. Beyond that, since you can’t accurately apply constraints on the client, the user of these objects should be able to modify them to their hearts’ content, changing values, adding and removing objects from collections, and none of it should mean anything until you choose to submit some of those changes to the back end.

So to summarize:


1) span good, demand load bad.


2) client objects are just data, not behavior.

Of course, this is not the ideal that ObjectSpaces is offering. You can do it, however, by just ignoring the demand-loading data types and instead sticking to simple generics for collections: List&lt;T&gt;, and plain object references for one-to-one relationships.
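
A sketch of what such plain data classes might look like; the shapes here are my own illustration, not anything ObjectSpaces generates:

```csharp
using System.Collections.Generic;

// Plain data, no behavior: the collections and references hold exactly what
// the query returned, and nothing faults in more data behind your back.
public class Customer
{
    public int Id;
    public string Name;

    // Just the orders the query retrieved; no claim that this is the
    // complete, current set on the server.
    public List<Order> Orders = new List<Order>();
}

public class Order
{
    public int Id;
    public decimal Total;

    // Plain object reference for the to-one side: null if the query
    // didn't span to the customer, never a lazy proxy.
    public Customer Customer;
}
```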

Let me know what you think.

Matt

Comments (25)

  1. Interesting, Matt!

    Let’s assume for a moment that demand load isn’t used. Would you still say "client objects are just data, not behavior"?

    Let’s assume that compositions (or aggregations, if that’s the word you prefer) are always loaded in total, and loose associations are demand loaded. Also assume that this knowledge is used when writing the behavior. Do you think that changes the picture?

    Best Regards,

    Jimmy

    http://www.jnsk.se/weblog/

  2. Matt says:

    I generally think the idea of ‘in total’ is dubious to try to achieve. You often only want to see a portion of related objects in any given application. For example, say you relate a customer and his orders together, using a collection to represent the customer’s orders. The idea of totality works okay for the average customer with only a few orders, but then there are a few with quite a large number, and your app might only need to be concerned with the most recent. So you wouldn’t want to imply that the orders collection always strictly refers to the entire set of orders for a given customer. That would preclude ever using that particular data class/shape to represent the customer and only a subset of orders, which might be a common usage pattern.

    Good point, true… But what if you weren’t allowed to make changes to the order without first loading the complete composition? That is, you can first fetch a list of customers, but to update one of them, you would have to demand load the orders (and in the same operation also refresh the customer). Then that might deal with half of the problem, or?

    The other half, that the orders are often too many for this to be a good solution… Well, perhaps the orders of a historic nature aren’t in the composite at all, but could be fetched via an association. The "active", or interesting, orders are in the composition. I know, I tried to come up with a design – which some would probably call twisted – for avoiding the problem, and there will certainly be situations where this isn’t as "easy". On the other hand, that is perhaps just the way I think it should be done. I mean, with specific design. General solutions are often too lame.

    🙂

    Best Regards,

    Jimmy

    http://www.jnsk.se/weblog/

  4. Edward says:

    Forgive me, I’ve never actually tried any of this.

    Would it not be possible to "check-out" the dataset of an object from the DB and set up a trigger so that a message can be sent to the client whenever some of the data behind an object is altered in the DB?

    The client code could decide whether or not to refresh the object state from the DB before it tries to demand-load another part of the graph.

    The user could be told "someone else has just changed this data, do you want to see what changes they have made", or the program could have some default behavior.

    It would be less overhead than pessimistic locking, but would reduce the chance of the client breaking a constraint when submitting an update to a database that has been altered since the initial data was retrieved.

    You can’t guarantee that a client would get the message, so you still have to constraint-check all the updates, but it might make things a bit smoother.

    With Yukon I’d have thought such things were doable, since you can create new triggers within queries.

    I expect a lot of research has already been done on such a solution; is there any particular reason why it wouldn’t be effective?

  5. If I understand you correctly, the constraints problem is not solved with span loading.

    Suppose you have a business rule that says that the Customer.CreditLimit needs to be higher than the sum of all the Invoice.Totals for the customer.

    Suppose you load an existing Invoice with its InvoiceLines and with the Invoice.Customer object (using span loading). You change some values in the lines. Then you compare the total with the CreditLimit, and it satisfies the rule, so you then persist the changes.

    If someone else changed the CreditLimit after you read it, then your constraint could no longer be satisfied, but you won’t know.

    I see two ways of solving this. One is to do optimistic locking on the Customer object even if it’s not updated.

    The other is locking the Customer record by reading it again with a pessimistic lock, performing the check, and committing.
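
    Roughly, in code (a sketch only; the table and column names follow the example above, and the UPDLOCK hint is one SQL Server way to hold the lock until commit):

    ```csharp
    using System;
    using System.Data.SqlClient;

    public static class CreditCheck
    {
        // Re-read the credit limit under a lock inside the same transaction
        // that persists the invoice, so the check and the commit both see
        // the same CreditLimit value. newTotal stands in for the recomputed
        // sum of the customer's invoice totals.
        public static void SubmitInvoice(SqlConnection conn, int customerId, decimal newTotal)
        {
            using (var tx = conn.BeginTransaction())
            {
                var check = new SqlCommand(
                    @"SELECT CreditLimit FROM Customer WITH (UPDLOCK)
                      WHERE Id = @id", conn, tx);
                check.Parameters.AddWithValue("@id", customerId);
                decimal limit = (decimal)check.ExecuteScalar();

                if (newTotal >= limit) // rule: CreditLimit must exceed the sum
                {
                    tx.Rollback();
                    throw new InvalidOperationException("Credit limit exceeded.");
                }

                // ... UPDATE the invoice rows under the same transaction ...
                tx.Commit();
            }
        }
    }
    ```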

    I don’t see how span loading can solve these scenarios.

    Regards,

    Andres

  6. An addition.

    Doing optimistic locking on the Customer even if it’s not updated is not really what you want to do. You just want to use the value stored in the database at the moment you persist, so the only good solution is the second one.

  7. Matt Warren says:

    You may have misunderstood me. I was making the point that you cannot implement the constraints correctly on the client, so there is no need to pretend the client objects are a consistent replica of the server data. This means no demand loading, which leaves only spans as a way to query for a non-trivial section of a graph.

  8. Matt,

    OK, but spans combined with a pessimistic lock to check the constraints before updating the data, am I right?

    Also, if you do optimistic locking on data that you are not updating but need to use in your constraint checking, then your constraints will be implemented correctly even with delay loading (if the data is different in the database, you’ll get an exception), but at a high price for the user, who will need to deal with this exception.

    What am I missing?

  9. Frans Bouma says:

    While I agree with you that data in an object should be treated as a copy, and thus by definition a stale piece of data, you make a mistake along the way in your conclusions.

    If I load a customer object into memory, I know that it is a copy. If I then want to read its order rows, I can load these on demand. Now, what am I asking when I read the orders at time T? ALL orders of that given customer. Because I load them at time T, I have the most optimal set of orders for that customer I can get. Say user U1 loads that customer, gets some coffee after that. User U2 creates in the mean time a new order for that customer. U1 comes back, and clicks open the ‘Orders’ collection in the gui. Voila, U1 sees the order U2 just created. With spans, this would not have happened and the data would be less accurate.

    I say ‘less accurate’ and not ‘wrong’ because, as you described correctly, any data moved outside the datastore is stale. It now comes down to the following things:

    1) every user of .NET who wants to read/write data from/into a database has to know that the data in memory is stale.

    2) To keep the effect of stale data as small as possible, data should be read as late as possible so the copy in core is more likely to reflect the data in the database

    3) to keep the effect of stale data as small as possible, developers should build in functionality locking, i.e. locking at the application level of functionality, so users don’t step on each other’s toes by doing the same actions on the same data (which forces organizations to schedule work more efficiently, which in the end makes the software more efficiently applied).

    Spans also don’t free you from stale data: every element in a span is a union, which might look like a one-query action but isn’t. Furthermore, spans can be horrible in performance in situation A and fast in situation B, while load on demand can be fast in A and horrible in B.

    Loading 100 customers in a gui which can drill down to orders, order rows and products is very inefficient with spans, because you have to load a lot of data you probably will never use. (This is the area where pessimistic locking falls flat on its face too: it locks a lot of rows which are probably never used.) Load on demand, however, can be very efficient then.

    In a remoting scenario it is the other way around. Loading a graph into a root object and passing that root object back to the caller via remoting is far more efficient than a chatty application which loads data on demand.

    Andres: pessimistic locking and optimistic locking won’t help you in any situation. They give you a false sense of safety which isn’t there. First, you can always work around pessimistic locking, and second, both optimistic and pessimistic locking/concurrency will cause loss of work. It’s the REASON why 2 or more processes are altering the same data which needs fixing, not the RESULT.

  10. Matt Warren says:

    I would not rule out late fetching of data in my scenario; I would just make it explicit, as opposed to automatic as a side effect of accessing the collection. A demand-loaded collection may already be loaded, so a user of the object has no way of knowing how recent the data is. For the GUI example, I’d have the query extract just the customers. Then, when the UI wishes to drill down, an additional query is fired to get the data for that customer. The behavior is built into the app, and not into the object model.
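
    Something like this, reusing the plain Customer and Order shapes from the post (the IDb query helper is invented for illustration, not a real ObjectSpaces API):

    ```csharp
    using System.Collections.Generic;

    // Hypothetical query helper, just to show the shape of the calls.
    public interface IDb
    {
        List<T> Query<T>(string sql, params object[] args);
    }

    public class CustomerScreen
    {
        private readonly IDb db;
        public CustomerScreen(IDb db) { this.db = db; }

        // First query: just the customers; no orders come along for the ride.
        public List<Customer> LoadCustomers()
        {
            return db.Query<Customer>("SELECT Id, Name FROM Customer");
        }

        // When the user drills down, the app explicitly queries again, so it
        // knows exactly how fresh the orders are: as of right now.
        public void OnExpandCustomer(Customer c)
        {
            c.Orders = db.Query<Order>(
                "SELECT Id, Total FROM [Order] WHERE CustomerId = @id", c.Id);
        }
    }
    ```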

  11. Matt Warren says:

    Andres,

    I suspect that since the server has to validate upon update anyway, there’s not much use in having the client lock all the data remotely and check the constraints before submitting.

  12. Jason Mauss says:

    Would comparisons on a timestamp field help the situation any?

    An interval at which the timestamp on the server is compared with the timestamp in the client’s in-memory data, prompting the user to re-get newer data,

    OR

    A comparison of the client and server timestamp values before a persist-type operation is performed?

  13. Frans Bouma says:

    "For the GUI example, I’d have the query extract just customers. Then when the UI wishes to drill down, an additional query is fired to get the data for that customer. The behavior is built into the app, and not into the object model."

    But isn’t that load-on-demand? Looking at your conclusions, I see spans are good and load on demand is bad. What I wanted to say was that that conclusion is too black/white.

  14. Paul Gielens says:

    I hope I’m still on time to participate,

    Matt, in a rich domain model you cannot, and I repeat cannot, ignore load-on-demand (pronounced "lazy loading", blame MS). Demand loading is nothing more than a technical solution to reduce the memory footprint. You could stack up memory in your systems and ignore load-on-demand as a whole, but surely you agree it’s absurd to load a Customer, CustomerDetail, CustomerOrder(s), CustomerOrderDetail(s), Product(s), ProductDetail(s), etc. just to change a customer’s address (imagine you actually made the effort to call customer service to say you have moved to a different state). Can you imagine the service employee saying, "please hold on one moment Matt, I get this weird notification of ‘low on virtual memory’"? Wtf, ObjectSpaces?

    Btw: Jimmy, Frans is tracking persistence-related posts, right? 😀

  15. I have to second this.

    Full load is idiotic. It is arrogant.

    When I write an object, I cannot make assumptions about how the object will be used later.

    Order? Auto load order details. Great – all I want to show is a list of the largest orders we ever had, without details – thanks for loading the garbage.

    Order? Auto load order details, articles? All I want is a list of orders that still are unpaid – I am not interested in the articles.

    This means on demand. Only on demand is flexible enough to handle the object being loaded as it should be. When it is needed.

    Now, performance for this is interesting (ask me – the EntityBroker had no support for spans until a week ago; we call them "Prefetch"). This is what counters this. Here the UI or user of the object – which is the ONLY thing that knows what it will do with the objects and knows what other objects it needs – can define what other objects to load. Nice, isn’t it? The only solution. Possibly, naturally, with a good cache for static objects in addition.

    But this is about as good as it can get. And then, the programmer also needs to know what he is doing. A good O/R mapper can solve the paradigm issues, but concurrency and outdated data are something he has to think about.

  16. Thomas,

    Matt is not saying that you should always load the whole graph; he’s saying that you should explicitly tell the O/R mapper what to load up front. If you need order headers, ask for them. If you need headers and lines, ask for them.

    Frans,

    Denying the existence of a problem does not make it go away ;). If your application cannot deal with concurrency issues then it does not work.

    Imagine amazon.com has to decrease the inventory of a product each time you confirm an order. How would it implement functionality locking for that? Should it not allow two people to add the same product to their carts?

    If you cannot have a lock on data before checking constraints, then your application does not work.

    Also, when you do span load, you know you are retrieving a set of data that was consistent at some moment of time. If you do delayed loading you don’t.

    Say I have a forum web app with users and posts. I load a user, which has a field with the number of posts. Then I delay-load the posts. The count can differ from the number of posts actually loaded, so I get data that is not consistent, and that could lead my code to wrong decisions. If I load the user and the posts at the same time, they are consistent.

  17. Christoffer Skjoldborg says:

    The fundamental problem with lazy loading, IMHO, is that we would have to allow the read transactions to be stretched into very long, potentially never-ending ones to ensure data consistency.

    Since this is obviously problematic, it is often tempting (and practically feasible in lots of scenarios, I think) to allow object graphs to be constructed from several read transactions.

    An alternative that I’ve used (in a framework co-developed with Jimmy Nilsson) is to version the entities. The version number is incremented on any change to the entity (be that a change to a reference field or a value field). When an entity is fetched in a lazy-loaded state, its version number is fetched as well. When an expand happens on an entity, the version number of the existing instance is compared to the version number of the newly fetched one. If they don’t match, the existing entity is refreshed/updated with the new state. This can get to be quite a complex process and is in our case handled by a Workspace (similar to an ObjectSpace). Obviously this scheme is not generally applicable, as it requires quite extensive versioning on relationships/references as well as value-field changes.
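
    In rough code, the expand-time check looks something like this (all names are invented to illustrate the scheme, not our actual API):

    ```csharp
    // Invented shapes, just to show the version comparison on expand.
    public class Versioned
    {
        public int Id;
        public int Version;     // incremented on any change to the entity
        public object State;    // stand-in for the entity's real fields
    }

    public class Workspace
    {
        // Placeholder for the real database fetch of current state + version.
        public Versioned Fetch(int id) { return null; }

        // Expand: compare version numbers and refresh the stale instance first.
        public void Expand(Versioned existing)
        {
            Versioned fresh = Fetch(existing.Id);
            if (fresh.Version != existing.Version)
            {
                existing.State = fresh.State;     // refresh with the new state
                existing.Version = fresh.Version;
            }
            // ...then attach the newly fetched related objects to 'existing'...
        }
    }
    ```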

    /Chris

  18. Matt Warren says:

    Frans, (way back up the list)

    I think we are agreeing. I never meant to exclude the ability to fetch data later, just that it should not be automatic. Certainly, you can write code that assembles your in-memory graph using as many queries as you would like. But you at least know you did this and are willing to deal with the possibility of inconsistencies.

    Everyone else, thanks for joining in. I’ll respond to more later.

  19. Frans Bouma says:

    Andres: my point was that any low-level concurrency method always causes loss of work and always results in a horrible way of ‘correcting the error’. After all, the error is detected in the low levels of the application, far from where you could correct it the way the user expects.

    "If you cannot have a lock on data before checking constraints, then your application does not work. "

    Data locking from outside the database can be pretty bad for an application in general (but sometimes unavoidable, admitted). If a web user locks a row with an action on a website, how long should the lock hold if the user’s modem suddenly drops the carrier? 🙂 I think that’s the point of Matt’s article: the data is outside the db, it’s therefore stale, and you can’t assume it’s not.

  20. Frans,

    That’s exactly my point. To apply the constraint you need to be in the server. Before committing anything you should read all the data that matters with a lock, check the constraints, and commit. If you are using delay loading, you need to make sure you load that data (i.e., the Customer credit limit) _again_ in the context of a transaction.

    Of course, I’m not saying that the web application should hold a lock on the server.

    I agree that most of the ways to correct a concurrency error are bad, but think of how merge replication conflicts are solved. You can have rules to apply when a conflict happens, and that way you get a good automatic resolution in some cases. Anyway, the problem does exist and in some cases you cannot avoid it. Of course I’d like to be able to avoid it, but you cannot. Do you see any way to solve the amazon.com example using functionality locking?

    Regards,

    Andres.

  21. Jimmy Nilsson says "I see Lazy Load as being something technical, and it is not important for the Domain…