ObjectSpaces: Spanning the Matrix

I guess this is turning into a series of ObjectSpaces posts.  I did the first one a few weeks back covering the origin of OPath in “The Power of the Dot,” and followed that with “What’s the Big Idea.”  Now I have a few more in the cooker that you might find interesting.

Back when ObjectSpaces was still in swaddling clothes, barely a prototype teething on my hard disk, before it grew up, graduated, matriculated and finally started to dress for success, I was experimenting with the concepts underlying object query and manipulation.  We had not set out to build an object database, but Luca and I were finding that many of the same problems that plagued object database designs were infesting our plans as well.

One of the hardest things to get right with an object database, a database that allows you to persist your objects as your objects, complete with dynamic reference lookup (swizzling et al), is the ability to gauge how much data to pull down from the server at one time.  This doesn’t sound like such a big deal if you are familiar with a relational database.  With a relational database you specify exactly the amount of data you want with every query using a projection; that is unless you like running slipshod with your asterisk hanging out.  With an object query, you generally want to retrieve the whole object.  Well, at least that’s the paradigm you want to write your program against.  What you really want is for the underlying system to figure out just what data you are using and only go fetch that at runtime. 
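
To make the contrast concrete, here is a minimal sketch.  The SQL is ordinary; the object side uses a hypothetical GetCustomers helper, a stand-in for any object-query API rather than anything that shipped:

    // Relational: the projection says exactly what crosses the wire.
    //   SELECT Name, Phone FROM Customers WHERE City = 'Seattle'

    // Object paradigm: you ask for Customers and whole objects come back.
    // GetCustomers is hypothetical, standing in for any object-query API.
    foreach (Customer c in GetCustomers("City = 'Seattle'"))
    {
        Console.WriteLine(c.Name + ": " + c.Phone);
        // Only two fields are used, but the query layer couldn't know
        // that up front, so by default it materialized the whole object.
    }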

Of course, this turns out to be an unbelievably hard thing to do.  Some of the bits of data on the object are cheap to fetch and access; other data is more costly.  The typical example is the large binary photograph stored in your account database.  You don’t want to keep pulling that one down every time you go after the current balances.  Yet sometimes you need that data and sometimes you don’t.  If you do need it and you are processing large batches of accounts on the client, you don’t want to fire off an additional query per account just to get that part.  Unfortunately, this is only the tip of the iceberg.
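
In code, the trap looks something like this; Account, GetAccounts, Notify and Render are hypothetical stand-ins, with lazy loading assumed:

    // Each Account row carries a multi-megabyte Photo blob.
    foreach (Account a in GetAccounts("Balance < 0"))   // one query
    {
        Notify(a.Owner, a.Balance);    // cheap fields, already in memory
        Render(a.Photo);               // lazy blob: fires one more query
    }
    // N accounts in the batch means N + 1 round trips to the server.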

Remember, with objects your data is defined in a network, or graph.  With a relational database, when you wanted to correlate information across multiple flat tables you wrote a query that introduced an explicit join operation.  You told it what to access and what to bring back, even if that was an explosive Cartesian product.  It was what you asked for, so you knew what to expect.  But with a graph, all that’s changed.  Now you have these convenient little properties on your objects that let you navigate around your object space, pulling in bits here and there without so much as an if-you-please, because it’s just too darn easy.  Heck, it’s a party.  Have a drink, here’s a swizzle stick.  Sure, all the data is here in memory.  It’s been here all along.  Have you met it?  The whole graph, I mean.  Don’t believe me?  Let me introduce you.  Just start following those dots.  You’ve got Customers?  Give ’em a dot.  Now you’ve got Orders, and if you peek just a little further you’ll see that you’ve also got Order-Lines and Products and Shippers and Warehouses and dang, is this the whole database or what?
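
Here is that party transcribed; every innocent dot below can fault in another table’s worth of data (the Northwind-flavored names are purely illustrative):

    Customer cust = GetCustomer("ALFKI");        // one customer, you'd think
    foreach (Order o in cust.Orders)             // ...now you've got Orders
        foreach (OrderLine line in o.Lines)      // ...and Order-Lines
            Console.WriteLine(
                line.Product.Warehouse.Name);    // ...Products, Warehouses...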

So you see, the problem extends well beyond what you might have first guessed was the data in question.  When you use the object paradigm you conveniently start to forget there’s actually a database underlying all this stuff, and that you are still sucking down bits through a straw from somewhere in the far beyond.  When you start off writing queries against Customers, you could easily end up manipulating Orders and Addresses and whatnot.  Or you could not.  Maybe you’re only interested in pulling up a phone number.  How do you go about making sure the system is not pulling down thirty gigs of data when all you wanted to do was order a pizza?

It’s easy, you might say; there are many solutions to this problem.  True enough.  We’ve looked at most of them.  Some are practical, and some are highly theoretical and dicey.  They basically focus on solving the problem in one of three ways.

1) All related items are fetched on demand.  The system is optimized for the common scenario of accessing only closely related primitive values.

2) An administrator hand-optimizes the definition, declaration or mapping of the object structures, identifying which bits of data naturally group with other bits of data.  They take into account the likely access patterns and optimize for those.  Grouped items pre-fetch together.  Related items are pulled in on demand via additional queries.  (A sketch of what such a mapping might look like follows this list.)

3) The object database system is designed to optimize itself over time, adjusting the pre-fetch/demand-load relationships between objects.  If most of your apps access Customers without Orders, Orders will always be loaded on demand via additional queries.  If they are generally accessed together, Orders will always be queried for and brought back at the same time Customer data is retrieved.
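
Option #2 in practice usually means annotating a mapping file.  Here is a hypothetical fragment in that spirit; the element and attribute names are invented for illustration and are not the real ObjectSpaces mapping schema:

    <!-- Hypothetical mapping; all names here are invented. -->
    <Type Name="Account" Table="Accounts">
      <Group Name="default">                 <!-- pre-fetched together -->
        <Field Name="Owner" />
        <Field Name="Balance" />
      </Group>
      <Group Name="heavy" Load="OnDemand">   <!-- extra query when touched -->
        <Field Name="Photo" />
      </Group>
      <Relationship Name="Orders" Load="OnDemand" />
    </Type>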

The first option is basically an admission that the technology is a novelty and not intended for rigorous use.  The third option is highly suspect and specious.  While it is certainly possible to build a system that does in fact monitor its usage and tune itself, it is utter foolishness to think that this would suffice for any well-used database.  Real databases host myriad applications with significantly different patterns of use.  An autonomic system will optimize for the middle ground, and will never behave optimally for any particular case.

That leaves you with option #2, which is the most popular method in use today.  Yet it generally has the same problem as option #3, except that a human brain has chosen to optimize in favor of one of the two extremes.  It’s not that human brains are fallible.  It’s just an impossible problem to get right.  You see, any system that relies on static assessment of optimization parameters is bound to be wrong half the time, and any system that relies on dynamic assessment of optimization parameters across the gamut of real-world applications is bound to be wrong all of the time.

So there we were, Luca and I, stuck with the same problem that every O.P. system to date has had to face.  We knew we would have to pick one of the options and just go with it.  Luca hoped we wouldn’t end up with the first option, and was willing to go with the second.  I hated all three.  I loathed them, because I too am a programmer, and I too know what customers will face when trying to build their applications.  Handing over the power to do your job well to an administrator who may know nothing about your needs seemed like just a bad idea.  It’s the application programmers who know what the app is about to do.  They know what data they need, just like they did when they wrote apps against relational databases.  What we needed to do was put the power back in the hands of the people.  If SQL can do it with projections of rows, then by-gum, we certainly were going to figure out a way to do projections of objects.  And that meant projecting everything, even the network, the graph, the entire matrix.

That’s when I hit upon the solution.  Like the dot, it was the graph, stupid.  Object queries needed that additional bit of information that would allow the user to specify exactly which reachable parts of the network should be pre-fetched together.  So I took the OPath parser and added an additional context that would allow the specification of a list of properties, sub-properties and so on, forming a tree of access paths.  Anything lying along these paths would be pre-fetched along with the query results.  With a simple list of dot expressions you could easily specify what part of the matrix you wanted to span.
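
In code the shape of it came out roughly like this.  Take the exact type names and signatures as approximate rather than gospel; the point is the third argument:

    // Illustrative; exact ObjectSpaces preview signatures may differ.
    ObjectSpace os = new ObjectSpace("map.xml", connection);

    // The third argument is the span: a list of dot-paths naming which
    // reachable parts of the graph to pre-fetch with the query results.
    ObjectQuery query = new ObjectQuery(
        typeof(Customer),
        "City = 'Seattle'",         // OPath predicate
        "Orders.Lines.Product");    // span: bring these along for the ride

    ObjectReader reader = os.GetObjectReader(query);
    while (reader.Read())
    {
        Customer c = (Customer)reader.Current;
        // Dotting into c.Orders and beyond now hits memory, not the
        // wire; the span pulled it all down in the initial round trip.
    }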

The best thing about it was that it was an opt-in solution.  It was a way of saying, “Hey, I’m going to be dotting through this stuff, just wanted you to know.”  If you did not specify the new span parameter, then you could still navigate through all the properties, reaching all your data.  If you did, then the server could put in the whole order at once, so all the data would come back hot and ready to eat.
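
The contrast, in the same illustrative notation:

    // No span: perfectly legal; everything past Customer loads lazily.
    new ObjectQuery(typeof(Customer), "City = 'Seattle'");

    // Span supplied: same program, but Orders come back hot with the
    // Customers, in the same round trip.
    new ObjectQuery(typeof(Customer), "City = 'Seattle'", "Orders");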

Luca was a bit dubious about the idea at first, but I wore him down.  Ever since then, ObjectSpaces has had the span parameter, and yes, the name derives from my own ultra-nerdiness.  It refers to the span of a space as used in linear algebra.  Crack a book if you don’t believe me.  I’m sure you have one in a box in your basement like I do.  Wife got rid of it years ago?  Mine tried too.  I tricked her by re-labeling the box “trinkets, odds and ends.”  That stuff she keeps.
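
For the record, the definition in question: the span of a set of vectors is everything you can reach from them by linear combination, which is exactly the flavor intended here, everything reachable from a set of starting dots:

    \operatorname{span}\{v_1, \dots, v_n\} = \left\{ \sum_{i=1}^{n} a_i v_i \;\middle|\; a_1, \dots, a_n \in \mathbb{R} \right\}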

But I digress.

Matt

Comments (14)

  1. How about some source code examples, please??? 🙂 🙂 🙂

  2. Jason Mauss says:

    Matt – interesting post. What came to my mind, before getting to the solution you mentioned (which I still don’t quite understand but am trying to...) is just adding a "weight" attribute to certain elements in the XML files (the mapping files or whatever they’re called). This lets the developer (or "admin" for the mapping files) define whether an object’s property or sub-object properties (like an orders collection of order objects within a customer object) should be retrieved from the database upon the "initial query" or upon the first attempt to access the property/collection/whatever.

    for example:

    <element weight="heavy"> the property or collection would be retrieved upon first access attempt whereas

    <element weight="light"> would be retrieved with the initial set of queries.

    Just a thought.

  3. You REALLY make it sound like you invented something.

    Selected prefetch paths have been part of any decent O/R mapper for many years now.

    @Jason Mauss:

    VERY bad idea. Because this means that you predefine prefetches with the mapping information. Now, the bad thing here is that this is making assumptions about their use. Only when I ask for the objects in a process will I actually know what I want to do with them.

    Example:

    Invoice – InvoiceItems. Preload or not?

    * I want to show an invoice on the screen: sure, I preload.

    * I want to show a list of invoices for a customer. No preload.

    One mapping. One application. Two uses.

  4. Frans Bouma says:

    I second that, Thomas.

    Load on demand most of the time works about as well as you can get: the user navigates to an invoice and at that moment the invoice items are loaded, because they are displayed on the screen, for example. Elegant solution and no performance hit whatsoever: the invoice items data isn’t loaded up front, and when required you only load the items you need at the moment you need them.

    What some people try to achieve are lists. Say you want all orders for a given period and with those orders the customer name and the contact person. How to do that fast using an O/R mapper? You can only make that fast with a pre-load scenario, which will take 2 queries: one for the orders and one for the customer.

    You can also create a ‘view’ based on the order entity and the customer entity, add only the fields you want to show up in the list and read the data in 1 query. We call it a Typed List.

    The advantages of a Typed List are not only the single query, but also the fact that you don’t have to deal with the dilemma Thomas is talking about: you always fetch the data in 1 go.

    I have yet to see an example where the span solution is more efficient than load-on-demand. Don’t forget, we’re talking efficiency here. So things like:

    int id = myCustomer.Orders[index].OrderItems[index2].SomeId;

    are not efficient. In that case, better set up a query to pull the OrderItems you want using a filter based on an orderid set.

    Another thing is how the preload is done. 1 query per parent-child relation? Or a single query with a big join? The latter is slow in processing, as you have to weed through a lot of data in some cases, and you’re pulling a lot of data over the wire which is not necessary. The former can be efficient, but requires either subqueries or an IN clause with specified values. Neither one of them is particularly fast, and you have to post-process the data read: you can’t simply put the read data into a bucket and be happy; the read data belongs to multiple entities. You can of course sort on an ID and loop through it, but what if the data has to be sorted on a different field? This makes postprocessing particularly hard to do, or at least slow.

    All these things have to be taken into account. Is it then still faster than load on demand? Only in situations where a list is appropriate to mimic the flexible tuple creation power of the relational model, but even then a list beats this span feature hands down.

  5. Matt says:

    In fact, in our survey of OP solutions nearly five years ago, we did not find anything comparable to the span. That means the customers of these products that we talked to did not know about it either. That doesn’t mean it wasn’t out there, just that we didn’t dig it up. So in a way, I did invent it, even if only to re-invent what had already been done. I still had that aha moment, and it still makes for a good story.

  6. I personally think that ANY solution that maps tables and relationships in a relational database to objects and collections in an object-oriented environment must allow the user (programmer) to manually configure the data-population mechanism.

    Sometimes we want to draw a tree on the screen; when the user clicks the "+" button we simply call the "father.sons" method, which executes the corresponding query on demand and returns the list of sons.

    Sometimes we don’t; we want to show the user a flat list of "thing.name + thing.owner.name + thing.owner.address". In this case it would be preferable to JOIN tables.

  7. David Goldstein says:

    Yeah, these choices are painful.

    My thinking so far (which is starting to look smaller the more of your blog I read) for the ideal is like this:

    – provide specialized collection datatypes (all the better if you can use generics for type-specialization) that demand-load their items when accessed

    – this specialization is, as you point out, important for blob/clob (large data) types

    – you can allow the developer to declare arrays or maybe collections instead, in which case the pattern is almost certainly preload

    – for fine-grained control, your mapping layer could allow the declaration of "property groups" which can then be explicitly excluded (or included?) using a more advanced function call or query-context setting

  8. You can have the same effect, just easier and hassle-free. Here is how (I call it "virtual objects")

    If you want to know more, send me an email.

    Kind Regards

    Martin Roesch