Bill de hOra and my old friend (from our Amazon days) Mike Dierken commented on my use of SOA versus “distributed systems”. There was also an interest in my perspective on the CAP Conjecture. Let me spew forth some thoughts…
It may be a bit unusual, but my way of thinking of “distributed systems” was the 30+ year (and still continuing) effort to make many systems look like one. Distributed transactions, quorum algorithms, RPC, synchronous request-response, tightly-coupled schema, and similar efforts all try to mask the existence of independence from the application developer and from the user. In other words, make it look to the application like many systems are one system. While I have invested a significant portion of my career working in this effort, I have repented and believe that we are evolving away from this approach.
I wrote a paper for CIDR 2005 called “Data on the Outside versus Data on the Inside”. In that paper, I explore the difference in the semantics of data when it is unlocked. Inside of a database, the meaning of the data can be interpreted with a clear and crisp sense of “now” provided by the current transaction. Nothing moves when you are in a transaction unless the currently running application that began the transaction changes the data. There is a strong sense of stillness and of now. Inside data is very much what we have historically programmed to.
In “distributed computing” (in my unusual usage… not in the commonly accepted vernacular), we are trying to extend this notion across multiple machines.
In SOA (again, how I think of it), we are acknowledging the existence of independent machines. This affects the transactional scope (we end up with different chunks of data which cannot be updated by the same transaction) and we end up with independently evolving schema and operations for the different systems. This is a seminally different concept than distributed systems (at least in the way I think of them).
If you try to impose a global ordering to the transactions across a LOT of systems, it is very much like the way Newton thought of the Universe with time marching forward uniformly everywhere. This is why I say that “distributed systems” are like Newton’s Universe.
Now, let’s consider SOA with independent scopes of serializability (i.e. the collection of computers has some different groups of data which are independent in their transactions… you cannot do a transaction across these different chunks of data). These are broken into independent systems (typically independent applications) which encapsulate their own data and communicate via messaging. When System-A sends a message to System-B, the data contained in the message is unlocked before the message is sent. That means that the data is a historic artifact. System-B can only see what some of System-A’s data used to look like. This is an essential aspect of these independent systems which do not share transactions. Because of this, I think of each system as living in its own temporal domain. It is aware of its internal state and is aware of a subset of its partners’ older state. This is just like looking into the night sky and seeing light from neighboring stars emitted years earlier. Each of these systems lives in its own time and has an independent view of time. To me this is like Einstein’s Universe where time marches forward based on the perspective of the viewer.
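The stale-light point can be made concrete with a toy sketch (the class and field names here are my invention, just to illustrate the idea): the instant a snapshot of data is placed into a message, it is unlocked, so the receiver can only ever see history.

```python
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    """A snapshot of some of the sender's data, unlocked before sending."""
    field: str
    value: int
    as_of: float  # the sender's clock when the snapshot was taken

class System:
    def __init__(self, name):
        self.name = name
        self.state = {}

    def send_snapshot(self, field):
        # The data is unlocked the moment the message is built;
        # nothing stops the sender from changing it right afterwards.
        return Message(field, self.state[field], time.time())

a = System("System-A")
a.state["inventory"] = 100
msg = a.send_snapshot("inventory")   # System-B will receive this...
a.state["inventory"] = 97            # ...but System-A has already moved on.

# B can only reason about what A's data *used to* look like.
assert msg.value == 100 and a.state["inventory"] == 97
```

The message is like starlight: by the time it arrives, the source may no longer look anything like it.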
So, the move from distributed systems (one transactional scope –> one notion of time) to SOA (independent transactional scopes –> time based on the perspective of the user) is like moving from Newton’s Universe to Einstein’s Universe.
-----> Now, let’s rant for a while about the CAP Conjecture…
First, let’s summarize what Eric Brewer said in 2000 in an invited keynote at the Principles of Distributed Computing (PODC) conference. While I am not super familiar with all the literature on this, I believe I understand it well enough to spew forth. Eric offered a conjecture that if you consider CAP – Consistency, Availability, and Partition-tolerance – it is impossible to achieve all three. I totally believe in this conjecture but want to offer some twists in how to think about it.
First, I noticed a long time ago that there is an intrinsic conflict between consistency and availability in the face of partitions. This has led me to increasingly dislike distributed transactions (see “Life Beyond Distributed Transactions: an Apostate’s Opinion”). The two-phase commit protocol (which I’ve spent big portions of my career working on) will ensure perfect consistency given infinite time. I say that because it will wait and wait and wait until the transaction is resolved and then provide perfect consistency. Of course, while partitioned and waiting, arbitrary swaths of the application’s database may be locked up, rendering the application unusable. For this reason, I’ve frequently referred to the two-phase commit protocol as the “Anti-Availability Protocol”. It is increasingly clear to me that this protocol is best used sparingly.
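A minimal toy model of the anti-availability behavior (this sketch and its names are mine, not any real 2PC implementation): a participant that has voted “yes” in phase one must hold its locks until the coordinator’s decision arrives, so a partition leaves that data unreadable.

```python
# Toy sketch of why two-phase commit sacrifices availability
# during a partition: prepared participants hold locks until
# the coordinator's decision reaches them.
class Participant:
    def __init__(self):
        self.locks = set()      # keys locked by in-doubt transactions
        self.prepared = set()   # transactions that voted "yes"

    def prepare(self, txn, keys):
        # Phase 1: vote yes and hold the locks until the coordinator decides.
        self.locks |= keys
        self.prepared.add(txn)
        return "yes"

    def decide(self, txn, keys, outcome):
        # Phase 2: commit or abort -- either way, only the coordinator's
        # decision releases the locks.
        self.prepared.discard(txn)
        self.locks -= keys

    def can_read(self, key):
        return key not in self.locks

p = Participant()
p.prepare("t1", {"acct-42"})
# The coordinator is now partitioned away. Given infinite time we get
# perfect consistency, but until the decision arrives...
assert not p.can_read("acct-42")     # ...this data is unavailable.

p.decide("t1", {"acct-42"}, "commit")  # partition heals, decision arrives
assert p.can_read("acct-42")
```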
What I think is interesting is how real-world applications modify the definition of Consistency and Availability to provide Partition-tolerance. Note that my observations about this do not invalidate the CAP conjecture (which I think is correct) but show how the pain is dramatically reduced by loosening some age-old assumptions about distributed systems.
Classic database/transaction approaches to Consistency choose to emphasize read-write semantics. To preserve Read-Write-Consistency, you lock the data. We’ve been at this for over 30 years. What I see happening in loosely-coupled systems is identical to how businesses operated 150-200 years ago when messages were sent with couriers running across the city between businesses. You allocated (i.e. reserved) the ability to perform an operation and then later on you would take the confirming step to ensure the completion of the work. Today, you make a reservation at a hotel and then later on you show up to complete the operation. What is the definition of consistency in this world? It is the successful remembering of the reservation and then keeping a room for you. The reserved room count is not locked waiting for you to decide if you want the reservation, waiting for you to cancel, nor waiting to see if you show up. The definition of consistency evolves to one that explicitly includes independence and loose-coupling.
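Here is a hedged sketch of that reservation-style consistency (the class and method names are mine): consistency means honoring the count of remembered promises, not locking a particular room while the guest decides. Note that what is promised is a member of a fungible pool, not a specific room number.

```python
# Reservation-style consistency: promise a member of a fungible
# category, never lock a specific item.
class RoomType:
    def __init__(self, total):
        self.total = total        # fungible rooms of this type
        self.reserved = {}        # confirmation number -> rooms held

    def reserve(self, conf_no, rooms=1):
        if sum(self.reserved.values()) + rooms > self.total:
            return False          # sold out: refuse, don't block anyone
        self.reserved[conf_no] = rooms
        return True               # the promise is remembered; nothing is locked

    def cancel(self, conf_no):
        self.reserved.pop(conf_no, None)

    def check_in(self, conf_no):
        # Showing up completes the operation and consumes the promise.
        return self.reserved.pop(conf_no, None) is not None

kings = RoomType(total=2)         # two king-sized non-smoking rooms
assert kings.reserve("A-1")
assert kings.reserve("A-2")
assert not kings.reserve("A-3")   # no more rooms left to promise
kings.cancel("A-2")
assert kings.reserve("A-3")       # a promise, not a room number, was freed
assert kings.check_in("A-1")
```

Between `reserve` and `check_in`, nothing is locked: other guests reserve, cancel, and check in concurrently, and consistency means only that every remembered promise can be honored.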
A really good paper on this concept is “Isolation Support for Service-based Applications”. In this paper, Paul Greenfield et al. argue that predicates are the best expression of the isolation required while performing long-running work that spans loosely-coupled (SOA) systems. For example, my work may allocate $200 from your bank account because it is supposed to be mine if the cooperative work we are doing commits. Your bank account balance does not get locked; just $200 of it is encumbered pending the outcome of our cooperative work.
This bends the definition of both Consistency and Availability. Usually, they are considered as providing Read-Write semantics and so they are implemented with locking. If you provide operation semantics, this just changes the form of locking and changes the form of consistency and isolation. I first saw this from Pat O’Neil in “The Escrow Transactional Method”, published in 1986. In that paper, Pat observes that the use of operations and operation-logging can increase concurrency.

At Tandem, we did this in the late 1980s to allow addition and subtraction to have special behavior to improve our TPC-B benchmark numbers. What we did was detect when a SQL operation was an addition or subtraction to a field of a record. We would log a “+30” or “-50” as the operation in the transaction’s log. We would also keep a worst-case lower-bound and a worst-case upper-bound for the field’s value based on its committed value combined with the pending (not yet committed nor aborted) transactions. We would ensure that the field remained within the accepted bounds. If you performed a “+30” and then later on aborted the transaction, the act of aborting the operation would subtract 30. In this way, all the transactions would correctly perform their operations but we could run MANY transactions concurrently against a very hot-spot value. Of course, an attempt to read the value would shut everything down, wait for the pending transactions to settle, show the underlying result, and then let the chaos resume. This was a very successful technique for coping with highly-concurrent commutative operations (in this case addition and subtraction) against a hot-spot value.
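A toy reconstruction of the escrow idea (my sketch, not the actual Tandem code): commutative +/- operations are logged as deltas, worst-case bounds are checked admission-time, aborting simply forgets the logged delta, and a read demands quiescence.

```python
# Escrow-style operation logging against a hot-spot field.
class EscrowField:
    def __init__(self, value, low=0, high=float("inf")):
        self.committed = value
        self.low, self.high = low, high
        self.pending = {}              # txn -> logged delta, e.g. +30 or -50

    def add(self, txn, delta):
        # Worst-case lower bound: every pending negative delta lands and
        # every positive one aborts; the upper bound is the mirror image.
        lo = self.committed + sum(d for d in self.pending.values() if d < 0)
        hi = self.committed + sum(d for d in self.pending.values() if d > 0)
        lo += min(delta, 0)
        hi += max(delta, 0)
        if lo < self.low or hi > self.high:
            raise ValueError("operation could violate the field's bounds")
        self.pending[txn] = self.pending.get(txn, 0) + delta

    def commit(self, txn):
        self.committed += self.pending.pop(txn, 0)

    def abort(self, txn):
        # Aborting a "+30" subtracts 30: just forget the logged operation.
        self.pending.pop(txn, None)

    def read(self):
        # A read shuts everything down: it must wait for quiescence.
        assert not self.pending, "reads must wait for pending transactions"
        return self.committed

bal = EscrowField(100, low=0)
bal.add("t1", +30)
bal.add("t2", -50)    # fine: the worst case is 100 - 50 = 50 >= 0
bal.abort("t1")       # the +30 simply never happens
bal.commit("t2")
assert bal.read() == 50
```

Many transactions can hold pending deltas at once because addition and subtraction commute; only a read of the exact value forces everything to settle.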
Nowadays, we see this same technique applied at a different granularity. If you move away from thinking about locking data at a distance and move towards ensuring the ability to perform a reserved operation, you think about this differently. This is why you can reserve a king-sized non-smoking room but not (typically) reserve room 301. What is being promised is a member of a fungible category of rooms. It is a different promise of consistency.
My recent post “Memories, Guesses, and Apologies” plays off of what I see happening in an increasing fashion. More and more, I see businesses being willing to loosen Consistency even more than what I was describing above. They are willing to occasionally give the wrong answer because it is more cost-effective.
In the presence of imperfect availability of knowledge, a business is forced to choose between closing down service (reducing availability of the service), over-booking, or over-provisioning. Indeed, if multiple systems (or humans) are extending commitments independently, they must choose between over-booking, over-provisioning, or some unknown balance between them. If I have 10,000 widgets to sell and 100 salespeople, I could allocate 100 to each salesperson and know that I have not over-booked if they go out and independently sell the widgets. To do this, though, I am almost certain to be carrying extra inventory for the salespeople that don’t sell all 100 of their widgets. Indeed, for most businesses, this is a ridiculously expensive proposition. So, an analysis is done on the statistics, a cost of over-booking is calculated, and allocations are given based on the expectations of selling the 10,000 widgets. The loosely-coupled algorithm explicitly allows for a probability of over-booking based on its cost-benefit. This is a relaxation of the definition of Consistency to cope with the realities of the CAP Conjecture.
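A back-of-the-envelope version of that trade-off (the numbers come from the example above; the 80% sell-through rate and the helper function are my own assumptions for illustration):

```python
# Over-provisioning vs. over-booking with 10,000 widgets and 100
# salespeople extending commitments independently.
def expected_sales(salespeople, allocation, sell_through):
    """Expected units sold if each salesperson sells a fraction
    `sell_through` of their independent allocation."""
    return salespeople * allocation * sell_through

stock = 10_000

# Conservative: allocate 100 each. Over-booking is impossible, but at
# an (assumed) 80% sell-through you expect 2,000 widgets to sit unsold.
conservative = expected_sales(100, 100, 0.8)
assert conservative == 8_000
assert stock - conservative == 2_000       # the cost of over-provisioning

# Aggressive: promise 12,000 widgets against 10,000 in stock. Expected
# demand of 9,600 fits, but a hot streak can over-book -- and the
# business prices that apology into the cost-benefit analysis.
aggressive = expected_sales(100, 120, 0.8)
assert aggressive == 9_600
assert 100 * 120 > stock   # commitments exceed stock: over-booking is possible
```

The real analysis would use the sales distribution, not just its mean, but the shape of the decision is the same: pick the allocation whose expected over-booking cost is cheaper than the idle inventory it avoids.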
It is essential to approach computing as a means to support the business, not as a religious fervor. I don’t think that it is “wrong” to relax consistency; I think it is important to understand the business trade-offs and apply the technology realities to support the business effectively.