Initial Forays into Transactional Memory Programming Models

 

Welcome again to the TM blog! This time your host is Yossi Levanoni. I’m the dev lead on the TM project.

It was great to see the high-quality questions we’ve gotten in response to Dana’s previous (and our altogether first) post on the blog and I’m looking forward to many more interesting discussions. As Dana said, TM is not at a product development stage yet; it is currently an incubation and as such our dialogue with you—dear reader—is going to be less prescriptive (e.g., “this is how you use API XYZ”) and more a two-way conversation (e.g., “what kind of TM would be useful for you?” What value do you place on feature X?”) We will also discuss the semantics and implementation techniques of the TM systems we’re incubating (but remember that we are not an open source shop!).

I would like to devote this post to a discussion on TM Programming Models. We would try to see what kind of questions are pertinent in such a discussion by examining a few choices (and alternatives) used by Tim Harris from MSR and his colleagues working on the Bartok STM. A little bit of background: Tim and colleagues have been researching the topic and have published some seminal papers on TM. Some of Tim’s research on TM has been conducted before joining Microsoft. Then, Tim and Simon Peyton-Jones and others have implemented TM for the Haskell programming language. Finally this work culminated in an implementation of TM for the Bartok CLR which is described in this paper. Our group (Parallel Computing Platform) continues to work closely with the MSR folks including Tim, Simon and Martin Abadi, an expert on memory models their formal definition and verification.

Putting the first thing first, we have to define our scope of investigation: What is TM? Is it just an implementation technique for achieving thread-safety? Is it a class of programming models that is independent of particular implementation strategies? Maybe it’s both? What about hardware transactions, are they part of the game? Lock elision? What about concurrency-safety annotations? These things are obviously so interrelated from an end-result perspective that we will discuss them all in this blog, in the same way that Microsoft considers them all “fair game” in order to improve the mechanisms developers have in their toolbox to manage shared state. There are going to be two unifying themes that will encompass most discussions though:

1. Automating thread-safety concerns (what we want to achieve).

2. Executing code speculatively or tentatively (what the technology does at its core).

 

While this post confines itself to the STM model of Bartok we will definitely consider the broader picture on this blog, both in terms of programming models, and in terms of implementation.

The Bartok STM Programming Model

Without further ado let’s start considering the Bartok STM programming model. The central idea of this excellent work is that you can write:

atomic { <statements> }

The <statements> within the atomic block would be totally isolated from the effects of other atomic blocks executing concurrently. Great, what does that actually mean and where are the gotchas? Here are a few:

User-Initiated Abort

In the database world, both the programmer and the system have a way to specify that the current work being done against the database (the transaction) should be cancelled and its effects rolled back. The user has the ability to instruct the system to abort the transaction and the system has the prerogative to abort the transaction from under the user. What are the equivalents of these options when automatic isolation for shared memory is considered? Let’s first consider user-initiated abort.

In the Bartok TM programming model the user does have the option to abort a transaction. This is done by letting an exception escape the boundary of the atomic block. As commentators to the previous post have astutely noticed, this feature has some radical implications, not just for concurrency but for single-threaded execution as well. This feature indeed changes the semantics of single-threaded execution in the sense that things will get rolled back (automagically) if the transaction is aborted. This is a very powerful feature that holds a promise to drastically improve the reliability of code under error conditions.

Of course, nothing is for free and providing user-initiated abort comes with a price, both in terms of performance and implementation complexity but also in terms of the complexity of the programming model (which is the focus of this post). Consider this piece of pseudo-code:

            atomic {

           FormatHardDrive ();

     }

FormatHardDrive cannot be rolled back once executed, but we have given the user the ability to abort the transaction, which we can’t retract. One option is to simply forbid operations like “FormatHardDrive” inside an atomic block; if the programmer tries to do something like that then the program will either fail to compile or raise an exception at runtime. In this particular case suppose that the implementation does allow any kind of operation inside an atomic block and the user wrote the code as:

 

      atomic {

            FormatHardDrive ();

            throw new Exception();

      }

Now we’re really trying to have our cake and eat it, too! FormatHardDrive cannot be rolled back, so presumably the user has the intent of not rolling back ever after executing this operation. On the other hand, the user also deterministically asked for the transaction to roll back. This contradiction between operations that signal to the system that the transaction shouldn’t roll back and explicit user-initiated abort could be solved only in so many ways, each leading to an interesting programming model in its own right. We will explore these options in future blog posts but as a quick tip of the hat I would mention the work that our Intel colleagues have been doing on irrevocable actions. (Digression: we have a little bit of an issue with the term “irrevocable transactions” because the outcome of a transaction, in a traditional transaction system, cannot be decided while the transaction is executing. Anyway, the concept is clear and useful: this thing, which is not a transaction in the usual sense, becomes irrevocable.)

 

System-Initiated Abort

As a somewhat related but separate issue we have the question of whether it is desirable to have a programming model where system “problems” (e.g., inability to commit due to contention) are exposed to the user. If we think of atomic blocks as a sort of lock replacement, then one analogy to the lock world would be the Monitor.TryEnter API. This API allows the user to take action if a lock cannot be acquired within a given deadline.

 

In the Bartok STM programming model there really isn’t a way to “try” executing a transaction---control flow resumes at the point following the atomic block only after the transaction has successfully completed. This completion event is always consistent, no matter whether the transaction has committed or aborted---both events are consistent, which means that all completed transactions in the system can be ordered serially with respect to each other. The research community uses the term serializability for this property.

 

“Purists” would typically object to API’s such as Monitor.TryEnter or TryAtomic on the grounds that it promotes user-level backoff and retry, which should be better left to the system. The “realists” would argue that sometimes there are other things to do, if the lock cannot be taken. The “purists” would argue back then that the application should express the availability of work using scheduling constraints rather than synchronization constraints. This argument also gives rise to thoughts about higher level constructs, such as a construct that lets the system choose between different alternative transactions. This way if the system experiences contention executing one alternative it can choose another alternative and try it out.

As with other considerations, we seek successful precedents in two domains to guide us. These domains are:

1. The lock-based programming domain

2. The database transactions domain.

We have discussed the former, what does the latter has to say on this? In the database world there is the assumption that since the system is distributed, something can always go awry and therefore the user must always be ready to deal with failures. This is a big burden for the application developer that is many times dealt with by simply turning around and presenting the error to the end user. One has to wonder whether the same considerations apply in shared memory programming, where we have the working assumption that storage (caches and RAM) and communication paths (data flow through the system) are much more reliable.

 

Different Forms of “abort” and “commit” Statements

In the Bartok model the only way to commit a transaction is by reaching the right curly brace of an atomic block. People often ask whether it makes sense to offer a “commit” statement that commits the work done so far and continue from that point in a separate transaction. The analogy to monitor based programming is the ability to briefly release and reacquire a lock somewhere (dynamically) inside a lock (<obj>) {} statement. This is obviously a dangerous pattern and a source of bugs, but it is also essential for coding coordination patterns with Monitor.Wait and Monitor.PulseAll (the reader is reminded that when Wait is called, the target monitor lock is released and then reacquired after the monitor is pulsed by another thread).

 

Similarly, when considering an explicit “abort”, in the Bartok model a transaction may be aborted only by letting an exception escape beyond the right curly brace of an atomic block. Again we could ask whether we should allow explicit statements to abort a transaction either in addition or instead of abort-by-exception. It is useful to consider the implications of decisions here to call-based composition of code. The Bartok model offers a remarkably strong proposition in terms of composability: a caller is always in control of the work a callee is doing and its inclusion in the outcome of the transaction:

1. The caller can always discard the changes done by the callee by wrapping the callee in a nested transaction and aborting it if necessary.

2. The caller is also in full control of exceptional situations. It can always catch exceptions from the callee and if it knows how to handle them, it can decide to still commit.

 

The following code snippet illustrates:

 

void Caller() {

    atomic {

        try {

            Callee();

        }

        catch (MyException e) {

            // Do something alternative here

        }

        If (IDontLikeTheStateOfTheWorldNow()) {

            throw new VetoException()

        }

    }

}

The caller sometimes accepts and sometimes rejects success of the callee. The caller sometimes accepts and sometimes rejects failure of the callee. The caller is in control. This allows a great deal of call-based composition.

 

Coordination Constructs.

The atomic block as presented thus far provides isolation and atomicity but it doesn’t offer much help with coordination of concurrent activities. i.e., how would you implement a blocking queue using transactions? Harris et al promote the retry statement as the central way to coordinate threads involved in transactions. The semantics are dead-simple: if you hit a retry statement then the transaction rolls back and re-executes. e.g., consider this example:

 

public class SingleCellQueue<T> : where T : class {

      T m_item;

 

      public void T Get() {

            atomic {

                  T temp = m_item;

                  if (temp == null) retry;

                  m_item = null;

                  return temp;

            }

      }

 

      public void T Put(T item) {

            atomic {

                  if (temp != null) retry;

                  m_item = item;

            }

      }

}

Suppose a thread calls Get() and m_item is null. In this case the retry statement would be encountered, the transaction will roll-back and re-execute. At this point I can imagine visible unease in the crowd: is it just going to spin and re-execute, eating up my CPU and killing my battery? Well, hopefully not---an implementation can be infinitely dumb or infinitely smart about how to implement retry. But still, let’s recognize that the developer doesn’t tell the system when would be a good time to pulse waiters, unlike condition variables and monitors. Harris et al also explain in their paper on composable transactions how retry and orElse can be combined together to achieve composability of coordination, which is not possible with explicit condition variables. So on one hand this is a simplification for the developer, but on the other hand it’s costly it for the system to implement.

 

Another property of retry is that it rolls back anything done so far in the transaction, so it is not possible to externalize to the outside world why you’re waiting and what it is that you wish be done by parallel activities, which is a pattern common with condition variables.

 

What are some other approaches to coordination? One possibility is what I would term “abstinence”. According to a majority of people I have sampled blocking is to be discouraged and replaced with various dependency/constrained scheduling schemes for tasks (e.g. the .NET asynchronous programming model). The core question is whether you own the thread or not. If you don’t own the thread, then blocking should be avoided. If you do own the thread, then you need to ask yourself “why am I owning the thread instead of using cooperative task scheduling?” In other words, cooperative task scheduling, with the ability to describe and enforce task dependencies is the “wave of the future”. So maybe this means that providing a blocking construct for atomic blocks is a second-level concern and maybe as such, a solution which is least impactful for the system should be preferred, at least initially.

 

Another option is to define transactional counterparts to condition variables that you explicitly wait and pulse. Condition variables have the semantics of “committing” when you call Wait and thus they bring the question of unconditional commit to the forefront again.

 

Conclusion

In this post we have started skimming the surface of possible programming models for a managed language TM system. There are many other considerations that we will cover in the future. My goal was to provide a brief “insider” look at the factors that we are looking into as a preparation to more concrete questions about the specific choices we are making. Please feel free to follow-up with questions or comments on any of the above topics or others.