Immutability, Purity, and Referential Transparency

How often do you write code that just works?  It seems to happen so rarely that I find myself suspicious if code does just work.  When was the last time that you wrote a non-trivial program that had no compile errors on the first try and then ran fine as well?  If it happened recently then congratulations.  If it hasn't then join the club.  Admittedly, most programmers don't just write a whole program and then submit it to a compiler and a test harness to see if it works.  Typically it is built incrementally: a function here, a class there, testing all the time.  Code-Compile-Test...rinse and repeat.

I recently read a post where the author describes why Haskell programs just seem to work.  It got me thinking about my own experience with Haskell.  One of the most interesting things about Haskell is that many programmers don't use a debugger (GHCi does have one).  The impulse to break out the debugger to better understand code seems to occur far less often than in other languages.

Can you imagine working in C# without a debugger?  So why is it that many Haskell users seem to get along without a debugger?

A Tale of Two Sorts

Consider writing a function that sorts an IEnumerable<T> using the quicksort algorithm.  Here is one possible implementation:

static IEnumerable<T> QuickSort<T>(this IEnumerable<T> sequence) where T : IComparable<T>
{
    var array = sequence.ToArray();
    QuickSort(array, 0, array.Length - 1);
    return array;
}

static void QuickSort<T>(this T[] array, int start, int end) where T : IComparable<T>
{
    if (end - start < 1)
        return;
    var pivot = array[start];
    var j = start;
    // Partition: move everything less than the pivot to the front of the range.
    for (var i = start + 1; i <= end; ++i)
    {
        if (array[i].CompareTo(pivot) < 0)
        {
            ++j;
            var temp = array[j];
            array[j] = array[i];
            array[i] = temp;
        }
    }
    // Put the pivot into its final position, then sort the two halves.
    array[start] = array[j];
    array[j] = pivot;
    QuickSort(array, start, j - 1);
    QuickSort(array, j + 1, end);
}
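As a quick sanity check, here is a hypothetical way to call it (the class and method names below are just illustrative; the QuickSort methods are assumed to live in a static class, since extension methods require one):

using System;

static class Program
{
    static void Main()
    {
        var input = new[] { 4, 3, 6, 9, 1, 0, 2, 7, 5, 8 };
        // Prints: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
        Console.WriteLine(string.Join(", ", input.QuickSort()));
    }
}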

Now imagine that ++j was in the wrong place (at the end of the block).  The output of sorting

4, 3, 6, 9, 1, 0, 2, 7, 5, 8

would be

4, 4, 4, 4, 4, 9, 6, 7, 7, 8

Clearly this isn't right.  So after a quick scan of the code, most programmers would attach a debugger to see what is going on.

The programmer might then step through the code until some state transition ends in a corrupt state.  Hopefully the programmer understands the intent of the code so that a fix can be issued to bring the implementation in line with the intent.

Contrast the previous definition of quicksort with the following definition:

static IEnumerable<T> QuickSort<T>(this IEnumerable<T> sequence) where T : IComparable<T>
{
    if (!sequence.Any())
        return sequence;
    var pivot = sequence.First();
    var remaining = sequence.Skip(1);
    return ((from x in remaining where x.CompareTo(pivot) < 0 select x).QuickSort())
        .Concat(new[] { pivot })  // Enumerable.Concat expects a sequence, so wrap the pivot
        .Concat((from x in remaining where x.CompareTo(pivot) >= 0 select x).QuickSort());
}

What do you notice that is different besides the number of lines of code?

The first definition creates an array and then does an in-place quicksort on the array, destructively updating its contents.  It creates several temporary values (i, j) needed to store intermediate results.  If a programmer hasn't written the algorithm recently, chances are the code will contain at least one off-by-one error or something similar.  I've seen too many people do poorly on whiteboards on similar problems (where, incidentally, there is no debugger, syntax highlighting, or IntelliSense) to believe that most people would just get it right the first time.

The second definition does not modify any data.  Sure, it makes assignments, but each of these assignments is to a newly introduced variable.  These variables are really just aliases for the expressions assigned to them, because those expressions are side-effect free.  Furthermore, the second definition is essentially a brief description of what quicksort is rather than a step-by-step description of how it is accomplished.

At this point, some people will probably be thinking, "But isn't the first approach so much faster than the second approach?"

Maybe, maybe not.  Perhaps it doesn't even matter.

I think it is important to use the following principle:  Make it correct, make it clear, make it concise, make it fast. In that order.  

If I find that the code in question is not fast enough for the scenarios we need to support, or if the code is overly complex for the sake of eliminating side-effects, then I consider adding state.  But note that if I wanted to introduce state into our quicksort, I could do it in two different ways: contained within the quicksort function or exposed to the caller of quicksort.  This is somewhat analogous to calling either the first function or the second function in the first definition of quicksort.  If the state is exposed, then quicksort could sort the caller's collection in place.  Alternatively, quicksort could simply return a new sorted collection and leave the original collection untouched.  This leads to the second principle:

Write side-effect free code. If side-effects are required then confine them to the smallest scope possible.

Most programmers have been following this to some extent for years.  We all know that it is generally not a good idea to use global variables.  That is basically the extreme case of exposing side-effects (the global scope).  Unfortunately, many of the programmers who avoid global variables don't realize that the same principle applies to fields, properties, parameters, and variables on a more limited scale: don't mutate them unless you have a good reason.
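As a small illustration of confining side-effects to the smallest scope possible (a hypothetical helper, not code from the quicksort above): the dictionary below is mutated, but only inside the method, so from the caller's point of view the function is pure.

using System.Collections.Generic;

static class Frequencies
{
    // The dictionary is mutated locally, but the mutation never escapes:
    // callers only ever see the finished result.
    public static IDictionary<T, int> Histogram<T>(IEnumerable<T> items)
    {
        var counts = new Dictionary<T, int>();
        foreach (var item in items)
        {
            int count;
            counts.TryGetValue(item, out count);
            counts[item] = count + 1;
        }
        return counts;
    }
}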

A Little Story about State

Some time ago, I wrote a post about a way to save the state of an IEnumerator<T>.  What I didn't tell is the story behind the post.

I was writing a managed lexer and parser generator.  The system is pretty neat and includes things like GLR parsing, automatic inference of the AST types, automatic error recovery, and a whole slew of other features.  The parser takes an IEnumerable<IToken> in order to decouple the lexer from the parser.  However, for many of its features, the parser needs a way to save the state of the underlying token stream and possibly restore it later, after more of the stream has been consumed.

Originally, I wrote a MarkableEnumerator<T> which wrapped an IEnumerator<T> and consumed the input.  Clients could request a mark which they could then restore at a later point.  If they did this, then the MarkableEnumerator<T> would continue at the restored point when MoveNext was called again.  This little piece of software (about 100 lines of code) caused Cyrus and me lots of trouble.  It was super complicated because it was heavily optimized to keep its internal storage as small as possible while preserving requested past state (and possibly restoring it).  It used a cyclical buffer and a number of opaque handles.  One of the worst problems was that it was a somewhat leaky abstraction: its complexity leaked into the code consuming it.

When the code was first introduced it was reasonably contained but over time its inherent complexity began to pervade the whole parser.  Eventually, it was hard to make improvements to the parser without breaking the invariants of the MarkableEnumerator.  It was an absolute horror.  I remember one night well past midnight, Cyrus and I were talking about how much we hated the thing, but we needed it to provide good performance.

Then it occurred to us that we could replace the whole mess of opaque handles, circular buffers, mark-and-release code, and MarkableEnumerators within MarkableEnumerators with just a linked list that wrapped the enumerator.  If someone wanted to save the state of the enumerator and resume it later, all they needed to do was keep a reference to the desired node.  To all of the clients, it looked just like an immutable linked list.  That is it.  So much simpler, and guess what?  It was almost as fast.  But even more importantly, because of its simplicity we were able to make other, more important code faster.  Development of new features took off again.
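A rough sketch of the idea (the type and member names are mine, not the actual parser code): each node pulls one element from the underlying enumerator on demand and caches the node that follows it, so holding a reference to a node is all it takes to "mark" a position.

using System.Collections.Generic;

// A lazy list over an IEnumerator<T>: each node caches one element and,
// on demand, the node that follows it.  Holding a reference to a node is
// all a client needs in order to resume from that point later.  The only
// mutation (filling in 'next') is invisible to clients, so to them the
// whole thing looks like an immutable linked list.
class EnumeratorNode<T>
{
    readonly IEnumerator<T> enumerator;
    readonly T value;
    readonly bool hasValue;
    EnumeratorNode<T> next;          // created at most once, on demand

    public EnumeratorNode(IEnumerator<T> enumerator)
    {
        this.enumerator = enumerator;
        this.hasValue = enumerator.MoveNext();
        this.value = hasValue ? enumerator.Current : default(T);
    }

    public bool HasValue { get { return hasValue; } }
    public T Value { get { return value; } }

    public EnumeratorNode<T> Next
    {
        get
        {
            if (next == null)
                next = new EnumeratorNode<T>(enumerator);
            return next;
        }
    }
}

(As written this sketch is not thread-safe; it only shows the shape of the solution, not the code we shipped.)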

There are several lessons from this story, but one of them is perhaps the most important considering our current discussion.  Reducing the number and scope of side-effects simplifies the code.

Immutable Binary Trees

Mutation often seems to just cause problems.

Suppose that we want to create a BinaryTree class.  One straightforward implementation is

class BinaryTree<T>
{
    public T Value { get; set; }               // yes, we have autoimplemented
    public BinaryTree<T> Parent { get; set; }  // properties in case you haven't
    public BinaryTree<T> Left { get; set; }    // heard
    public BinaryTree<T> Right { get; set; }
}

I don't like it very much.  Those setters give me an uneasy feeling.  Of course it depends on how the objects are used, but suppose that we intend to use the binary tree in a multithreaded environment.  If someone changes any of the members of a node, there is a chance that some other thread already has a reference to a node in the tree and will then see the tree in an inconsistent state.  Yes, we could use a lock to fix the problem, but I tend to prefer a different approach.

class BinaryTree<T>
{
    T value;
    BinaryTree<T> left;
    BinaryTree<T> right;

    public BinaryTree(T value, BinaryTree<T> left, BinaryTree<T> right)
    {
        this.value = value;
        this.left = left;
        this.right = right;
    }

    public T Value { get { return value; } }
    public BinaryTree<T> Left { get { return left; } }
    public BinaryTree<T> Right { get { return right; } }
}

This is essentially the same as the first BinaryTree except that it is immutable (no setters and no methods causing mutation).  This requires a constructor that takes the appropriate values.  Also, you may have noticed that there is no Parent property.  The reason is that if the node is immutable, then it is hard to know what the parent is at the time of creation.  With an immutable tree, you can generally either create parents before children or children before parents.  I prefer the latter (despite the biological conundrum this poses).  Note that this doesn't mean we can't know what the parent of a given node is.  Indeed we can, but we need to introduce a new concept before addressing it.
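Building a tree with this class therefore goes bottom-up; a tiny sketch (the variable names are just illustrative):

// Children are constructed first and then handed to the parent's constructor.
var leftLeaf  = new BinaryTree<int>(1, null, null);
var rightLeaf = new BinaryTree<int>(3, null, null);
var root      = new BinaryTree<int>(2, leftLeaf, rightLeaf);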

A node can really be parented by any number of nodes, since any node can be constructed with a given node as either its left or right child.  But given a tree (a root), we can enforce (assuming that we don't want DAGs) that each node has at most one parent.  We will call this a BinaryTreeVersion.

class BinaryTreeVersion<T>
{
    BinaryTree<T> tree;
    Dictionary<BinaryTree<T>, BinaryTree<T>> parentMap;

    public BinaryTreeVersion(BinaryTree<T> tree)
    {
        this.tree = tree;
    }

    public BinaryTree<T> Tree { get { return tree; } }

    public BinaryTree<T> GetParentOf(BinaryTree<T> node)
    {
        if (parentMap == null)
        {
            parentMap = new Dictionary<BinaryTree<T>, BinaryTree<T>>();
            FillParentMap(tree, null);
        }
        return parentMap[node];
    }

    void FillParentMap(BinaryTree<T> tree, BinaryTree<T> parent)
    {
        if (tree == null)
            return;
        parentMap.Add(tree, parent);
        FillParentMap(tree.Left, tree);
        FillParentMap(tree.Right, tree);
    }
}

The parent operation is exposed on the BinaryTreeVersion.  The parent of a given node is stored in the parentMap which is filled with node-parent pairs.  So now we have a tree that is immutable and the ability to access a node's parent.
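Continuing the little construction sketch from earlier, a parent lookup goes through the version:

var version = new BinaryTreeVersion<int>(root);
var parent = version.GetParentOf(leftLeaf);      // returns root
var parentOfRoot = version.GetParentOf(root);    // returns null, the root has no parent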

But we can take it one step further: we can create functions that "modify" the tree.  Note that we will not actually modify the tree; instead we will return a new BinaryTreeVersion.  The ChangeNodeValues function takes two functions: a function that determines whether to modify a given node's value, and a function to compute the node's new value (there are better ways to do this for specific modifications, and yes, I could have used a visitor).

class BinaryTreeVersion<T>
{
    ...

    public BinaryTreeVersion<T> ChangeNodeValues(Func<BinaryTree<T>, bool> predicate, Func<BinaryTree<T>, T> newValue)
    {
        return new BinaryTreeVersion<T>(ChangeNodeValue(tree, predicate, newValue));
    }

    BinaryTree<T> ChangeNodeValue(BinaryTree<T> node, Func<BinaryTree<T>, bool> predicate, Func<BinaryTree<T>, T> newValue)
    {
        if (node == null)
            return null;
        var left = ChangeNodeValue(node.Left, predicate, newValue);
        var right = ChangeNodeValue(node.Right, predicate, newValue);
        var value = predicate(node) ? newValue(node) : node.Value;
        if (value.Equals(node.Value) && left == node.Left && right == node.Right)
            return node;
        return new BinaryTree<T>(value, left, right);
    }
}

The neat thing is that since we use immutable trees, we can share the subtrees that are not changed.  Furthermore, if another thread has a reference to the first BinaryTreeVersion or to a node in the first tree, then things will still just work, because we are not mucking around with the data.
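Continuing the same sketch, here is the sharing at work: only the nodes on the path to a changed node are rebuilt, so the untouched left leaf is physically shared between the two versions.

var newVersion = version.ChangeNodeValues(
    n => n.Value > 2,        // change only the nodes with values greater than 2
    n => n.Value + 10);      // give them a new value

// The left leaf was untouched, so both versions reference the very same node.
bool shared = ReferenceEquals(version.Tree.Left, newVersion.Tree.Left);   // true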

The Benefits of Purity

One of the most painful things in software development is integrating two or more things (objects, assemblies, functions, ...) together.  Invariably something goes wrong, and a programmer who worked on one of the things in question will protest, "It worked on my machine!"  Yes, it probably did, and it probably even works by itself, but when put together with other things it doesn't play so nicely.

One way to increase the reliability of a unit is to eliminate the side-effects.  This makes composing and integrating units much easier and more robust.  Since they are side-effect free, they always produce the same results from the same inputs, no matter the environment, and a call can be replaced with its result without changing the program.  This is called referential transparency.
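A tiny illustration of the difference:

// Referentially transparent: the call can be replaced by its value
// anywhere it appears without changing the program.
var min = Math.Min(2, 3);          // behaves exactly like: var min = 2;

// Not referentially transparent: no fixed value can stand in for the call.
var line = Console.ReadLine();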

Conversely, things that are side-effect free are also easier to divide.  For a good example of this, take a look at PLINQ, which farms querying work out to the available threads.  It divides up the work and then integrates the results: map-reduce.
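For instance, a side-effect-free query can be handed to PLINQ with a single call to AsParallel (a minimal sketch; the range and the query itself are just illustrative):

using System.Linq;

// Because the lambda has no side-effects, PLINQ is free to partition the
// range across threads and merge the partial results in any order.
var sumOfSquares = Enumerable.Range(1, 1000000)
                             .AsParallel()
                             .Select(x => (long)x * x)
                             .Sum();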

But that is a discussion for another day...