DLR Trees (Part 1)

I'd like to take a detour from discussing the DLR's type system to start talking about implementation. I hope that this will make it easier to talk about some difficult questions like what is the right meta-model for the DLR type system. Let's start by compiling the traditional factorial function for a DLR-based language - using JavaScript for today.

 function factorial(n) {
    if (n <= 1) {
        return 1;
    } else {
        return n * fact(n - 1);
    }
}

The JavaScript compiler will take this code and compile it into the following tree form and pass the tree to the DLR. 

The key implementation trick in the DLR is using these kinds of trees to pass code around as data and to keep code in an easily analyzable and mutable form as long as possible. Anyone who's programmed in Lisp or Scheme knows this trick extremely well. Much more recently, this idea has resurfaced as one of the central features in C# 3 and VB.NET 9 that enable LINQ. Linq uses these "expression trees" to capture the semantics of a given block of code in a sufficiently abstract way that it can be optimized to run in different contexts. For example, the DLINQ SQL interfaces can convert that code into a SQL query that will be optimized by the database engine on the back-end. More radical projects like PLINQ are exploring how to take these same kinds of trees and rearrange the code to execute in parallel and on multiple cores when possible.

When we started building the DLR we knew that we wanted a shared tree form, but we believed that we wanted purely untyped and late-bound nodes in the tree that would always use dynamic actions for their operations. This matched what we had for IronPython where the only typed node in the tree was the ConstantExpression - which had the type of whatever its constant value was. Because our trees were untyped, we didn't believe that they shared much in common with the always strongly typed expression trees in C# 3 and we went about designing these trees as something completely independent.

As we started adding more language implementations, we found that it was difficult to capture different languages idiosyncratic concepts in the DLR tree. Examples of this range from Python's print statement to JavaScript's regular expression literals. We needed to support these language features, but we couldn't add explicit support for them into our trees.  Well, clearly we could have added explicit support for them in the tree, but where would that have ended? Ultimately, if the DLR is going to be successful it will have to be general enough that language implementations can represent their own concepts clearly even if there is no direct DLR equivalent.

For Python's print statement, we knew that IronPython already compiled this into a call to a static method that handled all of the implementation details of printing in Python - including the dreaded softspace. We could support this by adding a static method call node to the DLR tree and by converting the Python print statement into a static method call to this helper method. Similarly, when we went to support the special Python constant for an ellipsis, "...", we knew that the existing IronPython implementation compiled this into a static field access of the singleton ellipsis instance. We could support this by adding a static field access node to the DLR tree and converting the Python ellipsis instance into one of these nodes.

Pretty soon we realized that the nodes that we were adding mapped directly onto the statically typed and bound expression tree nodes from C# 3. This was a good thing, since it meant that there was less for us to create from scratch and more opportunity for us to steal from existing ideas. The DLR trees today in theory include all of the expression tree nodes from C# 3 and add to them several additional kinds of nodes to handle dynamic operations, variables and name binding, and control flow. If you look at the code you'll notice that in fact we only support a subset of the C# 3 nodes and that our APIs don't exactly match the existing expression trees. This is simply an artifact of our late realization that these trees were so closely related and we're now in the process of reconciling our trees with the existing expression trees.

The value of having dynamic languages target this tree form is that we can perform lots of optimizations in the DLR layer on behalf of the language implementations. Some of these optimizations include variable storage where we can often optimize local variables as CLR local variables that can be JITed into machine registers - but where we can also detect explicit meta-programming such as Python's locals() function or JavaScript's arguments feature and fall back to slower and more explicit storage mechanisms such as tuples or object fields when needed. Similarly for the dynamic action nodes we can use different implementation techniques to optimize these important but potentially slow operations.

This post is already too long and it's been over a week since my last post.  I'm going to finish up for today and save any more details about the optimizations we perform on these trees for the future. For those of you looking at code, the DLR trees can be found in Microsoft.Scripting.Ast.