Monadic Parser Combinators using C# 3.0

Parser combinators are an idea that I enjoy every time I go back and look at again.  They are an approach to building parsers by composing very simple atomic parsers into bigger and bigger units which can ultimately express real world grammars.  This idea has been particularly popular in functional languages where the parsers can naturally be thought of as functions from input strings to parse trees, and composition of parsers is just function composition.  This approach often leads to a simple syntax which makes the resulting parsers pleasantly declarative in that internal-DSL kind of way. 

There's been plenty of great research and implementation of parser combinators over the last 15 years.  I can't describe it all, or even list all the references here - so let me just point you to a couple of fun papers co-authored by my Microsoft colleagues Erik Meijer and Daan Leijen.  These papers also include many references out to the broader set of research papers on the topic.

Building a Parser Combinator Framework in C#3.0

The code below is some sample code I wrote shortly after getting ahold of one of the first C#3.0 prototype compilers a couple years ago.  It uses some of the new languages features in interestingly unusual ways, which I'll highlight as I go along.  For reference, it is most directly influenced by the great parser combinator example given near the end of Scala by Example.

Step 1: A Type for Representing Parsers
 using System;
using System.Linq;
using System.Text;

// The result of a parse consists of a value and the unconsumed input.
public class Result<TInput, TValue> {
    public readonly TValue Value;
    public readonly TInput Rest;
    public Result(TValue value, TInput rest) { Value = value; Rest = rest; }
}

// A Parser is a delegate which takes an input and returns a result.
public delegate Result<TInput, TValue> Parser<TInput, TValue>(TInput input);

Taking a cue from the functional language implementations, our type for representing parsers will be a delegate type.  This naturally captures the idea that a parser is a function which takes in some input, and returns a value as well as the unconsumed part of the input.

Step 2: Provide Infix AND and OR Operations on Parsers
 public static class ParserCombinatorExtensions {
    public static Parser<TInput, TValue> OR<TInput, TValue>(
        this Parser<TInput, TValue> parser1, 
        Parser<TInput, TValue> parser2) 
    {
        return input => parser1(input) ?? parser2(input);
    }
    public static Parser<TInput, TValue2> AND<TInput, TValue1, TValue2>(
        this Parser<TInput, TValue1> parser1, 
        Parser<TInput, TValue2> parser2) 
    {
        return input => parser2(parser1(input).Rest);
    }
}

Before building any actual parsers, we'll just come up with ways to combine them.  There are two obvious ways.  Try the first one, and if it fails, apply the second one.  This corresponds to alternate productions in a BNF grammar.  Or we could apply the first one, and then apply the second one to whatever input is left.  This corresponds to sequencing (juxtaposition) in BNF grammars. 

Note that these are extension methods applied to the Parser type, which is a delegate.  This is something we couldn't have done in C# prior to C# 3.0, because there was no way to add methods to a delegate type.  The "infix" syntax we get by making these extension methods instead of plain static methods is very important for the readability of the resulting parsers, where ordering is critical, as the example at the end of this post will show.

Step 3: Support Composing Parsers using C# Query Expressions
 public static class ParserCombinatorsMonad {
    // By providing Select, Where and SelectMany methods on Parser<TInput,TValue> we make the 
    // C# Query Expression syntax available for manipulating Parsers.  
    public static Parser<TInput, TValue> Where<TInput, TValue>(
        this Parser<TInput, TValue> parser, 
        Func<TValue, bool> pred) 
    {
        return input => { 
                   var res = parser(input); 
                   if (res == null || !pred(res.Value)) return null;
                   return res;
               };
    }
    public static Parser<TInput, TValue2> Select<TInput, TValue, TValue2>(
        this Parser<TInput, TValue> parser, 
        Func<TValue, TValue2> selector) 
    {
        return input => { 
                   var res = parser(input);
                   if (res == null) return null;
                   return new Result<TInput, TValue2>(selector(res.Value), res.Rest);
               }; 
    }
    public static Parser<TInput, TValue2> SelectMany<TInput, TValue, TIntermediate, TValue2>(
        this Parser<TInput, TValue> parser, 
        Func<TValue, Parser<TInput, TIntermediate>> selector, 
        Func<TValue, TIntermediate, TValue2> projector)
    {
        return input => {
                   var res = parser(input);
                   if (res == null) return null;
                   var val = res.Value;
                   var res2 = selector(val)(res.Rest);
                   if (res2 == null) return null;
                   return new Result<TInput, TValue2>(projector(val, res2.Value), res2.Rest);
               };
    }
}

This is where the fun really starts!  Any type which implements Select, SelectMany and Where methods supports (part of) the "query pattern" which means we can write C#3.0 queries including multiple froms, an optional where clause and a select clause to process objects of this type.  This can make it a *lot* easier to combine parsers, and makes the syntax more transparent.  That the parsers we are using here support these three operations is captured in the idea that this parser type is a monad - something very much in LINQ's blood - which I hope to write more about in a future post.

Step 4: The Fundamental Parser Combinator Building Blocks
 // Contains all the basic parsers that are independent of return type.
public abstract class Parsers<TInput> {
    public Parser<TInput, TValue> Succeed<TValue>(TValue value) {
        return input => new Result<TInput, TValue>(value, input);
    }
    public Parser<TInput, TValue[]> Rep<TValue>(Parser<TInput, TValue> parser) {
        return Rep1(parser).OR(Succeed(new TValue[0]));
    }
    public Parser<TInput, TValue[]> Rep1<TValue>(Parser<TInput, TValue> parser) {
        return from x in parser 
               from xs in Rep(parser)
               select (new[] { x }).Concat(xs).ToArray();
    }
}

// Contains a few parsers that parse characters from an input stream
public abstract class CharParsers<TInput> : Parsers<TInput> {
    public abstract Parser<TInput, char> AnyChar { get; }
    public Parser<TInput, char> Char(char ch) { 
        return from c in AnyChar where c == ch select c; 
    }
    public Parser<TInput, char> Char(Predicate<char> pred) { 
        return from c in AnyChar where pred(c) select c; 
    }
}

So now we need to write some (almost) real parsers.  The Succeed parser always succeeds with the provided result without consuming any input.  Rep and Rep1 apply a parser repeatedly, they differ only in whether they require the parser to parse at least once.  Note that Rep1 uses Query Expressions to capture the idea of parsing using "parser", then parsing using "Rep(parser)" then returning a result built out of the two individual results. 

The AnyChar and Char parsers parse actual characters out of the input stream.  AnyChar parses any single character, whereas the two overloads of Char only accept certain classes of characters.  Note that AnyChar is defined abstract because it can only be implemented once we've decided what kind of input we're going to parse - but we want to be able to build on it without committing to a particular input type.

Step 5: An Example Parser for a MiniML Language

 // Term and its derived classes define the AST for terms in the MiniML language.
public abstract class Term { }
public class LambdaTerm : Term { public readonly string Ident; public readonly Term Term; public LambdaTerm(string i, Term t) { Ident = i; Term = t; } }
public class LetTerm : Term { public readonly string Ident; public readonly Term Rhs; public Term Body; public LetTerm(string i, Term r, Term b) { Ident = i; Rhs = r; Body = b; } }
public class AppTerm : Term { public readonly Term Func; public readonly Term[] Args; public AppTerm(Term func, Term[] args) { Func = func; Args = args; } }
public class VarTerm : Term { public readonly string Ident; public VarTerm(string ident) { Ident = ident; } }

// Provides a set of parsers for the MiniML Language defined above.  
public abstract class MiniMLParsers<TInput> : CharParsers<TInput>{
    public MiniMLParsers() {
        Whitespace = Rep(Char(' ').OR(Char('\t').OR(Char('\n')).OR(Char('\r'))));
        WsChr =  chr => Whitespace.AND(Char(chr));
        Id =     from w in Whitespace
                 from c in Char(char.IsLetter)
                 from cs in Rep(Char(char.IsLetterOrDigit))
                 select cs.Aggregate(c.ToString(),(acc,ch) => acc+ch);
        Ident =  from s in Id where s != "let" && s != "in" select s;
        LetId =  from s in Id where s == "let" select s;
        InId =   from s in Id where s == "in" select s;
        Term1 = (from x in Ident 
                 select (Term)new VarTerm(x))
                .OR(
                (from u1 in WsChr('(') 
                 from t in Term 
                 from u2 in WsChr(')') 
                 select t));
        Term =  (from u1 in WsChr('\\')
                 from x in Ident
                 from u2 in WsChr('.')
                 from t in Term
                 select (Term)new LambdaTerm(x,t))
                .OR(
                (from letid in LetId
                 from x in Ident
                 from u1 in WsChr('=')
                 from t in Term
                 from inid in InId
                 from c in Term
                 select (Term)new LetTerm(x,t,c)))
                .OR(
                (from t in Term1
                 from ts in Rep(Term1)
                 select (Term)new AppTerm(t,ts)));
        All =    from t in Term from u in WsChr(';') select t;
    }
   
    public Parser<TInput, char[]> Whitespace;
    public Func<char, Parser<TInput, char>> WsChr;
    public Parser<TInput, string> Id;
    public Parser<TInput, string> Ident;
    public Parser<TInput, string> LetId;
    public Parser<TInput, string> InId;
    public Parser<TInput, Term> Term;
    public Parser<TInput, Term> Term1;
    public Parser<TInput, Term> All;
}

Here we've used the parsers built so far to implemented a parser for a (toy) language called MiniML.  You can immediately see that the parser is structured much like the BNF it implements - which is the internal-DSL concept I mentioned above.  You can also see how C# Query Expressions are used to help with this - in a context that is very different than their standard query usage.

Step 6: Parse Something!
 public class MiniMLParserFromString: MiniMLParsers<string> {
    public override Parser<string, char> AnyChar {
        get { { return input => input.Length > 0 ? new Result<string,char>(input[0], input.Substring(1)) : null; } }
    }   
}
class Program {
    static void Main(string[] args) {
        MiniMLParserFromString parser = new MiniMLParserFromString();
        Result<string,Term> result = 
            parser.All(@"let true = \x.\y.x in 
                         let false = \x.\y.y in 
                         let if = \b.\l.\r.(b l) r in
                         if true then false else true;");
    }
}

We finally have something we can execute!  MiniMLParserFromString makes everything concrete by describing how to parse characters from a string input.  And then in the Main method we see an example which parsers a little example expression from the MiniML language.

Conclusion

The little parser combinator framework here is very much a toy implementation.  It's horribly inefficient, doesn't do error reporting, and doesn't support any form of error recovery.  But it's a fun 150 lines of code!  If you are interested in the more practical applications of parser combinators - these papers provide a lot of good ideas:

And here's the code:

File iconParserCombinators.cs