Building a Simple Recursive Descent Parser

In this post, I present the start of a recursive descent parser that will parse the simple grammar that I presented previously in this series.  I’ll point out some key features of the code so that it is easy to see how the code works.

This blog is inactive.
New blog: EricWhite.com/blog

Blog TOCThis post is one in a series on using LINQ to write a recursive-descent parser for SpreadsheetML formulas.  You can find the complete list of posts here.

The Symbol Class

We need to write a class for each symbol (production) in the grammar.  Most of these classes have a method named Produce that given a collection of Symbol objects, produces an instance of that symbol.  Each Produce method in the various classes is free to return null if the collection that is passed to the method can’t produce that symbol.

We need to define an abstract base class for each of these classes.  Most instances of classes that derive from the abstract base class will contain a list of constituent symbols, so the abstract base class includes a public property, ConstituentSymbols, of type List<Symbol>.

The ToString method for each of these classes returns the string representation of the symbol.  If a symbol is comprised of a list of constituent symbols (i.e. is not a terminal), then the ToString method returns the concatenated strings of the constituent symbols.

There is a constructor for Symbol that takes a params array of Object.  Each element in the params array can either be an instance of Symbol, or an object that implements IEnumerable<Symbol>.  Those are the only valid types in the params array.  The constructor throws an internal ParserException if anything other than one of those types is passed as an argument.  I explained this idiom in the previous post in this series, Creating a Collection from Singletons and Collections using LINQ.  This method also uses the StringConcatenate extension method, which I discussed in this topic in the LINQ to XML documentation.

public abstract class Symbol
{
public List<Symbol> ConstituentSymbols { get; set; }
public override string ToString() {
string s = ConstituentSymbols.Select(ct => ct.ToString()).StringConcatenate();
return s;
}
public Symbol(params Object[] symbols)
{
List<Symbol> ls = new List<Symbol>();
foreach (var item in symbols)
{
if (item is Symbol)
ls.Add((Symbol)item);
else if (item is IEnumerable<Symbol>)
foreach (var item2 in (IEnumerable<Symbol>)item)
ls.Add(item2);
else
// If this error is thrown, the parser is coded incorrectly.
throw new ParserException("Internal error");
}
ConstituentSymbols = ls;
}
public Symbol() { }
}

The OpenParenthesis Class

There are two varieties of symbols – terminal, and non-terminal.  I discussed both types of symbols in the post on the Augmented Backus-Naur Form Grammar.  A subclass of the Symbol class for terminal symbols is super simple.  It contains an override of the ToString method, which returns the string of the terminal.  Due to the semantics of C#, it also needs to include a constructor with no parameters:

public class OpenParenthesis : Symbol
{
public override string ToString() { return "("; }
public OpenParenthesis() { }
}

All of the terminals except for DecimalDigit look like the OpenParenthesis class.  DecimalDigit is similar, except that its constructor takes a character, and the ToString method returns that character.  To make the code as succinct as possible, digits are stored as strings instead of characters.

public class DecimalDigit : Symbol
{
private string CharacterValue;
public override string ToString() { return CharacterValue; }
public DecimalDigit(char c) { CharacterValue = c.ToString(); }
}

The Formula Class

A derived class for non-terminal symbols is also pretty simple.  There is one public method, Produce, which takes a collection of Symbol objects, and then produces an instance of the class if it can; otherwise it returns null.

Recursive descent parsers are ‘top-down’ parsers, so it makes sense to define the Formula class first.  If you examine the simple grammar that I defined, you can see that a formula is comprised of an expression, so the Formula.Produce method simply passes on the collection of Symbol objects to the Expression.Produce method.  Again, due to the semantics of C#, it is necessary to define the constructor that takes a params array of Object, but it doesn’t need to do anything other than call the constructor in the base class.

class Formula : Symbol
{
public static Formula Produce(IEnumerable<Symbol> symbols)
{
// formula = expression

Expression e = Expression.Produce(symbols);
return e == null ? null : new Formula(e);
}
public Formula(params Object[] symbols) : base(symbols) { }
}

The Expression Class

The Expression class is more interesting.  The grammar production for Expression is as follows:

expression = *whitespace nospace-expression *whitespace

This means that an expression consists of zero or more WhiteSpace symbols, followed by one and only one NospaceExpression, followed by zero or more WhiteSpace symbols.  Using LINQ, it is super easy to count the WhiteSpace symbols at the beginning and end of the collection of symbols.  The Expression.Produce method can then pass the symbols in the middle (between the white space) to NospaceExpression.Produce.  If that method returns a NospaceExpression object, then the method can instantiate an Expression object and return it.  The Expression.Process method makes use of the SkipLast extension method.  The Expression class looks like this:

public class Expression : Symbol
{
public static Expression Produce(IEnumerable<Symbol> symbols)
{
// expression = *whitespace nospace-expression *whitespace

int whiteSpaceBefore = symbols.TakeWhile(s => s is WhiteSpace).Count();
int whiteSpaceAfter = symbols.Reverse().TakeWhile(s => s is WhiteSpace).Count();
IEnumerable<Symbol> noSpaceSymbolList = symbols
.Skip(whiteSpaceBefore)
.SkipLast(whiteSpaceAfter)
.ToList();
NospaceExpression n = NospaceExpression.Produce(noSpaceSymbolList);
if (n != null)
return new Expression(
Enumerable.Repeat(new WhiteSpace(), whiteSpaceBefore),
n,
Enumerable.Repeat(new WhiteSpace(), whiteSpaceAfter));
return null;
}

public Expression (params Object[] symbols) : base(symbols) { }
}

I follow the same pattern for every class – the grammar is included in a comment, followed by a bit of LINQ code to produce an instance of the class, returning null if necessary.

Let’s look at the code to instantiate the Expression object:

return new Expression(
Enumerable.Repeat(new WhiteSpace(), whiteSpaceBefore),
n,
Enumerable.Repeat(new WhiteSpace(), whiteSpaceAfter));

The arguments in yellow are collections of Symbol.  The parameter n, which is of type NospaceExpression, is a singleton, so we’re taking advantage of the ability to pass either singletons or collections to the Expression constructor.

The NospaceExpression Class

To keep this first example as simple as possible, we’ll implement a dummy NospaceExpression class.  We’ll pretend that any collection of symbols is a valid NospaceExpression symbol.  In the next post, I’ll examine how we need to code the NospaceExpression class, as well as classes for some of the other symbols.

public class NospaceExpression : Symbol
{
public static NospaceExpression Produce(IEnumerable<Symbol> symbols)
{
return new NospaceExpression(symbols);
}

public NospaceExpression(params Object[] symbols) : base(symbols) { }
}

Projecting a Collection of Symbols from a String

One aspect of the approach that I took is that I first projected a collection of terminal Symbol objects from the collection of characters that make up the string being parsed.  The string class implements IEnumerable<char> so we can do a simple select, as follows:

IEnumerable<Symbol> symbols = s.Select(c =>
{
switch (c)
{
case '0':
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
case '8':
case '9':
return new DecimalDigit(c);
case ' ':
return new WhiteSpace();
case '+':
return new Plus();
case '-':
return new Minus();
case '*':
return new Asterisk();
case '/':
return new ForwardSlash();
case '^':
return new Caret();
case '.':
return new FullStop();
case '(':
return new OpenParenthesis();
case ')':
return new CloseParenthesis();
default:
return (Symbol)null;
}
});

If we parse the formula " (1+3)  ", and dump out the terminal symbols, we see this:

Terminal Symbols
================
WhiteSpace > <
OpenParenthesis >(<
DecimalDigit >1<
Plus >+<
DecimalDigit >3<
CloseParenthesis >)<
WhiteSpace > <
WhiteSpace > <

By first transforming the string into a collection of terminals, it allows us to write more expressive code.  It is easier to read this code:

If (t is WhiteSpace)

It is a bit harder to read this:

If (t is Character && t.ToString() == " " )

The DumpSymbolRecursive Method

As you can see, Symbol objects form a hierarchical tree – each symbol (except for terminals) has constituent symbols.  This allows us to write a recursive method to dump symbols to a StringBuilder object.

public static void DumpSymbolRecursive(StringBuilder sb, Symbol symbol, int depth)
{
sb.Append(string.Format("{0}{1} >{2}<",
"".PadRight(depth * 2),
symbol.GetType().Name.ToString(),
symbol.ToString())).Append(Environment.NewLine);
if (symbol.ConstituentSymbols != null)
foreach (var childSymbol in symbol.ConstituentSymbols)
DumpSymbolRecursive(sb, childSymbol, depth + 1);
}

If we parse the formula “ (1+3)  ” and dump it, we see:

Formula > (1+3) <
Expression > (1+3) <
WhiteSpace > <
NospaceExpression >(1+3)<
OpenParenthesis >(<
DecimalDigit >1<
Plus >+<
DecimalDigit >3<
CloseParenthesis >)<
WhiteSpace > <
WhiteSpace > <

We can see the instances of the Formula, Expression, and NospaceExpression classes, as well as the various terminals that make up the expression.

Complete Listing

Here is the complete listing of the first version of the SimpleFormulaParser example.  You can simply cut and paste this code into Visual Studio, and run it.  It doesn’t have any dependencies.  It transforms a formula to a collection of terminals, prints the terminals, and then uses the Formula, Expression, and NospaceExpression classes to produce a parse tree (which is incomplete because NospaceExpression is a dummy implementation).

In the next post in this series, I’ll continue to develop this simple parser.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace SimpleParser
{
public abstract class Symbol
{
public List<Symbol> ConstituentSymbols { get; set; }