Building a Simple Recursive Descent Parser (continued)

Article
08/03/2010

In this post, I further enhance the recursive descent parser that will parse the simple grammar that I presented previously in this series. I’ll examine some more classes that represent symbols in that grammar. The classes I present in this post further show how it’s possible to write LINQ code that closely parallels the grammar.

This blog is inactive.
New blog: EricWhite.com/blog

Blog TOCThis post is one in a series on using LINQ to write a recursive-descent parser for SpreadsheetML formulas. You can find the complete list of posts here.

In this version of the parser, I’ll enhance the parser to a certain extent, but it still will not be a full implementation of the simple grammar that I defined. In particular, I’m not going to implement infix operators until the next post. Infix operators are a bit more involved, particularly because we need to implement operator precedence. It is not a lot of code, but I use a few useful little LINQ tricks that deserve explanations.

The NospaceExpression Class

In the last version of the parser, I implemented a ‘dummy’ NospaceExpression class. For development and testing purposes, any sequence of symbols passed to that implementation of the NospaceExpression.Produce method were considered to make up a valid nospace-expression symbol.

In this post, I’ll simplify the grammar for the NospaceExpression symbol to the following, eliminating infix operators:

// nospace-expression = open-parenthesis expression close-parenthesis
// / numerical-constant
// / prefix-operator expression
~~// / expression infix-operator expression~~ (not implemented yet)

Here is this implementation of the NospaceExpression class:

public class NospaceExpression : Symbol
{
public static NospaceExpression Produce(IEnumerable<Symbol> symbols)
{
// nospace-expression = open-parenthesis expression close-parenthesis
// / numerical-constant
// / prefix-operator expression
~~// / expression infix-operator expression~~ (not implemented yet)

if (!symbols.Any())
return null;

// nospace-expression = open-parenthesis expression close-parenthesis
if (symbols.First() is OpenParenthesis && symbols.Last() is CloseParenthesis)
{
Expression e = Expression.Produce(symbols.Skip(1).SkipLast(1));
if (e != null)
return new NospaceExpression(new OpenParenthesis(), e, new CloseParenthesis());
}

// numerical-constant
NumericalConstant n = NumericalConstant.Produce(symbols);
if (n != null)
return new NospaceExpression(n);

// prefix-operator expression
PrefixOperator p = PrefixOperator.Produce(symbols.FirstOrDefault());
if (p != null)
{
Expression e = Expression.Produce(symbols.Skip(1));
if (e != null)
return new NospaceExpression(p, e);
}

return null;
}

public NospaceExpression(params Object[] symbols) : base(symbols) { }
}

Yellow code: The nospace-expression symbol is made up of three possible sets of constituent tokens. It could be made up of an open parenthesis, an expression, and a close parenthesis.

Green code: It could be made up of a numerical constant.

Cyan code: It could be made up of a prefix operator followed by a NospaceExpression.

If it is none of those three, then the Produce method returns null.

You can see that the C# / LINQ code to process the constituent symbols and produce a symbol has a direct correlation to the grammar.

The NumericalConstant Class

The next class I’ll examine in this post is the NumericalConstant class. One interesting characteristic of the grammar associated with this class is that it contains an optional symbol in the grammar, indicated by the square brackets:

// numerical-constant = [neg-sign] significand-part

This is described in the post on the Augmented Backus-Naur Form Grammar, which discusses the various notations that we need to pay attention to.

The NumericalConstant.Produce method first attempts to produce a SignificandPart symbol using the complete list of constituent symbols that was passed to it. If successful, the method can produce a NumericalConstant object with one constituent symbol, the SignificandPart object.

If SignificandPart fails to produce a symbol, then this method asks the NegSign.Produce method to produce a NegSign object, passing the first symbol in the collection to it. If that was successful, then the method asks the SignificandPart class to produce a symbol using a collection of all symbols except the first one (using the Skip extension method).

public class NumericalConstant : Symbol
{
public static NumericalConstant Produce(IEnumerable<Symbol> symbols)
{
// numerical-constant = [neg-sign] significand-part

SignificandPart s = SignificandPart.Produce(symbols);
if (s != null)
return new NumericalConstant(s);
NegSign n = NegSign.Produce(symbols.First());
if (n != null)
{
SignificandPart s2 = SignificandPart.Produce(symbols.Skip(1));
if (s2 != null)
return new NumericalConstant(n, s2);
}
return null;
}
public NumericalConstant(params Object[] symbols) : base(symbols) { }
}

One key point about the NegSign.Produce method: it will be asked to produce a symbol only for a single symbol, which will be a Minus symbol, so for simplicity, its Produce method takes a singleton Symbol instead of a collection. Here is the NegSign class:

public class NegSign : Symbol
{
public static NegSign Produce(Symbol symbol)
{
// neg-sign = minus

if (symbol is Minus)
return new NegSign(symbol);
return null;
}
public NegSign(params Object[] symbols) : base(symbols) { }
}

The SignificandPart Class

As a last example in this post, let’s take a look at the significand-part symbol, which is comprised of one of:

A fractional part
A whole-number-part followed by an optional fractional-part

// significand-part = fractional-part / whole-number-part [fractional-part]

To process this grammar, the first thing to do is to ask the FractionalPart class to produce a symbol, passing all of the symbols that were passed to SignificandPart.Produce. If the FractionalPart class can produce that symbol, then the method can return a SignificandPart object.

If the FractionalPart class can’t produce a symbol, then SignificandPart.Produce needs to ask the WholeNumberPart class to produce a symbol. However, if there is a fractional part following the WholeNumberPart, WholeNumberPart.Produce will not consume all of the constituent symbols. Therefore, WholeNumberPart.Produce has an additional out argument, which will contain any unprocessed symbols. If the unprocessed symbols collection is empty, then there is no fractional part, and SignificandPart can produce its symbol (which has no fractional part).

However, if the unprocessed symbols collection is not empty, then SignificandPart.Produce will ask FractionalPart.Produce to produce a FractionalPart with the unprocessed symbols, and if it does, then SignificandPart.Produce can produce its symbol.

This sounds more complicated than it is. Here is the implementation of the SignificandPart class:

public class SignificandPart : Symbol
{
public static SignificandPart Produce(IEnumerable<Symbol> symbols)
{
// significand-part = fractional-part / whole-number-part [fractional-part]

FractionalPart f;
f = FractionalPart.Produce(symbols);
if (f != null)
return new SignificandPart(f);
IEnumerable<Symbol> s = null;
WholeNumberPart w = WholeNumberPart.Produce(symbols, out s);
if (w != null)
{
if (!s.Any())
return new SignificandPart(w);
f = FractionalPart.Produce(s);
if (f != null)
return new SignificandPart(w, f);
}
return null;
}
public SignificandPart(params Object[] symbols) : base(symbols) { }
}

And here is the WholeNumberPart class, which potentially will not consume all of the symbols that are passed to its Produce method.

public class WholeNumberPart : Symbol
{
public static WholeNumberPart Produce(IEnumerable<Symbol> symbols,
out IEnumerable<Symbol> symbolsToProcess)
{
// whole-number-part = digit-sequence

IEnumerable<Symbol> unprocessedSymbols = null;
DigitSequence d = DigitSequence.Produce(symbols, out unprocessedSymbols);
if (d != null)
{
symbolsToProcess = unprocessedSymbols;
return new WholeNumberPart(d);
}
symbolsToProcess = null;
return null;
}
public WholeNumberPart(params Object[] symbols) : base(symbols) { }
}

It, in turn, is comprised of a DigitSequence symbol, which is interesting in that it provides another example of how the LINQ code parallels the grammar. A DigitSequence is comprised of one or more DecimalDigit symbols (which are terminals). The LINQ code essentially says: query for the first n DecimalDigit symbols in the sequence. If there are any, then return a DigitSequence symbol, and set the out parameter to the unprocessed symbols. If there aren’t any, then return null.

public class DigitSequence : Symbol
{
public static DigitSequence Produce(IEnumerable<Symbol> symbols,
out IEnumerable<Symbol> symbolsToProcess)
{
// digit-sequence = 1*decimal-digit

IEnumerable<Symbol> digits = symbols.TakeWhile(s => s is DecimalDigit);
if (digits.Any())
{
symbolsToProcess = symbols.Skip(digits.Count());
return new DigitSequence(digits);
}
symbolsToProcess = null;
return null;
}
public DigitSequence(params Object[] symbols) : base(symbols) { }
}

In the next post, I’ll dive into processing infix operators, and how we can write succinct code to implement operator precedence.

Building a Simple Recursive Descent Parser (continued)

The NospaceExpression Class

The NumericalConstant Class

The SignificandPart Class

Additional resources