Writing a Recursive Descent Parser using C# and LINQ

Recursive descent parsers are one of the easier types of parsers to implement.  Given a properly defined grammar, you write a class for each production in the grammar, and you write one fairly simple method in each class.  Each of those methods returns a ‘production’ based on its source productions (tokens) passed to it.  Once you understand the pattern for writing those methods, and how to translate the grammar to the pattern, it’s possible to code those methods just about as fast as you can type.  In those methods, you can identify errors and throw exceptions as necessary.  After parsing, you have a syntax tree (sometimes called a parse tree) that you can examine or modify.

This blog is inactive.
New blog: EricWhite.com/blog

Blog TOCThis post is one in a series on using LINQ to write a recursive-descent parser for SpreadsheetML formulas.  You can find the complete list of posts here.

For a typical professional developer, there are lots of benefits to understanding grammars, recursive descent parsers, and syntax trees.  At one point in my process of learning C#, I read the specification of the C# language, which includes the grammar for the language.  Understanding the grammar, and correlating the grammar to the text verified that I understood each construct, and reduced the chance that I had any misconceptions about the exact semantics of the language.

Going through the process of writing a small recursive descent parser is very helpful too.  This makes sure that you fully understand how grammars are put together, and what they mean.

Finally, as a professional developer, you may come across situations where this information is useful.  Perhaps you need to develop a small domain-specific language (DSL).  There are a number of approaches to developing DSLs, each appropriate in their own situation, and there are a few situations where it is most appropriate to define a grammar and write a parser.

As an Open XML developer, understanding recursive descent parsers is useful.  Formulas in a spreadsheet are stored as strings.  Without parsing those strings, those formulas are opaque.  We can’t examine a formula programmatically and determine what it is doing or what cells it references.  Fortunately, with Open XML, we have a grammar for those formulas.  J

As I was contemplating implementing some functionality (searching for all formulas across a range of workbooks and worksheets that reference a specific cell), I started contemplating writing a recursive descent parser in C# and LINQ.  It promised to be a super-fun programming project.  There are some huge benefits that we gain by writing the parser using LINQ.  Many of those methods that we need to write in the production classes are significantly simplified by the use of LINQ.

Just to make it clear why we need this parser:

In order to examine references to cells in formulas, we must parse those formulas according to the grammar.

For instance, if we want to search for all references to cell A3, we may see the following formula in cell A1:

=A2+A3

If searching for references to A3, this formula should be a match.

In this same spreadsheet, there may be another cell with a name of “ZZZZA3”, and that name could be used in the formula:

=A2+ ZZZZA3

This formula should not be a match.

There are practical reasons for writing this code.  As an Open XML developer, I don’t like it that formulas in cells are opaque to me.  I want to be able to look inside them and process Open XML spreadsheets in new and interesting ways.

However, the most important reason to write these posts is for the sheer, absolute fun of it.  This whole project will be fun.  It will be fun explaining how recursive descent parsers work.  It will be fun to explore how LINQ makes it easy to write recursive descent parsers.  And it will be fun to be able to query Open XML spreadsheets in cool ways.