Using the Roslyn Syntax API

11/03/2011

By Kevin Pilch-Bisson

As promised back when we released the Roslyn CTP, here is the first of a series of blog posts that help explain the various parts of the Roslyn API. If you don’t have the Roslyn CTP, you can get it from our MSDN page and install it on top of Visual Studio 2010 SP1 and the VSSDK. If you want to get into using the Roslyn CTP seriously, I highly recommend taking a look at the overview document and the various walkthroughs that we’ve put together.

Properties

The Syntax API in Roslyn has two key properties. First, it is a full-fidelity model. Every character of a source file is represented in the API, which means that it also round trips: you can parse a source file, call ToString on the resulting SyntaxTree, and get the original file text back. Second, the trees are immutable: a SyntaxNode never changes once it has been created, so any modification produces a new tree instead.
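As a quick check of the round-trip property, the sketch below parses a string with irregular spacing and comments, then compares the tree's text with the input. It uses only ParseCompilationUnit and ToString as described in this post; the names RoundTripDemo and RoundTrips are made up for illustration, and it assumes the Roslyn CTP assemblies are referenced.

```csharp
using System;
using Roslyn.Compilers;
using Roslyn.Compilers.CSharp;

static class RoundTripDemo
{
    // Hypothetical helper: true when parsing and then printing the tree
    // reproduces the input text exactly.
    public static bool RoundTrips(string text)
    {
        var syntaxTree = SyntaxTree.ParseCompilationUnit(text);
        return syntaxTree.ToString() == text;
    }

    static void Main()
    {
        // Odd spacing and comments are included on purpose: a full-fidelity
        // model has to keep them.
        var text = "class   C { /* empty */ }  // trailing comment";

        // Should print True if the tree round trips as described above.
        Console.WriteLine(RoundTrips(text));
    }
}
```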

Key Types

The key types in the Syntax API of the Roslyn CTP are: SyntaxNode, SyntaxToken and SyntaxTrivia. SyntaxNode represents a non-terminal node in the language grammar, and is an immutable object. There are a variety of different derived types for different language elements, and also a Kind property that identifies what a particular SyntaxNode represents. SyntaxNodes are arranged into a tree, connected by the Parent and ChildNodes properties. Each type derived from SyntaxNode also includes named properties for each element of that particular node type. For example, an IfStatementSyntax has properties for the IfToken, OpenParenToken, Condition, CloseParenToken, Statement, and ElseClauseOpt.

SyntaxToken objects represent the terminals of the language grammar. They never have children, but do have a Parent property that is always a SyntaxNode. For performance reasons, SyntaxToken is a struct type. You can use the Kind property to determine what sort of token something is.

Finally, things like whitespace and comments that can appear anywhere in a source file are represented as SyntaxTrivia. Each SyntaxTrivia is attached to a single token, as either LeadingTrivia or TrailingTrivia. SyntaxTrivia does not have a parent property, but it is possible to get the associated token via the Token property. Like SyntaxToken, SyntaxTrivia is a struct for performance reasons.
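To make the three types a little more concrete, here is a minimal sketch that pulls one node, one token and that token's trivia out of a small tree. The node and token members are the ones named in this post; reading a Kind property off SyntaxTrivia is an assumption on my part, and KeyTypesDemo and FirstIf are hypothetical names.

```csharp
using System;
using System.Linq;
using Roslyn.Compilers;
using Roslyn.Compilers.CSharp;

static class KeyTypesDemo
{
    // Hypothetical helper: parse the text and return its first if statement.
    public static IfStatementSyntax FirstIf(string text)
    {
        var syntaxTree = SyntaxTree.ParseCompilationUnit(text);
        return (IfStatementSyntax)syntaxTree.Root
            .DescendentNodes()
            .First(n => n.Kind == SyntaxKind.IfStatement);
    }

    static void Main()
    {
        var ifStatement = FirstIf(
@"class C
{
    void M()
    {
        // check the flag
        if (true) return;
    }
}");

        // SyntaxNode: has a Kind and a Parent node.
        Console.WriteLine("Node kind: {0}", ifStatement.Kind);
        Console.WriteLine("Parent kind: {0}", ifStatement.Parent.Kind);

        // SyntaxToken: a terminal whose Parent is the node that owns it.
        var ifToken = ifStatement.IfToken;
        Console.WriteLine("Token kind: {0}", ifToken.Kind);
        Console.WriteLine("Token parent kind: {0}", ifToken.Parent.Kind);

        // SyntaxTrivia: the comment line and indentation before "if" should
        // be attached to the token as leading trivia.
        // Assumption: SyntaxTrivia exposes a Kind property like nodes and tokens.
        foreach (var trivia in ifToken.LeadingTrivia)
        {
            Console.WriteLine("Leading trivia kind: {0}", trivia.Kind);
        }
    }
}
```

The comment sits on its own line so that it belongs to the "if" token that follows it and not to the preceding brace.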

Parsing

Okay, enough talk, let’s see some code. First off, let’s see how we can get hold of a SyntaxTree, and what it looks like. You can parse some text using SyntaxTree.ParseCompilationUnit:

    using System;
    using System.Linq;
    using Roslyn.Compilers;
    using Roslyn.Compilers.CSharp;

    class Program
    {
        static void Main(string[] args)
        {
            // Parse a small program from a verbatim string.
            var syntaxTree = SyntaxTree.ParseCompilationUnit(
    @"using System;
    class C
    {
        static void M()
        {
            if (true)
                Console.WriteLine(""Hello, World!"");
        }
    }");

            // Find the first if statement anywhere below the root.
            var ifStatement = (IfStatementSyntax)syntaxTree.
                Root.
                DescendentNodes().
                First(n => n.Kind == SyntaxKind.IfStatement);

            // Print the body of the if statement (its Statement property).
            Console.WriteLine("Statement is '{0}'.", ifStatement.Statement);
        }
    }

In this example, we parse some simple text and look at the resulting SyntaxTree. We search through the syntax tree looking for if statements and grab the first one we find. We cast that to the more strongly typed IfStatementSyntax type, and then we can look at the properties specific to if statements. Here, we looked at “Statement”, which is a SyntaxNode that represents the statement to execute if the condition is true.

Construction

Sometimes you don’t want to parse existing source code to build up a SyntaxTree. Instead you want to create brand new SyntaxNodes. The most common case for this is that you want to add or change some existing syntax, and you need the new things to transform to. With the Roslyn CTP, the way you create new SyntaxNode objects is through the set of factory methods on the “Syntax” class.

    var newStatement = Syntax.ExpressionStatement(
        Syntax.InvocationExpression(
            Syntax.MemberAccessExpression(
                SyntaxKind.MemberAccessExpression,
                Syntax.IdentifierName("Console"),
                name: Syntax.IdentifierName("WriteLine")),
            Syntax.ArgumentList(
                arguments: Syntax.SeparatedList<ArgumentSyntax>(
                    Syntax.Argument(
                        expression: Syntax.LiteralExpression(
                            SyntaxKind.StringLiteralExpression,
                            Syntax.Literal(
                                text: @"""Goodbye everyone!""",
                                value: "Goodbye everyone!")))))));

Here, we’re constructing a new statement to represent a call to “Console.WriteLine("Goodbye everyone!");”. As you can see, this involves specifying quite a lot of detail. That’s one of the drawbacks of a full-fidelity model: you need to specify all of those details when you create things. If you look closely, you can also see that in some places I’m using C# 4.0’s named arguments. The reason for this is that many parameters in the Syntax construction APIs are defined to be optional, so that you can leave out a detail whenever possible. For example, many tokens are always exactly the same (like keywords, punctuation, etc). If you omit an argument for them, a default version will be constructed for you. If you’d like, you CAN specify every token, including trivia for each one.

One thing you can do to reduce the verbosity is to sneak in calls to the various Syntax.Parse?? methods for elements of the structure you are building up. This way, you can inline the code you want to have in a string instead of specifying the entire structure.
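For instance, the “Goodbye everyone!” statement could probably be built from a string in one call. This is only a sketch: it assumes that Syntax.ParseStatement is one of those Parse methods and that it returns a StatementSyntax, so check the exact name and signature against the CTP. ParseShortcutDemo and BuildGoodbyeStatement are hypothetical names.

```csharp
using System;
using Roslyn.Compilers;
using Roslyn.Compilers.CSharp;

static class ParseShortcutDemo
{
    // Assumption: Syntax.ParseStatement exists in the CTP and parses a single
    // statement from text, returning a StatementSyntax.
    public static StatementSyntax BuildGoodbyeStatement()
    {
        return Syntax.ParseStatement(
            @"Console.WriteLine(""Goodbye everyone!"");");
    }

    static void Main()
    {
        var newStatement = BuildGoodbyeStatement();

        // The result can be used wherever a hand-built statement would be.
        Console.WriteLine(newStatement.Kind);
        Console.WriteLine(newStatement);
    }
}
```

The trade-off is that the string is only checked when it is parsed at run time, whereas the factory methods are checked by the compiler.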

Rewriting

The next major piece of functionality in the Syntax API is re-writing, or modifying syntax trees. There are two approaches to this. The simplest is to just use one of the Replace APIs on SyntaxNode. The idea is that you get back a new tree, but with one of the descendants replaced.

    var newTree = syntaxTree.Root.ReplaceNode(
        ifStatement.Statement,
        newStatement).Format();
    Console.WriteLine(newTree);

We’ve successfully replaced the body of the if statement. Because the trees are immutable, the original “SyntaxTree.Root” is completely unaffected. The newTree variable really is a brand new tree, but under the covers we actually re-use a lot of the data in an invisible fashion to be more efficient both in time and memory. Of note in the sample is the call to “Format()” on the new tree, which will apply a default formatting, and ensure that there is whitespace where there needs to be for the tree to round trip.
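If you want to convince yourself that the original tree is untouched, a small sketch like the following compares its text before and after the replace. It sticks to the calls used in this post (ParseCompilationUnit, DescendentNodes, ReplaceNode, ToString); the replacement statement is borrowed from a second parsed tree purely to keep the example short, and ImmutabilityDemo, FirstIf and OriginalSurvivesReplace are hypothetical names.

```csharp
using System;
using System.Linq;
using Roslyn.Compilers;
using Roslyn.Compilers.CSharp;

static class ImmutabilityDemo
{
    // Hypothetical helper: return the first if statement in a parsed tree.
    static IfStatementSyntax FirstIf(SyntaxTree tree)
    {
        return (IfStatementSyntax)tree.Root
            .DescendentNodes()
            .First(n => n.Kind == SyntaxKind.IfStatement);
    }

    // True when the original tree has the same text before and after ReplaceNode.
    public static bool OriginalSurvivesReplace()
    {
        var original = SyntaxTree.ParseCompilationUnit(
            @"class C { void M() { if (true) System.Console.WriteLine(""Hello, World!""); } }");
        var donor = SyntaxTree.ParseCompilationUnit(
            @"class D { void N() { if (true) System.Console.WriteLine(""Goodbye everyone!""); } }");

        var textBefore = original.ToString();

        // ReplaceNode hands back a new root; "original" itself is not modified.
        var newRoot = original.Root.ReplaceNode(
            FirstIf(original).Statement,
            FirstIf(donor).Statement);

        Console.WriteLine(newRoot);
        return original.ToString() == textBefore;
    }

    static void Main()
    {
        // Should print True if the original tree is unaffected, as described above.
        Console.WriteLine(OriginalSurvivesReplace());
    }
}
```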

In some cases where you want to re-write multiple nodes, you might want more control over how the tree gets changed. To support this, the Roslyn CTP includes an abstract SyntaxRewriter class. I’ll give you a simple example that rewrites the tree above so that “M” is named “Main”, but there are obviously more interesting things that can be done here:

    class Rewriter : SyntaxRewriter
    {
        protected override SyntaxToken VisitToken(SyntaxToken token)
        {
            if (token.Kind != SyntaxKind.IdentifierToken ||
                token.GetText() != "M")
            {
                return token;
            }

            return Syntax.Identifier("Main");
        }
    }

And you can use it like this:

    var rewriter = new Rewriter();
    newTree = rewriter.Visit(newTree);
    Console.WriteLine(newTree);

In this example, I’ve only used “VisitToken”, but there are actually a TON of “Visit” methods that you can override to rewrite specific types of nodes in a tree.

Closing

We’ve taken a brief look at some of the key concepts and types of the Syntax portion of the API in the Roslyn CTP, including parsing, constructing nodes from scratch, and rewriting existing nodes. Next time, we’ll move to the next stage of the compilation pipeline, and take a look at how Symbols are exposed in Roslyn.