Roslyn Primer – Part I: Anatomy of a Compiler

So, you’ve heard that VB (and C#) are open source now and you want to dive in and contribute. If you haven’t spent your life building compilers, you probably don’t know where to start. No worries, I’ll walk you through it. This post is the first of a series of blog posts focused on the Roslyn codebase. They’re intended as a primer for prototyping language features proposed on the VB Language Design repo; and contributing compiler and IDE features, and bug fixes on the Roslyn repo, both on GitHub. Despite the topic, these posts are written from the perspective of someone who’s never taken a course in compilers (I haven’t).

Phases of compilation

At a high level, here’s what happens:

  1. Scanning (also called lexing)
  2. Parsing
  3. Semantic Analysis (also called Binding)
  4. Lowering
  5. Emit

Some phases overlap and infringe on others a bit but that’s basically what the compiler is doing.

Compiling is a lot like reading

By analogy, when you read this blog post you look at a series of characters. You decide that some runs of letters form words, some is punctuation, some is whitespace. That’s what the scanner does. Then you decide that some punctuation groups things into a parenthetical, or a quotation, or terminates a sentence. Some dots are decimal points in numbers or abbreviations or initialisms. That’s what the parser does. Then you import your massive vocabulary of what words mean and look at all the words and decide what those words refer to and in combination what the sentences mean. Occasionally, you find a word with multiple meanings (overloaded terms) and you look at some amount of context to decide which of the multiple meanings is intended (like overload resolution). All of that assignment of meaning is semantic analysis. Lowering and emit don’t really have natural language equivalents other than perhaps translating from one language to another (think of it like translating an article from modern English to simplified English to another very primitive language).

But you’re way smarter than a compiler

Of course, you don’t do all of this one phase at a time. You don’t read a sentence in three passes because you can usually pick out words and sentences and their meaning all at once. But the compiler isn’t as smart as a human, so it does these things in phases to keep the problems simple. Every now and then, I get a bug report where someone says, “the compiler decided I meant that but obviously I meant this other thing because that doesn’t make any sense”. The compiler doesn’t know something doesn’t make sense until phase 3. And once it knows that, it can’t go back to phase 1 or 2 to correct itself (unlike you and me).

“Compiling” HelloWorld

Let’s go back to programming languages and look at what the compiler does to compile a simple program. The simple program just consists of the statement Call Console.WriteLine(“Hello, World!”)

Scanning

The Scanner runs over all the text in the files and breaks down everything into tokens:

  • Keyword – Call
  • Identifier – Console
  • Dot
  • Identifier – WriteLine
  • Left Parenthesis
  • String – “Hello, World!”
  • Right Parenthesis

These tokens are just like words and punctuation in natural languages. Whitespace isn’t usually important since it just separates tokens. But in VB, some whitespaces, like newlines, are significant and  interpreted as an “EndOfStatement” token.

Parsing

The Parser then looks at the list of tokens and sees how those tokens go together:

  • Parse a statement.
  • Look at the first token. Found a Call keyword. That starts a Call statement. Parse a Call statement.
  • A Call statement starts with the Call keyword and then an expression. Parse an expression.
  • Look at the next token. Found an identifier “Console”. That’s a name expression.
  • This might be part of a bigger expression. Look for things that could go after an identifier to make an even bigger expression.
  • Found a dot. An identifier followed by a dot is the beginning of a member access expression. Look for a name. Found another identifier “WriteLine”. This is a member access that says “Console.WriteLine”.
  • Still could be part of a bigger expression (maybe there are more dots after this?). Look for another continuing token.
  • Found a left parenthesis. You can’t just have a left parenthesis after an expression – this must be an invocation expression.
  • An invocation looks like an expression followed by an argument list. An argument list is a list of expressions (it’s more complicated than this but ignore that) separated by commas. Parse expressions and commas until you hit a right parenthesis.
  • Found a string literal expression. The argument list has one argument.

The parse produces a tree that looks like this:

  • CallStatement
    • CallKeyword
    • InvocationExpression
      • MemberAccessExpression
        • IdentifierName
          • IdentifierToken
        • DotToken
        • IdentifierName
          • IdentifierToken
        • ArgumentList
          • OpenParenthesisToken
          • SimpleArgument
            • StringLiteralExpression
              • StringToken
            • CloseParenthesisToken
          • EndOfFileToken

That’s a source file!

Semantic Analysis

To be clear, the compiler still has no idea (unlike you and I) that Console.WriteLine is a shared method on the Console class in the System namespace and that it has an overload that takes one string parameter and returns nothing. After all, anyone could make a class called Console. Maybe there isn’t a method called WriteLine. Maybe WriteLine is a type. That’s a dumb name for a type but the compiler doesn’t know that. If it is a type, then the program doesn’t make any sense. Piecing all of that together is semantic analysis.

The Binder looks at the references provided to the compiler: the namespaces, types, type members in those references, the project-level imports, and the imports in your source file. And then it starts figuring out what’s what.

  • What does Console mean?
    • Is there something called Console in scope?
      • Checking the containing block: No.
      • Checking the containing method: No.
      • Checking the containing type: No.
      • Checking the containing type’s base types: No.
      • Checking the containing type’s containing type or namespace: No.
      • Checking the containing namespace’s containing namespaces: No.
      • Are there import statements? No.
      • Are there project-level imports? Yes.
        • Check each namespace imported one by one.
        • Found one and only one? Yes.
      • Console is a type. This must be a shared member.
      • Look for shared member named WriteLine in [mscorlib]System.Console type.
      • Found 19 of them. They’re all methods.
      • Bind all the argument expressions.
        • One argument is a string literal. String literal has content “Hello, World!” and type of [mscorlib]System.String.
      • Based on number and types of the arguments, it checks how many of the 19 methods could take one string argument. In VB, the answer is 14. But there are rules that decide which ones are better and it turns out that the one that actually takes a string is better than the one that takes object, or the one that takes a string but passing an empty ParamArray argument list, or performing an implicit narrowing conversion to any of the numeric types, Boolean, or the intrinsic conversion from string to Char or Char array.

The compiler has determined that the program is an invocation of the shared void [mscorlib]System.Console::WriteLine(string) method. Passing the string literal “Hello, World”.

Lowering

What lowering does is take high-level language constructs that only exist in VB and translate them to lower-level constructs that the CLR/JIT compiler understands. Here are some examples of things that don’t exist at the Intermediate Language (IL) level:

  • Loops: IL only has goto—called “br” for branching—and conditional goto—br.true for branch when true and br.false for branch when false.
  • Variable scope: All variables are “in scope” for the entire method.
  • Using blocks: IL only has try/catch/finally so the compiler lowers a using block into a try/catch/finally block that initializes a variable and disposes of it in the finally block.
  • Lambda expressions: The compiler first translates lambdas into ordinary methods. If they capture any local variables, the compiler has to translate those variables into fields of an object behind the scenes.
  • Iterator methods: The compiler translates an Iterator method with Yield statements inside to a giant state machine, which is essentially just a giant Select Case that says, “last time you called me I was at step 1 so skip to step 2 this time”.

Even though IL has a much simpler set of instructions than a higher-level language like VB everything you can write in a VB program is ultimately composed of simple instructions. In the same way that the greatest works of English literature still use just 26 letters. All of the simplicity, safety, and expressiveness of a higher-level language is what makes VB so powerful.

This example of a simple call to a Shared method isn’t very complex. IL already understands method calls and string literals so there isn’t really any lowering to be done.

Emit

Emit is simple. Once the compiler digests your program into simple operations the CLR understands, it writes out these operations (usually to disk) into a binary file in a well-specified format.

Wrapping up

In this post, we looked at what a compiler does abstractly and how that process compares to how a human being might read a page of text. In the next post, we’ll dive into how the Visual Basic compiler specifically is organized.