A Java to x86 compiler written in F#

A little personal information about me, I’m currently working on a masters degree at the University of Washington. And a few weeks ago I completed taking a course in Compiler Construction, where the course project was to build a simple Java compiler that generated x86 executables.

 

While most students chose to write their compilers in Java or C#, my partner and I wrote ours in F#. Though I won’t be publishing the source code - since it would be too tempting for other students using Appel’s Modern Compiler Implementation in Java – I would like to point out where F# makes the task of writing a compiler much easier than say, C#. F#’s superiority in this domain can largely be attributed to discriminated unions, as you will soon find out.

 

Abstract Syntax Tree, Parsers, and Lexer

The first phase of the compiler was perhaps the easiest to implement in F#. Rather than representing some complex object hierarchy and use polymorphism, the AST can be represented as a series of discriminated unions.

Ast.fs snippet

type Statement =

    | CompoundStatement of Statement list

    | If of Expression * Statement * Statement // If ([expr]) [stmt1] else [stmt2]

    | While of Expression * Statement

    | PrintLine of Expression

    | DeclarationOrAssignment of VariableOrType * DeclarationOrAssignmentTail

type Ast =

    | MiniJavaProgram of Ast * Ast list // MainClass, ClassDeclaration list

    | MainClass of string * Ast // Name, main method

    | MainMethod of string * Type * Statement // Name of parameter, return type, Statement

    | Class of string * Ast list // Name, VarDeclaration or Method

    | DerivedClass of string * string * Ast list // Name, BaseClass, VarDeclaration or Method

    | VariableDeclaration of string * Type // Type, name

    | Method of string * Type * Ast list * Ast // Name, (return) Type, (parameter) Types, MethodBody

    | MethodBody of Statement list * Expression // Statements, (return) Expression

Parser.fsy snippet

// These are the rules of the grammar along with the F# code of the

// actions executed as rules are reduced. In this case the actions

// produce data using F# data construction terms.

start: MiniJavaProgram { $1 }

MiniJavaProgram:

      | MainClass { MiniJavaProgram($1, []) }

      | MainClass ClassDeclarations { MiniJavaProgram($1, List.rev $2)}

MainClass:

      | CLASS ID LCURLY MainMethod RCURLY { MainClass($2, $4) }

MainMethod:

      | PUBLIC STATIC VOID MAIN LPAREN STRING LBRACKET RBRACKET ID RPAREN LCURLY Statement RCURLY

                                                { MainMethod($9, MainMethodType, $12) }

ClassDeclarations:

      | ClassDeclaration { [$1] }

      | ClassDeclarations ClassDeclaration { $2 :: $1 }

ClassDeclaration:

      | CLASS ID LCURLY RCURLY { Class($2, []) }

      | CLASS ID EXTENDS ID LCURLY RCURLY { DerivedClass($2, $4, []) }

      | CLASS ID LCURLY ClassMakeupDeclarations RCURLY { Class ($2, (List.rev $4)) }

      | CLASS ID EXTENDS ID LCURLY ClassMakeupDeclarations RCURLY { DerivedClass ($2, $4, ($6))}

 

Type Checking and Symbol Table Generation

Once we were able to successfully lex and parse source code, we moved onto type checking and generating a symbol table. Like most compilers, we simply used the Visitor design pattern . Which if implemented in C# requires some funky notion of double dispatch. That is, the object being visited has a method called ‘visit’ or ‘accept’ which is given a parameter of type Visitor, then, the object being visited calls ‘visit’ on the visitor passing ‘this’ as parameter.

 

The problem with this is that it adds a lot of lame boiler plate code. For example, from the Wikipedia article each ‘part’ needs to implement the ‘accept’ method, which in turn calls the ‘visit’ method on the visitor. Via method overloading, the right visitor method will be called.

class Engine {

    public void accept(Visitor visitor) {

        visitor.visit(this);

    }

}

 

class Body {

    public void accept(Visitor visitor) {

        visitor.visit(this);

    }

}

 

In F# with discriminated unions we can write a single ‘walker’ for the discriminated union types, and keep all the visitation code in one place. (Rather than decorating each node.) This keeps the code much simpler.

Visitor.fs snippet

type Walker =

    // Walks the ast tree with the given visitor

    static member WalkTree (astRoot:Ast, visitor:Visitor) =

        let walker = new Walker(visitor)

        walker.WalkAst(astRoot)

   

    // ... snip ...

    // Methods which walk the AST node, calling our visitor methods as nessecary.

    member private this.WalkAst (astNode : Ast) =

        // Pre visit

        visitor.PreVisitAst astNode

       

        let walkAstList (astList : Ast list) =

            astList |> List.iter (fun nast -> this.WalkAst nast)

       

        let walkStmtList (stmtList : Statement list) =

            stmtList |> List.iter (fun nstmt -> this.WalkStatement nstmt)

       

        // Recurse

        let _ = match astNode with

                | MiniJavaProgram (ast, astList) -> this.WalkAst ast; walkAstList astList

       | MainClass (_, ast) -> this.WalkAst ast

                | MainMethod (_, _, stmt) -> this.WalkStatement stmt

               

                | Class (_, astList) -> walkAstList astList

                | DerivedClass (_, _, astList) -> walkAstList astList

               

                | VariableDeclaration (_, varType) -> this.WalkType varType

                | Method (_, retType, astList, ast) -> this.WalkType retType; walkAstList astList; this.WalkAst ast

                | MethodBody (stmtList, expr) -> this.WalkExpression expr; walkStmtList stmtList

       

        // Post visit

        visitor.PostVisitAst astNode

Executable Creation

Generating the assembly code is straightforward by walking your IR nodes and building up a list of ‘Instructions’, represented by the following DU:

type Instruction =

    | MovRR of Register * Register

    | MovRI of Register * int

    | MovRM of Register * Memory

    | MovMA of Memory * string

    | MovMR of Memory * Register

    | Comment of string
| ...
override this.ToString() =

        match this with

        | MovRR(dstReg, srcReg) -> " mov " + dstReg.ToString() + ", " + srcReg.ToString()

        | MovRI(dstReg, i) -> " mov " + dstReg.ToString() + ", " + i.ToString()

        | MovRM(dstReg, m) -> " mov " + dstReg.ToString() + ", " + m.ToString()

        | MovMA(m, address) -> " mov " + m.ToString() + ", " + "OFFSET " + address

        | MovMR(m, srcReg) -> " mov " + m.ToString() + ", " + srcReg.ToString()

        | Comment note -> "; " + note

 

Once the list of instructions has been generated, we then needed to build executables. To link them you could use ml.exe and cl.exe, Microsoft’s C/C++ compiler and linker. But since requiring a second step is lame, we embedded the files into .RESX files and exported them at runtime. While the F# compiler supports this easily, we actually did this part in C# so that we could take advantage of Visual Studio’s Managed Resource Designer.

 

  /// <summary>

            /// Takes the given assembly file and compiles it using the hard-coded boot loader code.

            /// </summary>

            public static string LinkAsmFile(string asmFile)

            {

                  try

                  {

                        // First, dump ML and boot.obj to disk.

                        string tempPath = Path.GetTempPath();

                        string mlExeFile = Path.Combine(tempPath, "ml.exe"); ;

                        // Extract the binary resources embedded in this DLL and write to file.

                        WriteFile(MasmResources.MASMResources.ml, mlExeFile);

                        // Shell execute ml.exe on the target assembly file.

                        // This produces a .obj file.

                        ProcessStartInfo procStartInfo = new ProcessStartInfo(mlExeFile, "/c /Cx /coff /Zi \"" + asmFile + "\"");

                        procStartInfo.RedirectStandardOutput = true;

                        procStartInfo.UseShellExecute = false;

                        Process linkerProc = Process.Start(procStartInfo);

                        linkerProc.WaitForExit();

                        if (linkerProc.ExitCode != 0)

                              return linkerProc.StandardOutput.ReadToEnd();

                        else

                              return "";

            }

                  catch (Exception ex)

                  {

                        return ex.Message;

                  }

            }

Sounds good, but notice that to run our F# compiler you need F#MiniJava.exe (the program), MASMResources.dll (C/C++ linker), and FSharp.Core (F# libraries). That all sounds lame. Fortunately F# has two awesome command line flags that will essentially embed any required assembly references into your executable:

 

--standalone:

      Statically link the F# library and all transitively referenced DLLs

      that depend on the F# library into the assembly being generated.

      Disables the generation of the F# interface and optimization

      resources and renames all types beginning with Microsoft.FSharp,

      making them private to the target assembly. There must be no

      duplicate type names across any statically linked

      assemblies, no code may use reflection on any of the

      types in those assemblies, and no code may export F# library types

      as part of an API.

--static-link <string>:

      Statically the given assembly and all transitively referenced DLLs

      that depend on this assembly. NOTE: use an assembly name e.g. mylib

      , not a DLL name, e.g. mylib.dll.

 

(Note: many bad things can happen if you do this, from security vulnerabilities to subtle codegen errors. These flags may be removed in a future release of F# and use at your own risk.)

 

Having taken an undergraduate course in compilers I found the course material to be challenging, but when it came to the project implementation F# made it a breeze.