Outside In


One of the things I love about working here is that I get to see a ton of cool stuff that never ends up actually in a product.  Internal projects, fun employee excursions, research endeavors, and in this particular instance an awesome new system one of our QA guys Ben had come up with to help test one of the C# IDE features.


In this case it was a system designed for testing the new formatting engine we’ve come up with for VS2005.  Instead of just writing test cases that verified that switching options had the desired effect, he decided to try to attack the problem of verifying that any combination of formatting options would work as expected.  What was his proposal?


Before explaining it, first let me step back a second.  When you study how compilers normally work you learn that it’s basically a pipeline of very basic operations.  A lexer (or scanner, or tokenizer) breaks an initial input stream into a sequence of tokens.  These tokens are fed into a parser which then parses out relevant language constructs (like classes, methods, etc.) into a parse-tree.   Finally that parse tree is analyzed and code is generated.   Now, in most languages the results of the lexer are processed a bit before being fed into the parser.  For example, in C# whitespace tokens, comment tokens, and inactive region tokens are stripped before being fed into the parser.  And, as you can see from the language spec the grammatical constructs of the language are defined without ever referring to those tokens.  This makes things much easier, otherwise you’d need to write the grammar like so:


       IfStatement := “if” (WhiteSpace | Comment)* “(” (WhiteSpace | Comment)* Expression “)” (WhiteSpace | Comment)* StatementOrBlock (WhiteSpace | Comment)*


which get’s highly annoying.  By just stripping out those “noisy” tokens we can leave the grammar looking like:


       IfStatement := “if” “(” Expression “)” StatementOrBlock


Which is far easier to understand and code against.  (Quiz: there are a couple of places in C# where whitespace is important, do you know where?  Note: this is disregarding the obvious space necessity between identifiers and/or keywords)


Now, let’s consider a single formatting option: “place a space between ‘if’ and ‘(‘”


Pretty simple really.  So how would you go about testing that it worked?  You could just create a few inputs and verify that, “yes, it behaves properly on these files after I manually checked the results”.  Well, Ben’s approach was to say: “the entire point of a formatting engine is to modify the whitespace that users find relevant but that the language does not.  So since we’re saying whitespace is actually relevant, why not encode it into the grammar and verify it with the parser since it’s incredibly good at validating token streams?”  i.e. let’s change the parser rule in this case into:


       IfStatement := “if” SpaceBetweenIfAndOpenParen “(” Expression “)” StatementOrBlock


(or, more specifically, have the parser check what the SpaceBetweenIfAndOpenParen  option is set to and only accepts the construct if the whitespace present matches the whitespace expected by the modified grammar).


Also, other constructs might not only specify the spacing they expect but also the indentation affect they have.  i.e.


       Block :=  “{” IncreaseIndent “}” DecreaseIndent


These rules tell the parser to increase or decrease the current level of indent, so after reading each line it will check to make sure that it reads in the right amount of whitespace indent before processing the rest of the tokens.


Suddenly by doing this he’s taken verification of this feature out of manual testing and put it into the automatic hands of the parser.   Suddenly we can run the parser against huge real world corpuses (like the BCL) as many times as we want, each time with a different set of formatting options.  If the formatter messes things up then we’ll know because it will create unparsable code!  Also, by combining this with tools we have that can spit out arbitrary C# code based on the grammar of the language, you can get near total coverage of pretty much all variations of code constructs along with all variations of formatting options. 


Maybe I’m weird, but I think that stuff like this is so totally cool.  Often times i’m unsure about how to test some of the code I write because the inputs to the problem are basically: “any possible piece of code that the user could write” and that’s such an unbounded problem that you often have to make a lot of simplifying assumptions in order to get a good handle on it.  For example, we can assume that it’s ok that we might screw up intellisense if you paste the contents of Romeo+Juliet into the IDE.  However, what’s the line between “correct code” and “Romeo+Juliet” that separates what we need to handle well and what we can handle poorly?  That’s always unclear and the choices we make sometimes ends up leading to situations where we don’t handle what a user thinks we should handle.    But with this sort of automated approach that can start generating and testing a wide variety of inputs we can now see in advance what we do and don’t do well.  Then we can make the decision about how we want to proceed long before it ends up in your hands.   In the old system if there was a problem the build would have to get to you, you’d have to complain, we’d have to hear about it, and then we’d have to consider fixing it.  Now, we can know about a whole host of problems far in advance, and the earlier we know about them the better chance there and the easier it becomes to fix them.


Anyways, I just wanted to share this with you since this work isn’t something that will be visible to you in the final product, although the results of the work will be!


Comments (16)

  1. Dean Harding says:

    The only place I know of where whitespace counts is in the preprocess directives, like #if DEBUG etc, where it must be all one-per-line style things.

    Although, I think I found a bug while trying a few things out. The following code compiles fine:

    int i = 0;

    i = i++ + ++i;

    but this does not:

    int i = 0;

    i = i+++++i; // CS0131

    nor does this:

    int i = 0;

    i = i++ +++i; // CS0131

    but this does:

    int i = 0;

    i = i+++ ++i;

    It’s pretty pathalogical, I know, but I don’t see why it shouldn’t compile (it compiles in C++ for example)

  2. I just searched the C# Specification and it looks like the only place where whitespace is important is in preprocessor directives as indicated above.

  3. duncan says:

    Well …. there is some obvious places like between a type and an identifier…

    inti = 7;

    is obviously dependant on some whitespace rules 😉

  4. CyrusN says:

    Dean: The preprocessor is one place.

    As to the +++++ examples.

    Consider:

    i+++++i;

    that gets tokenized as:

    i++ ++ +i;

    which is semantically illegal

    i+++ ++i is fine because that becomes

    i++ + ++i;

    the rules on tokenization say to match as much of the token stream as possible. So when seeing:

    i+++++i;

    it will not become

    i++ + ++i;

    There another couple of places that i was thinking about where whitespace is relevant :)

  5. CyrusN says:

    Alex: Which version of the spec?

  6. CyrusN says:

    Duncan: You beat me by like seconds. I was just updating the post to remove that case 😉

  7. Dean Harding says:

    Actually, I posted my first message just before leaving the office, and as soon as I left I came to that conclusion as well.

    It’s different to C++, but since it’s so pathalogical (and probably easier to implement since you don’t need to keep any state between tokens) then I can forgive it :)

    In which case, it’s another case where whitespace is important.

  8. Radu Grigore says:

    So you have one parser (and lexer) for each combination of options? (And what is SpaceBetweenIfAndOpenParen: a token or a non-terminal?)

  9. aditm says:

    One place i can figure out where whitespace is relavent is use of @ symbol

    string test= @ "test"; // wont compile

    string test= @"test"; //compiles

    can’t remember any other place though

    -Aditya

  10. Robert Kozak says:

    Cyrus,

    I could not see any comments until after I submitted. I wouldn’t have bothered submitting because it seems I didn’t add much to the discussion.

    — Robert

  11. Robert Kozak says:

    (Quiz: there are a couple of places in C# where whitespace is important, do you know where? Note: this is disregarding the obvious space necessity between identifiers and/or keywords)

    #pragma directives and #pragma warnings

  12. Mark Mullin says:

    Two things that might interest you in this area

    1) Sniff+ – it is pretty strong in it’s ability to chew through suboptimal or just plain wrong code and still make sense of it http://www.windriver.com/products/development_tools/ide/sniff_plus/

    2) ANTLR – Terrence Parr’s magnum opus – in his PhD thesis he pretty much changed the face of the world in compiling, by coming up with systems that can do infinite lookahead if necessary

    http://www.antlr.org/

  13. CyrusN says:

    Mark: I’m very aware of ANTLR (and i try to keep up to date with all the development and research done around parsers. LL, LR, whatever, they’re all good to me :) )

  14. CyrusN says:

    Radu: "So you have one parser (and lexer) for each combination of options? (And what is SpaceBetweenIfAndOpenParen: a token or a non-terminal?) "

    No, you have a parser that chooses how it parses based on the configuration of the formatter.

    So: SpaceBetweenInfAndOpenParen is a non-terminal (or Variable) that expands out to nothing if it is not set or expands out to " " if it is set.

  15. TAG says:

    1. Boring example: MyCollection<Nullable<int>>.

    2. Whitespaces are important inside strings and characters. You even had a bug with this – converted white space to tab inside strings ;-))

    http://lab.msdn.microsoft.com/productfeedback/viewfeedback.aspx?feedbackid=FDBK21931

  16. Robert Kozak says:

    Cyrus,

    Why is it that we can’t see comments until we post one?

    — Robert