Ambiguities

I spent a few hours yesterday working on a rather interesting problem that we've never really dealt with in the IDE.  As you might expect, VC# works like a pretty standard compiler.  There is a lexical tokenizer which takes the raw input source and produces a token stream, a parser which takes that token stream and produces a parse tree, and a semantic analyzer which interprets those trees and produces some result from them (such as an assembly, or the information necessary to do refactorings safely).  Now, where this compiler differs (in my experience) from others is the extent to which it is error tolerant.
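(To make that pipeline concrete, here's a minimal sketch.  These interfaces and types are made up for illustration; they're not the actual VC# internals.)

using System.Collections.Generic;

// Hypothetical stand-ins for what each phase produces.
class Token { public string Kind; public string Text; }
class ParseTree { /* nodes elided */ }
class SemanticModel { /* symbols elided */ }

// The three phases described above.
interface ILexer { IEnumerable<Token> Tokenize(string source); }
interface IParser { ParseTree Parse(IEnumerable<Token> tokens); }
interface IAnalyzer { SemanticModel Analyze(ParseTree tree); }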

The first compiler I ever wrote was an exercise in clean design and simple mathematical principles.  It would compile a correctly written program with ease but would balk at the first mistake of any kind that it might encounter.  If you typed something like @ in the stream the lexer would die immediately, and if you had:

 let func b = if (b < 0 then "negative" else "positive";;

Poof, death. 

Of course, this is pretty unacceptable for our IDE.  One expects that people will make mistakes, but beyond that, the act of typing itself will constantly be introducing errors at every level of the compile process.  If you have:

        public static void main(string[] args) {}

        //start typing "public void bar(_" here

        public void foo() {}

introducing 'public' will cause a semantic failure because of the duplicated accessibility modifier.  Introducing 'void' will cause a parse failure because the parser will expect to parse out a method name at that point but will instead see 'public'.  Introducing the '_' (when you meant to type ')') will cause a tokenization error, etc. etc.  So as you're typing, your program is constantly entering and leaving a state of wellformedness (achtung! making up words now!), although for the most part it will not be well formed.

Different systems approach this in many interesting ways.  In OCaml, what you type might be rejected because of lexical/parsing/semantic errors, but once accepted it becomes part of your environment and its meaning is immutable.  In VC# we follow a different approach.  Rather than forcing you to get a type or member just right before we accept it into our type system, we try to bend over backwards to accept it even in the presence of errors.

We feel that this is valuable because we don't want to hinder our users from typing anywhere they want, and we shouldn't force them to get everything just right before we'll give them things like intellisense.   Also, in the cases where you want intellisense you'll almost always have errors in your code.  Why?  Well, consider the two main ways people get intellisense: completion lists and parameter help.  Completion lists are almost always triggered by a <dot>, i.e.

this<dot>

When you've typed that, it's almost certainly the case that your code is not parsable.  If we refused to parse and understand the type you were in because of that, then we certainly couldn't give you a completion list of the members of 'this' type.  (Parameter help and other intellisense facilities have the same expectation.)

However, there are times when this tolerance can be quite confusing for a user.  Take the following code:

class C {
    public string DoStuff() {
        return i.ToString();
    }

    //fields
    int i;
}

Now say the user manipulates that into the following:

class C {
    public string DoStuff(int j) {
        if (j > 0) {
            return i<dot> //no intellisense here
        }

    //fields
    int i;
}

What happened?  Well, because of the introduction of the "if (j > 0) {" we will now end up parsing "int i" as a local in the current method (which you wouldn't have access to at that point in the code) instead of as a field in the type (which you would have access to).  There are many heuristics we currently use (or could use) to try to work around these problems: paying attention to the placement of curly braces to get an idea of whether you are ending the current block or an outer block, or backtracking when we don't see the end of the current type and trying to understand the block again given that knowledge.  However, in the end the current model is just heuristics: pay attention to what's going on locally with the parse and make the best guess as to what the user was doing.  But a guess is a guess, and we will get it wrong some of the time.  In the above example some users might expect 'i' to be a field and some users might expect 'i' to be a local.  Is it possible to please both?
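To give a feel for just how little the parser has to go on, here's a toy sketch (purely illustrative, not our actual recovery code) that measures how many closing braces would have to be invented to balance the code above.  The count is the easy part; the heuristics are all about guessing where each one goes:

using System;

class BraceRecovery {
    // Counts how many '}' tokens are needed to balance the input.
    // (Naively ignores braces inside strings and comments.)
    static int MissingCloseBraces(string source) {
        int depth = 0;
        foreach (char ch in source) {
            if (ch == '{') depth++;
            else if (ch == '}') depth = Math.Max(0, depth - 1);
        }
        return depth;
    }

    static void Main() {
        string snippet = "class C { public string DoStuff(int j) { if (j > 0) {";
        // Prints 3: but should they close the 'if', the method, and the type,
        // or the 'if' and two blocks nested inside the method?
        Console.WriteLine(MissingCloseBraces(snippet));
    }
}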

Of course, this tolerance is not necessarily something a user always wants.  Take refactorings, for example.  A refactoring is supposed to allow you to make a change to your code that preserves the same compile time semantics across the refactoring.  Of course, we can't necessarily guarantee that the runtime semantics will be preserved, as any code that introspects your assembly (like reflection or bytecode analysis) will be affected in ways that we won't be able to detect.  So how is a refactoring supposed to deal with code that has errors in it?

Say we're renaming 'Foo' in the following example:

class C {
    public void Foo(int i) { }

    public void Bar() {
        Foo(4 //<-- should we rename this usage of it
    }
}

A simple answer to the question would be to say "it doesn't" or "it shouldn't", and that was something we definitely considered.  However, based on feedback we felt that it was important to support this scenario, because people would not be using refactorings purely to modify correct code but also to help them in the process of development, in whatever state their code was currently in.  We do warn strongly that this is a dangerous thing to do and that the refactoring might not affect everything you expect it to.

Ok, very long intro.  So the problem I was working on was, again, an ambiguous situation that can arise when you're parsing code, where a choice we were making was making life very difficult for the user.  If you've used the beta of VC#2k5 you'll notice a few new features.  One is type colorization.  When we can bind successfully to a type, we will colorize it for you according to your settings.  If we can't bind it but we can see a using that would make it available, we will offer to add the using for you, and once we are able to bind it you'll see it in the code definition window.  For example, if you had a file without usings you could type:

Console

At that moment we'd offer to add the "using System;" for you.  If you did that, then immediately Console would be colorized and you'd see all the information about that type in the code definition view.  You could even say "go to definition" and we'd bring up the source for that type in the main editor.  It's a pretty good model and gives you strong immediate feedback about the items you're currently editing.

However, you may have noticed that that model breaks down when it comes to generics.  i.e. if you type:

IList<Foo>

Then regardless of whether or not you have a "using System.Collections.Generic;" we won't colorize it.  If you don't have the using we won't offer it, and if you add it yourself go to definition won't work and you still won't see the item in the code definition window.  However, if you make some tiny edit like adding a dot:

IList<Foo>.

then suddenly things work.  Of course, you won't get the "add using" smart tag now that you're not on the type name anymore.  What's going on!?   You've been used to having all these feedback mechanisms to tell you when you're typing correct code, and you've come to expect that if you don't get that feedback then something is wrong.  But now we're not giving you the feedback, causing you to scratch your head in wonder and ask... "what am I doing wrong?"

As it turns out, the problem is that when you type IList<Foo> we cannot unambiguously parse that as a generic name, and so we treat it like the grammar says to treat it: as an arithmetic expression (i.e. IList less-than Foo greater-than...).   Aside: if you want a fun problem, consider trying to detect that something is generic as you're typing it so that you can show things like parameter help for generics, i.e. when you have "IDictionary<Foo, Bar".  Then consider expressions like:

Foo(a < b, c > d) vs Foo(a<b,c>(d))

it's quite fun :-)
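Here's a small sketch of why that's tricky (the declarations are invented for illustration): the exact same shape of tokens is two comparisons in one context and a generic invocation in another, and the deciding token is the one right after the closing '>':

using System;

class Demo {
    static void Foo(bool x, bool y) {
        Console.WriteLine("parsed as two comparisons");
    }

    static void Main() {
        int a = 1, b = 2, c = 3, d = 4;

        // Parsed as Foo((a < b), (c > d)): two relational expressions.
        Foo(a < b, c > d);

        // If 'a' were instead a generic method in scope, the characters
        // Foo(a<b, c>(d)) would parse "a<b, c>" as a generic name,
        // because the token immediately after '>' is '('.
    }
}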
Aside #2:  Another fun parsing disambiguation is the following: is this a cast expression or an arithmetic expression?
type t = (type)-foo;  //hint, consider operator overloading
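To see why the hint matters, here's a sketch (the struct is invented for illustration) where operator overloading makes both readings meaningful, so the choice between "type" and "value" actually changes what runs:

using System;

struct X {
    public int V;
    public X(int v) { V = v; }

    // Makes "x - foo" meaningful when x is a *value* of type X.
    public static X operator -(X a, int b) { return new X(a.V - b); }

    // Makes "(X)-foo" meaningful when X is a *type*: a cast of -foo.
    public static explicit operator X(int i) { return new X(i); }
}

class Demo {
    static void Main() {
        int foo = 5;
        X x = new X(10);

        X subtraction = x - foo;  // value context: overloaded binary '-'
        X cast = (X)-foo;         // type context: cast of the unary '-foo'
        Console.WriteLine(subtraction.V + " " + cast.V);  // prints "5 -5"
    }
}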

So, since we parse it as an arithmetic expression, we don't bind or colorize it as a type, and intellisense basically doesn't understand it.   In the end it's a fairly sucky experience.  The workflow that you've come to expect just falls apart while you're in the middle of typing generics.

Aside #3: They're showing Garfield on the flight that I'm currently on, and I can't figure out what's worse: having it on in the background, or actually having my eyes clawed out by real cats.

So can we do better?  The answer is, without question, "yes".  Unfortunately, like all decisions, it comes down to what's possible to do with the time we currently have left.  I think ideally I'd like to see a compiler that could deal with errors and ambiguities by producing many valid results, each weighted by how likely the compiler thinks it is (that is what we're doing right now, except that we're limiting the result to just one value).  It could even be a plug-in model, where different analyzers could be used to try to figure out what the best choice would be in very domain specific circumstances.  Consumers could then take these results and make more informed choices based on them.  In the field/local example we could choose to bind to the field, perhaps with the explanation that we were being tolerant and guessing that you meant it to be a field.  In the generics example we could use the semantic information available later to take that ambiguity and decide to treat it as generic or not accordingly.
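As a sketch of what that could look like (all hypothetical; nothing like this exists in the compiler today), consumers would get back a list of candidate readings instead of a single tree:

using System.Collections.Generic;

class ParseTree { /* nodes elided */ }

// One possible reading of ambiguous code, with how likely we think
// it is and a human-readable account of the guess we made.
class Interpretation {
    public ParseTree Tree;
    public double Likelihood;
    public string Explanation;  // e.g. "treated 'int i' as a field, not a local"
}

interface IAmbiguityAwareParser {
    // Ordered from most to least likely.  Today's parser effectively
    // returns only the first element of this list.
    IReadOnlyList<Interpretation> ParseAll(string source);
}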

However, I think that this model would be quite costly both to implement and, potentially, to execute.  The current parser is quite brisk, baby, and modifying it to do all sorts of backtracking and go down many parse paths all the time could be expensive both in terms of space (which rarely concerns me) and time (which really concerns me).  It would also mean finding all consumers of the compiler data and making sure that each one of them would handle this correctly.  Even adding a new type of parse tree node (like an ambiguity node) could be enormously costly, as it would mean that many people who manipulate or consume the parse tree might break because they were not expecting to see this type of node where they currently were.

So the attempt so far has been to handle this kind of ambiguity on a case by case basis (i.e. solve the particular generics problem) and to implement it in a manner that would not affect anyone currently using the data we generate.  The results have been promising.  There should be very little overhead to doing this ambiguity checking, and consumers need to opt in, stating that they want to get these results and that they will be responsible for handling parse trees which no longer follow the normal invariants we've always preserved (like "the spans of parse tree nodes do not overlap").

Aside #4: They're going to show Shrek 2 now.  Redemption has occurred.

This is an extremely interesting area that I think will be very important to address in future architectures.  Not just for dealing with these situations better, but also for making it clearer to the user how much we do/don't understand of their code.  Currently, with the class designer, you can get a high level view of what our understanding of the code is.  In the future I think it would be very interesting to extend that to all the code you write.