Only a few hours left (part 2)

One of the things we love doing around here is dicussing different design techniques for attacking problems in general and the work we do in specific.  One of the very common conversations we have is simply how to structure internal data within the C# compiler and language service to best facilitate things like maintainability, readability, performance, correctness, etc.  With that in mind, i thought i'd bring up a recent discussion we were having that related to that kind of talk.  We were specifically talking about token streams and parse trees, and how we'd like to see them represented.  To start, let's give a simple example of how a token could be represented:

     public enum TokenID {
        Class,
        Interface,
        Identifier,
        Comment,
        Whitespace,
        Public
        /* remaining enum values for the rest of the C# tokens */
    }

    public class Token {
        public TokenID  ID       { get { ... } }
        public Position Position { get { ... } }
        public string   Text     { get { ... } }
    }

A fairly simple representation.  In this case we're basically storing tokens as a discriminated union (with the ID as the discriminator).  One of the benefits of this approach is the ease of which a low overhead parser can be built.  For example, let's look at a hypothetical parser and how it would deal with such a structure:

     public class Parser {
        Token CurrentToken { get { ... } }

        ///This function handles parsing of classes/interfaces/enums/delegates
        void parseType() {
            parseModifiers();

            switch (CurrentToken.ID) {
                case TokenID.Class:
                    parseClass();
                    break;
                case TokenID.Interface:
                    parseInterface();
                    break;
                /* Further cases */

                default:
                    //Handle errors
                    break;
            }
        }
    }

The code above attempts to parse in modifiers (like public/internal) and then determines what to do next based on if it sees the "class" or "interface" tokens. 

This seems to work great, and there isn't a lot of good reason to change from that above style of code.  For a parser, which only uses tokens to determine it's parsing path, it's more than sufficient to have a simplified way for each token to easily affect flow control.  Now, once you go past parsing things get a little interesting.  For example, an IDE needs to examine the token stream intimately in order to do colorization, formatting, IntelliSense, handling incomplete generic declarations, etc.  You end up with lots of these switches all over the place, and in general your code can get very messy.  For example, say you want to have code that handles all accessibility tokens.  You then need to have a switch with fall-throughs defined for those tokens (yech).  Also, if you add a new token you need to make sure that you update all those switches appropriately.

Now, coming from what we know about refactorings, it seems like there's a very good alternative to this.  Specifically: Replace Type Code with Class.  This would work fine... except for one nigglign detail: in many cases you end up cluttering up an class with unecessary functionality.  For example, why should a token have any understanding of how colorization works?  That's a completely orthogonal idea that should reallly be kept seperate from tokens.  This keeps tokens simple, and simplicity is a good thing(tm).

So is there another alternative?  Wait and see :-)