More tokenizing...

08/11/2014

So I flippantly said 'write a helper function that captures the right pattern for tokenizing' last post... But when you sit down and think about it, a helper function feels like the opposite of what you would naturally do when implementing a finite state machine, because there is no way to make helper functions act as case labels in a switch block!

i.e. you can do this:

if ((t = MatchStringLiteral()) != null) return t;
if ((t = TryMatchSymbol("+=")) != null) return t;
if ((t = TryMatchSymbol("+")) != null) return t;

And that almost looks clean. (But not really clean: it seems like the whole 'if (t = blah) return t' pattern cannot be abstracted out further, in C# anyway.) But then, if you have ever programmed in C, your next thought is probably 'Would it perform well? I thought tokenizers were supposed to be implemented with lots of switch statements because those are fast?'
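One possible way to shorten that pattern, assuming every matcher returns null when it fails: C#'s null-coalescing operator ?? only evaluates its right-hand side when the left-hand side is null, so a chain of matchers behaves like the if ladder. The Token and Scanner classes below are hypothetical stand-ins for illustration, not the helpers from the previous post.

```csharp
using System;

// Hypothetical minimal token type, for illustration only.
class Token
{
    public string Text;
    public Token(string text) { Text = text; }
}

// Hypothetical scanner holding the input and the current position.
class Scanner
{
    private readonly string input;
    private int pos;

    public Scanner(string input) { this.input = input; }

    // Stand-in matcher: returns a token and advances on success, null otherwise.
    Token TryMatchSymbol(string symbol)
    {
        if (string.CompareOrdinal(input, pos, symbol, 0, symbol.Length) != 0) return null;
        pos += symbol.Length;
        return new Token(symbol);
    }

    // Each ?? operand is evaluated only if everything before it returned null.
    public Token Next()
    {
        return TryMatchSymbol("+=")
            ?? TryMatchSymbol("+")
            ?? TryMatchSymbol("-");
    }
}
```

Whether this counts as cleaner is a matter of taste; it still lists every matcher by hand, it just drops the repeated 'if ... return t'.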

Well, you can try to mix it up:

switch (peeknextchar())
{
    case '+':
        if ((t = TryMatchSymbol("+=")) != null) return t;
        if ((t = TryMatchSymbol("+")) != null) return t;
        break; // C# does not allow falling through into the next case
    case 'a': // ... through 'z'; actually I don't think you can do case 'a..z' in C# unfortunately
        if ((t = TryMatchIdent()) != null) return t;
        break;
    ....

}

...but really this is just getting less and less elegant. And it breaks the golden rule of optimization, which is to measure first. :p
Perhaps there should be a golden rule of abstraction too. Anyone know what it is? I don't think I've heard one yet. The saying that any problem can be solved by another layer of indirection comes to mind, but it's hardly prescriptive of an ideal.
If I were to invent one, perhaps it would be 'abstractions should be solutions to an expressiveness problem in your code'.

So anyway...
How can we write expressive C# code for tokenizing things? Not worrying about whether it is optimizable...
Well, regexes are an obvious way of specifying patterns to accept without writing loops etc... but they work less well for symbols, where you have to know which characters need escaping...
You might even be able to optimize... see? I can't stop worrying about optimizing. :)
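As a rough sketch of the regex route (the class and method names here are made up for illustration): the \G anchor pins a match to the start position passed to Match(), and Regex.Escape takes care of the escape characters for symbols, which softens the escaping problem a bit.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; not from the original post.
static class RegexMatchers
{
    // \G anchors the match at the start position given to Match(),
    // so the regex cannot skip ahead in the input.
    static readonly Regex Ident = new Regex(@"\G[A-Za-z_][A-Za-z0-9_]*");

    // Regex.Escape handles the metacharacters, so "+=" does not have to be
    // hand-written as @"\+=".
    static readonly Regex PlusEquals = new Regex(@"\G" + Regex.Escape("+="));

    // Returns the matched text at pos, or null if the pattern does not match there.
    static string TryMatch(Regex pattern, string input, int pos)
    {
        Match m = pattern.Match(input, pos);
        return m.Success ? m.Value : null;
    }

    public static string TryMatchIdent(string input, int pos)
    {
        return TryMatch(Ident, input, pos);
    }

    public static string TryMatchPlusEquals(string input, int pos)
    {
        return TryMatch(PlusEquals, input, pos);
    }
}
```

The identifier pattern is an assumption about what counts as an identifier; a real language would likely need something more careful.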

So anyway... these tokens I normally think of as outputs to the tokenizing problem. But maybe the token itself also describes how to solve the problem?

Imagine the increment-by token '+='. There can be an instance of the token at several points in the document. They could even all be the same singleton instance if you don't need to know their position in the document. The singleton could also know how to match itself against the next part of the untokenized input stream and say whether it is the next token or not.

There's a whole bunch of conceptual overloading going on here, saying one object knows how to perform many tasks. I feel like this overloading is sort of a natural thing to do in JavaScript, because it's prototype-based, and you can come up with ad-hoc solutions if you suddenly realize you need separate token instances with different text per instance for string literal tokens, or that you need to record the position of some tokens. You just start doing it, and de-singletonize your object graph as necessary to express the required variation.
You don't have to think up types and interfaces that plan the separation of roles and responsibilities in advance... that's kinda nice.

In C# OTOH I am like 'do I need to have both a Token type and a TokenGenerator type? And some subclass that implements both might be good, for the singleton-happy case?'
I.e.:

    interface IToken
    {
        string Text { get; set; }
    }

    interface ITokenGenerator
    {
        Func<TokenizingState, IToken> Match { get; set; }
    }

    // 'SymbolToken' or 'SingletonToken' - covers all simple symbolic tokens: + - * { } ( ) . , ; :
    // also +=, -=, *=, /=, =>
    class Stok : IToken, ITokenGenerator
    {
        public Stok(string tokText)
        {
            this.Text = tokText;
            this.Match = (s) => s.StartsWith(this.Text) ? this : null;
        }

        public string Text { get; set; }
        public Func<TokenizingState, IToken> Match { get; set; }
    }

Now I can model my tokenizer as just a collection of token generators. Hopefully... [Spot any obvious problems yet? :) ]
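To make that concrete, here is a minimal sketch of what 'a collection of token generators' might look like as a loop. The post never defines TokenizingState, so its shape below (input text, a position, and StartsWith) is an assumption, as are the Tokenizer class and the first-match-wins policy. The interfaces and Stok are repeated only so the block compiles on its own.

```csharp
using System;
using System.Collections.Generic;

// Assumed shape: the original post does not define TokenizingState.
class TokenizingState
{
    public readonly string Input;
    public int Position;
    public TokenizingState(string input) { Input = input; }
    public bool AtEnd { get { return Position >= Input.Length; } }
    public bool StartsWith(string text)
    {
        return string.CompareOrdinal(Input, Position, text, 0, text.Length) == 0;
    }
}

interface IToken { string Text { get; set; } }
interface ITokenGenerator { Func<TokenizingState, IToken> Match { get; set; } }

class Stok : IToken, ITokenGenerator
{
    public Stok(string tokText)
    {
        Text = tokText;
        Match = s => s.StartsWith(Text) ? this : null;
    }
    public string Text { get; set; }
    public Func<TokenizingState, IToken> Match { get; set; }
}

// Hypothetical driver: tries each generator in list order; the first hit wins.
static class Tokenizer
{
    public static List<IToken> Tokenize(string input, IList<ITokenGenerator> generators)
    {
        var state = new TokenizingState(input);
        var tokens = new List<IToken>();
        while (!state.AtEnd)
        {
            IToken tok = null;
            foreach (var g in generators)
            {
                tok = g.Match(state);
                if (tok != null) break;
            }
            if (tok == null)
                throw new FormatException("No token matches at position " + state.Position);
            tokens.Add(tok);
            state.Position += tok.Text.Length; // the loop, not the token, advances the cursor
        }
        return tokens;
    }
}
```

For example, Tokenizer.Tokenize("+=+", new ITokenGenerator[] { new Stok("+="), new Stok("+") }) yields the '+=' token followed by the '+' token. Since the first match wins in this sketch, the order of the list matters.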
