The lexer hack

Article
08/11/2014

I found recently I like to do the coding more slowly and intersperse more reading than was once the case before the internet was large. The internet is frickin huge now. Try finding what you are looking for. Anyway, I roughly found what I was looking for today in 'the lexer hack'

https://en.wikipedia.org/wiki/The\_lexer\_hack

So what I was wondering..? I have seen compilers where tokenizing isn't an up-front separate pipeline component. It just happens that instead tokenizing is a special case of parsing. It looks like this is what ANTLR likes to do, although I only had a cursory glance. This is kind of elegant because it keeps things simple. All you need to do is concentrate on writing your grammar. Then your tool magicallygenerates syntax trees, yay. :)

So - I wondered if given its elegance this technique is a better modern alternative to the old tokenizer/parser split. And whether it really helps you solve tricky tokenizing problems.

It turns out the latter is true. It can help you solve tricky tokenizing problems if you want your tokenizer to do clever things like tell you the difference between a typename and a variable name.
But it seems like that would only be one of the cases this helps. It can also just let you decide to design your language differently, i.e. do your tokens in a context sensitive way, and maybe in one context => obviously means lambda expression and in another context it obviously means type redirection or something (I made that one up...) and in the 3rd context it will mean the same as greater than or equals, same as >=.

But that's kind of silly. Why would you overload the 3 symbols in this way and need the tokenizer to tell you which meaning the symbols have?

So when would it actually be useful to be doing context sensitive tokenization? When you want to split up the syntax elements differently based on context. Example of that: you might like to let type-names include hyphens. This is potentially going to work fine in a contextual tokenizer because you will never see a typename in the same part of an expression where the - operator is being used for subtraction - a type name would be invalid in this context (a+b-c) so it's obviously actually two variables names with a minus in between, not a reference to the b-c type.

A subtle distinction, and one which has me say to myself "Who would possibly want their compiler to be so lawyer-picky as this? That's too hard to imagine a dumb computer doing."

So anyway. The outcome of all this reading in my mind is that separate tokenizers and parsers is still a perfectly valid design pattern for modern compiler writing. Because it is hopefully going to be OK for your tokenizer to be pretty dumb.

The lexer hack

Additional resources