Identifying tokens: Is word-breaking so easy?

What is a word? It’s a question we linguists have to answer whenever we develop spell-checkers or grammar checkers, perform automatic dictionary look-up, or try to interpret (and expand) queries for a search engine, among other tasks. I recently wrote a paper to show that word-breaking and tokenization are not as easy as…
