The great thing about regular expression engines is that there are so many to choose from


Back in the days before Perl ruled the earth, regular expressions were one of those weird niche features, one of those things that everybody reimplements when they need it. If you look at the old Unix tools, you'll see that even then, there were three different regular expression engines with different syntax. You had grep, egrep, and vi. Probably more.

The grep regular expression language supported character classes, the dot wildcard, the asterisk operator, the start and end anchors, and grouping. No plus operator, no question mark, no alternation, no repetition counts. The egrep program added support for plus, question mark, and alternation. Meanwhile, somebody went back and added repetition counts to grep but didn't add them to vi; somebody else added the \< and \> metacharacters to vi but didn't add them to sed. POSIX added repetition counts to awk but changed the notation from \{n,m\} to {n,m}. And so on.
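
The \{n,m\} versus {n,m} split is an easy place to get bitten. As a rough illustration, here is how one engine treats the two spellings. Python's re module is used only because it is convenient; it is not one of the tools discussed above, and the pattern names are made up for this sketch. In this engine the unescaped braces are the repetition operator and the backslashed braces are literal characters, which is the reverse of the old grep notation.

```python
import re

# Python's re treats unescaped braces as a repetition count,
# like the {n,m} notation mentioned above.
extended_style = re.compile(r"ab{2,3}c")

# The backslashed form, which older grep-style syntax used for repetition,
# means literal brace characters to Python's re.
basic_style = re.compile(r"ab\{2,3\}c")

def matches(pattern, text):
    """Return True if the whole text matches the compiled pattern."""
    return pattern.fullmatch(text) is not None

print(matches(extended_style, "abbc"))      # repetition: two b's
print(matches(extended_style, "ab{2,3}c"))  # literal braces in the text
print(matches(basic_style, "ab{2,3}c"))     # literal braces in the text
print(matches(basic_style, "abbc"))         # no literal braces in the text
```

The same pattern text means two different things depending on which flavor reads it, and neither engine reports an error.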

No two programs use the same regular expression language, but they overlap sufficiently that you can often get by with the common subset and not have to worry about which particular flavor you're up against.

Until you wander into the places where they differ.

From: John Jones
Subject: Problem with regular expression

I'm trying to write a regular expression to match blah blah blah.

From: Jane Smith
Subject: RE: Problem with regular expression

I think this will match what you want: ^Z@1&*B*!34

I just ran my hand randomly over the keyboard to generate that fake regular expression. The scary thing is, at first glance, it is not obviously not a regular expression!

From: Chris Brown
Subject: RE: Problem with regular expression

Try $)(#$C)*#

From: John Jones
Subject: RE: Problem with regular expression

Thanks, everybody, for your suggestions, but I can't get any of them to work. For example, I can't get any of them to match against this string: blah blah blah blah.

At this point, people chimed in with other suggestions, confirming that John doubled the backslashes, that sort of thing. John posted his test program, and then the reason was obvious.

From: Jane Smith
Subject: RE: Problem with regular expression

Oh, you're using CAtlRegExp. In that class, \w doesn't match a single character; it matches an entire word. You want to use \a instead.
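
For contrast, here is what \w does in a Perl-style engine. This sketch uses Python's re module as a stand-in, which is an assumption made for illustration; it says nothing about how CAtlRegExp is implemented. In Python, \w matches one word character, and it takes \w+ to get the whole-word behavior that the reply above attributes to CAtlRegExp's \w.

```python
import re

text = "blah blah"

# In Python's re, \w matches exactly one word character at a time.
single_chars = re.findall(r"\w", text)

# Adding + gives whole words, which is closer to what the reply
# above says \w means in CAtlRegExp.
whole_words = re.findall(r"\w+", text)

print(single_chars)
print(whole_words)
```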

Comments (39)
  1. Karellen says:

    ObQuote: Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems.

    — Jamie Zawinski <http://www.jwz.org/hacks/marginal.html>

  2. Karellen says:

    Nooooo….. "Using <> angle brackets around each URI is especially recommended as a delimiting style for a reference that contains embedded whitespace."

    — RFC 3986, Appendix C

    Blog software comment automated URL extraction FAIL.

  3. KT says:

    I don’t understand, why would you purposely give someone a fake, invalid regular expression for the problem they are trying to solve? Isn’t it easier and less jerky to just not respond? I must be missing something here.

  4. Adam V says:

    > Blog software comment automated URL extraction FAIL.

    Ah, it’s using an outdated RegEx. Email the blog software maintainers and tell them to change it to $*G#(:@}6)V^%&]1[; which should fix it.

  5. Adam V says:

    @KT: Raymond’s posting these emails from memory. The real emails had (slightly more) correct RegExs, he just didn’t want to dig through his email to get the RegEx when he was going to make up the rest of the (fake) email.

  6. Leo Davidson says:

    Yeah, it’s a pain that there’s no standard regexp syntax. :( Probably too late to fix that, too.

    I’ve tried to explain this to people who get quite angry that $program doesn’t support $syntax when they think that particular syntax is The One True RegExp and it’s a moral outrage if anything is different.

    I guess some programs have an option for which regexp syntax you want to use but by doing that… you’ve probably got three problems instead of just two. :)

  7. ton says:

    There really does need to be a standard across all software systems for Regular expression support. The Visual Studio "Find" dialog and the .NET Framework alone use two different flavors of RE support, never mind the differences among the *nix tools. It’s just maddening. I wouldn’t be at all surprised to find the same kinds of inconsistencies between the CAtlRegExp class and Visual Studio’s "Find" as well.

  8. Neil (SM) says:

    @KT: First of all, Raymond was not quoting himself in that posting. Second, the real posting had a real regex.

  9. J says:

    And then sometimes you write your regex unquoted on the command line so you have to add in the syntax for that particular shell’s escape characters, which of course varies from shell to shell.

  10. costive expressionist says:

    clearly what we need is a way of selecting and perhaps even tweaking the regular expression engine at run time:

    /usr/bin/kludge --regexp_style=greg

    or

    /usr/bin/kludge --regexp_style=perl --regexp_exclusions=boundary --nongreedy=off

    I’ll be monitoring SourceForge for this.

  11. costive expressionist says:

    Oh yes, and the minute that the configurable regexp feature is available and semi-usable in the first open source implementation, I will be sure to come back to MSDN and abuse Microsoft for not supporting it!

  12. Daniel Colascione says:

    God, I can’t stand that we have separate, yet similar regular expression languages for almost every application. There’s grep, egrep, Javascript, Perl/PCRE, Python, Emacs, and that’s just in the Free Software world.

    Personally, I’d encourage everyone to just use the PCRE library. It’s feature-rich, and because it’s BSD licensed, you can even use it in commercial software.

  13. Mark says:

    I think the problem, of multiple varying syntaxes for regex, is due to practical concerns. Most of the regex operators aren’t strictly necessary; they are just shorthand.

    For example with an alphabet of ‘ab’ and using e for ε (epsilon (just in case ε doesn’t render)).

    [ab] = (a + b)

    a+ = aa*

    a? = (a + e)

    much of the shorthand is designed to reduce typing while others (?) are introduced because of the lack of an epsilon character.

    the classic mathematical syntax works well for problems encountered in computation theory since the alphabets are usually very limited. In my theory of computing course, most problems only had an alphabet consisting of just ‘ab’.

    The syntaxes used in the practical world have much larger alphabets, which include the characters used as operators. Just imagine how long that IPv4 regex would be if the character class shorthand wasn’t used: every instance of \d would have to be replaced with [0123456789] (which is really short for (0 + 1 + … + 9))

    The biggest problem with having multiple syntaxes, is that if i want/need to use regex, i have to go look up the syntax every single time. (i actually printed out a cheat sheet for the visual studio search regex).
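
The shorthand equivalences listed in the comment above can be spot-checked by brute force. This is only a sketch in Python's re dialect, which spells alternation as | rather than + and has no epsilon symbol, so an empty alternative stands in for it. The helper names all_strings and agree are invented for this example, and checking strings up to a fixed length is evidence, not a proof.

```python
import re
from itertools import product

# Pairs of (shorthand, longhand) patterns over the alphabet {a, b}.
# Python spells alternation as | and epsilon as an empty alternative.
EQUIVALENTS = [
    (r"[ab]", r"(?:a|b)"),
    (r"a+", r"aa*"),
    (r"a?", r"(?:a|)"),
]

def all_strings(alphabet, max_len):
    """Yield every string over the alphabet up to max_len characters."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def agree(p, q, max_len=4):
    """Check that p and q accept exactly the same strings up to max_len."""
    return all(
        (re.fullmatch(p, s) is None) == (re.fullmatch(q, s) is None)
        for s in all_strings("ab", max_len)
    )

for short, long_form in EQUIVALENTS:
    print(short, long_form, agree(short, long_form))
```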

  14. mark says:

    In response to the "you now have two problems" remarks, there are many things that regex can’t do; matching balanced parentheses is the classic example.

  15. Mark Sowul says:

    Fortunately, .NET’s regular expressions library has a neat trick that does allow you to match balanced parentheses

    http://blogs.msdn.com/bclteam/archive/2005/03/15/396452.aspx

  16. Trevel says:

    Personally, I love the idea that Raymond spends his nights posting bad regexs to mailing lists under the pseudonym of Jane Smith. He’d be like a super hero, only more nerdy and less useful.

  17. steveg says:

    RegEx… get it right straight away, because generally you can only modify a complex regex for approximately 30 seconds after you finish writing it before you forget how it works. Made worse if you’re a tool whore (uh… sorry) and are mixing RegEx dialects.

    .Net has very good RegEx, but why, oh why did they invent new syntax? They should have used Vim’s. :-)

  18. Mike says:

    I love regex’s. But to me it is the great Unknown.  I bought O’Reilly’s Regular Expression Pocket Reference thinking it would be of some help learning them, then I found that only pages 1-16 were about regex’s per se, including a two-page cookbook that was easily the most useful part of the book.  The rest of the book, pages 16-114, delved into the different implementations (PHP, .NET, Apache Web Server, vi, Unix shell tools, Python, Java, and so on) and their little quirky differences.  Necessary, I suppose, but Yikes!

  19. hexatron says:

    You omitted the ancestral re of ed, the unix editor of 1974. Looking at the source for compile(), we find, in those days, that a regular expression understood:

    ^ – beginning of line

    * – match last character

    . – match any character

    \( – start of a bracketed expression

    \) – end of a bracketed expression

    \t \b – tab, bell

    $ – end of line

    [ad-g] – match a single character of adefg (&etc)

    [^ad-g] – match a single character not in adefg

    and that’s about it. (/g to apply more than once)

  20. porter says:

    > There really does need to be a standard across all software systems for Regular expression support.

    Is that before or after "embrace and extend"?

  21. Jeff says:

    @Mark Sowul:

    > Fortunately, .NET’s regular expressions library has a neat trick that does allow you to match balanced parentheses

    That’s nice, but it’s not done with regular expressions (which the first paragraph of the article implies).

  22. Worf says:

    @hexatron: You omitted the ancestral re of ed, the unix editor of 1974. Looking at the source for compile(), we find, in those days, that a regular expression understood:

    Actually, Raymond did get it. He called it by its more modern descendant, vi. (vi is just a mode of ed, after all).

    Though, after all this, there are few regex engines around. Sure perl/pcre, sed, awk, ed/vi, grep/etc., all have different syntax, but the internal engines all fall into a few flavors. This is important, because certain regexes can cause the engine to hang, while others can get it to be really slow.

    To be honest, though, I’ve never seen a terribly complex regex unless it was there to illustrate a point (usually about how regex isn’t always the best tool). The reason is probably that, with the explosion of scripting languages, it’s easier to break a regex into small pieces and use the programmatic ability of perl/python/ruby/etc. to aid in matching the whole. The ability to implement state machines makes it less frustrating overall.

    Try matching an IP address, for example. Far easier to do in code than regex. Even a naive implementation that still matches all possible ways of writing an IPv4 address is simpler. And, you get comments.
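
To make the "easier in code" point concrete, here is a minimal sketch in Python. The function name is_ipv4 is invented for this example, and it handles only the common dotted-decimal form, not the less common ways of writing an IPv4 address that the comment above alludes to. It also accepts leading zeros such as 01, which a stricter checker might reject.

```python
def is_ipv4(text):
    """Return True if text is a dotted-decimal IPv4 address."""
    parts = text.split(".")
    if len(parts) != 4:
        return False
    for part in parts:
        # Each piece must be one to three ASCII digits.
        if not (1 <= len(part) <= 3 and part.isascii() and part.isdigit()):
            return False
        # The numeric range check is the part that gets ugly in a regex.
        if int(part) > 255:
            return False
    return True

print(is_ipv4("192.168.0.1"))
print(is_ipv4("256.1.1.1"))
```

The range check is a single comparison here, whereas a regex has to enumerate the digit patterns for 0 through 255.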

  23. Drak says:

    @Jeff:

    Generally this is not possible with regular expression, that language just is not descriptive enough to handle this …

    However in .Net this is actually possible with something called Balancing Group Definition. This construct generally looks like (?<name1-name2>).

    How is this then not Regular Expression?

    [Um, because the language it describes isn’t regular? -Raymond]
  24. Mantas says:

    Sometimes I wish all programs – grep, sed, vi, Notepad2 – got modified to use the same (standard) regex library. PCRE maybe. But backward-compatibility isn’t just a Windows thing, doing that would break so many shell scripts (and even a few Windows batch scripts) that everyone depends on.

  25. mark says:

    @drak:

    the long answer is that regular expressions define the regular languages, and can describe exactly what a deterministic finite automaton (DFA) can accept. Matching brackets requires a pushdown automaton (PDA), which can accept the languages described by context-free grammars.

    http://en.wikipedia.org/wiki/Deterministic_finite-state_machine

    http://en.wikipedia.org/wiki/Pushdown_automaton
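
The extra power of the pushdown automaton comes from its stack. For a single kind of bracket, the stack reduces to a counter that can grow without bound, which is exactly what a finite automaton cannot keep. A minimal sketch in Python, with the function name is_balanced chosen for this example:

```python
def is_balanced(text):
    """Return True if every ( in text has a matching ) in the right order."""
    depth = 0  # stands in for the pushdown automaton's stack height
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A closing parenthesis arrived with nothing open.
                return False
    return depth == 0

print(is_balanced("(a(b)c)"))
print(is_balanced("(()"))
```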

  26. Drak says:

    If .Net says ‘This is part of our Regular Expression Library’ then I’d assume they mean ‘This function is part of our definition of "Regular Expression"’, be they regular or not.

  27. Yes, it is great to discover that Apache’s regexp syntax is *almost* the same as the one Perl is using – except that it doesn’t support the \d character class. Makes for a fun debugging session.

  28. Neil says:

    So that original grep “regular” expression language wasn’t a true regular language because there were some constructs that it couldn’t express because it lacked alternation?

    [It’s still a regular language. Not a maximal regular language, but nobody claimed that it was. A language that consists only of the null string is still regular. Not very useful, mind you. -Raymond]
  29. My pet peeve with regexs is the lack of consistency.

    Pick a character. Say you match that specific character, do you put a \ on it or not?

    To match ‘d’, use ‘d’.

    To match ‘.’, use ‘\.’.

    To match ‘(’, use… uhh… Let me look it up…

  30. Not only is it too late to harmonise the existing implementations, it’s also too late to even bother to make a token gesture of publishing an ISO standard, because so many other widely used standards already incorporate a version (or two) of regex.

    The POSIX standard defines two levels of support (basic and extended). ECMAScript includes a Perl-ish flavour. The C++0x standard will include ‘regex’, which most recently is proposed to default to ECMAScript syntax but allow optional alternatives. Unicode Technical Standard 18 defines "Unicode Regular Expressions", which are used by XML Schema.

  31. J says:

    It’s the same thing I love about Java. Your "write once run anywhere" code will behave differently when run under different app servers.

  32. peterchen says:

    Trevel, you have implanted a strange image in my mind.

  33. Mihai says:

    "I just ran my hand randomly over the keyboard to generate that fake regular expression. The scary thing is, at first glance, it is not obviously not a regular expression!"

    That is by design :-)

    "Larry Wall falls asleep and hits Larry Wall’s forehead on the keyboard. Upon waking Larry Wall decides that the string of characters on Larry Wall’s monitor isn’t random but an example program in a programming language that God wants His prophet, Larry Wall, to design. Perl is born."

  34. David says:

    I like how Visual Studio 2005’s search and replace doesn’t use the .NET regex flavor. Nothing like having to keep 2 flavors in your memory space at once while using the same tool

  35. Pick a character. Say you match that specific character, do you put a \ on it or not?

    The rule of thumb (for Unix-y regexes, anyway) is, use a \ if-and-only-if the character is punctuation.

    This rule will occasionally lead you to use a backslash unnecessarily, but that will never break anything.  For example, \_ is the same as _ by itself.

  36. ^Z@1&*B*!34 is a perfectly good regular expression.  It matches anything that:

    Starts with Z@1

    Followed by zero or more &s

    Followed by zero or more Bs

    Followed by !34

    For example:

    Z@1!34

    Z@1&&&!34sao8f

    Z@1&BB!34sao8f

    $)(#$C)*# is not a regular expression though, because of the unbalanced parentheses…

    … unless this is Perl, where $) is a short name for $EFFECTIVE_GROUP_ID.

    This will match different things depending on whether /x is in effect (in which case it’s not a regular expression because the # is a comment which wipes out the rest of the line, so there are unbalanced parentheses after all) and whether there’s a $C variable in scope.

    $C in scope: this will match anything that contains:

    The effective group ID of the process

    Zero or more occurrences of # plus the value of $C

    A final #

    For example, if the effective group ID is 123 and $C is "foo", this will match

    blahblah123#blahblah

    blahblah123#foo#foo#foo#blahblah

    No $C in scope: this will match anything that contains:

    The effective group ID of the process

    Zero or more occurrences of # plus end-of-line plus C

    A final #

    Since the end of line comes before \n, and C is not \n, $C will never match – so the "zero or more" will only ever match zero times…

    So this matches things like

    blahblahblah123#blahblahblah
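
The reading of the first pattern in the comment above is easy to confirm in an engine where @, & and ! are ordinary characters. The sketch below uses Python's re module, which is an assumption about flavor; the last sample string is an extra one added here to show a non-match.

```python
import re

# The "random" pattern from the article, taken literally.
pattern = re.compile(r"^Z@1&*B*!34")

samples = ["Z@1!34", "Z@1&&&!34sao8f", "Z@1&BB!34sao8f", "Z@1B&!34"]
results = {s: bool(pattern.match(s)) for s in samples}

for s, matched in results.items():
    print(s, matched)
```

The last sample fails because the pattern requires any & characters to come before the B characters.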

  37. !(regex) says:

    Is there any better expression matching language out there?

  38. Michiel says:

    (E)BNF probably qualifies. It can also describe non-regular languages, e.g. a rule like A ::= ε | ‘(‘ A ‘)’ matches paired ().

  39. pne says:

    @Worf: vi is a mode of *ex*, not of ed.
