Unexpected effects of the RegEx SET operator [Greg]

Regular expressions (RegEx) are a powerful tool for finding text that matches specific patterns, but they are also a complex tool that requires care and attention to detail. There are many caveats to using RegEx. Ron has recently published a few excellent blog posts on the .NET RegEx engine [part 1 | part 2 | part 3]. Here, I would like to briefly discuss one common caveat: the set operator [] and its effect on the tokens it contains:

A RegEx expression can change its meaning when surrounded by the set operator [ ].

Programmers are used to thinking in terms of modular expressions. Typically, they assume that an expression does not change its value regardless of where it is placed. Assume the expression "xyz" evaluates to something specific when it stands by itself:

xyz

We are used to assuming that it will evaluate to the same thing when used to provide a value to a function:

function(xyz)

or as a component in a larger expression:

(xyz) + 11

However, RegEx defies this intuition. Consider the RegEx set operator []. It matches any single character that belongs to the set described in the square brackets. For instance:

  • the pattern "[a-d]" matches any one character between 'a' and 'd', i.e. one of 'a', 'b', 'c' or 'd'.
  • the pattern "[ogbp]" matches any one of the specified characters 'o', 'g', 'b' or 'p'.
  • the pattern "[^a-d]" matches any one character that is not in the specified range; i.e. it matches any character except 'a', 'b', 'c' or 'd'.

Note that in each case, the complete set pattern matches against one character in the input text.
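
The three set patterns above can be tried out with a short sketch. It is written in Python rather than .NET, on the assumption that Python's `re` module treats these basic set constructs the same way; the sample input strings are made up for illustration.

```python
import re

# Each match is exactly one character long, no matter how many
# characters are listed inside the square brackets.
in_range = re.findall(r"[a-d]", "abcdef")    # characters 'a' through 'd'
listed = re.findall(r"[ogbp]", "big pop")    # any of 'o', 'g', 'b', 'p'
negated = re.findall(r"[^a-d]", "abcdef")    # anything except 'a' through 'd'

print(in_range)  # ['a', 'b', 'c', 'd']
print(listed)    # ['b', 'g', 'p', 'o', 'p']
print(negated)   # ['e', 'f']
```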

Recall another RegEx construct: the grouped expression ( ). It matches against the pattern contained in the parentheses. Grouped expressions are also used to find specific areas within a larger match. For instance, the RegEx pattern "def(ghi)jkl(m)" matches against the input string 'abcdefghijklmn' as follows:

  • Overall match is 'defghijklm': 'abcdefghijklmn'.
  • 1st expression match is 'ghi': 'abcdefghijklmn'.
  • 2nd expression match is 'm': 'abcdefghijklmn'.

A program can refer to such expressions by index (see here and here for more information about this).
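
As a rough illustration of referring to grouped expressions by index, here is the same example in Python (an assumption on my part; the article itself targets .NET, where the equivalent is the Groups collection of a match). Group 0 is the overall match, and groups 1 and 2 are the parenthesised sub-expressions in order.

```python
import re

match = re.search(r"def(ghi)jkl(m)", "abcdefghijklmn")

overall = match.group(0)  # the whole match
first = match.group(1)    # first parenthesised expression
second = match.group(2)   # second parenthesised expression

print(overall, match.span(0))  # defghijklm (3, 13)
print(first, match.span(1))    # ghi (6, 9)
print(second, match.span(2))   # m (12, 13)
```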

Expressions can be combined by alternation constructs in the same way as characters can. For instance, the pattern "a|b|c" matches one of the characters 'a', 'b' or 'c', i.e. the same as "[abc]". The pattern "(ay)|(bee)|(see)" matches one of the strings 'ay', 'bee' or 'see'. This may suggest that the latter pattern matches the same as "[(ay)(bee)(see)]". Right?

Wrong!

Let us try to construct a RegEx pattern to match all English articles. In other words, we need to match any string from the set that contains the strings 'the', 'a' and 'an' (we ignore capital letters for simplicity). We will use the following example to track our success:

When I was at the zoo I saw a monkey and an elephant.

An obvious candidate solution would be the pattern "\b[(the)(a)(an)]\b". (Note: the sub-pattern "\b" is an anchor that matches a word boundary; it makes sure that we do not match the 'an' in 'elephant'.) But if we try it out, we see that it does not work. The sub-pattern "[(the)(a)(an)]" matches the following highlighted substrings:

When I was at the zoo I saw a monkey and an elephant.

The complete pattern "\b[(the)(a)(an)]\b" only matches the 'a' in front of 'monkey':

When I was at the zoo I saw a monkey and an elephant.

Possibly, we forgot the alternation construct and the solution is "\s[(the)|(a)|(an)]\s"? Wrong again. Try it for yourself.
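
For readers who want to try the failed attempts, here is a small sketch. It uses Python's `re` module instead of .NET, on the assumption that sets, "\b" and "\s" behave the same way in both for these patterns; the variable names are purely illustrative.

```python
import re

SENTENCE = "When I was at the zoo I saw a monkey and an elephant."

# The bare set matches single characters only, never whole words.
set_only = re.findall(r"[(the)(a)(an)]", SENTENCE)

# With word boundaries on both sides, only a one-letter word can match.
with_boundaries = [(m.start(), m.group())
                   for m in re.finditer(r"\b[(the)(a)(an)]\b", SENTENCE)]

# Adding '|' inside the brackets just adds '|' to the set of characters.
with_pipes = re.findall(r"\s[(the)|(a)|(an)]\s", SENTENCE)

print(len(set_only), sorted(set(set_only)))  # 23 ['a', 'e', 'h', 'n', 't']
print(with_boundaries)                       # [(28, 'a')]
print(with_pipes)                            # [' a ']
```

None of the three attempts ever returns 'the' or 'an' as a match.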

The reason is that the set construct changes the meaning of the contained sub-expression (or sub-expressions), including the special characters. It turns them into exactly that: a set of characters. So, our assumption that the pattern "[(the)(a)(an)]" would match against the string 'the' or the string 'a' or the string 'an' was incorrect. Instead, the pattern matches against any one character from the set that contains '(', 't', 'h', 'e', ')', '(', 'a', ')', '(', 'a', 'n' and ')'. From basic set theory we remember that the order and multiplicity of set elements do not matter, so we are really matching against a set consisting of the following characters: 'a', 'e', 'h', 'n', 't', '(', ')'.
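
The set-theory argument can be checked directly. The sketch below (Python, assumed to agree with .NET on this point; the test string is made up) compares the characters written inside the naive pattern with the seven distinct characters, and confirms that both sets match identically.

```python
import re

# Characters written between the brackets of the naive pattern,
# duplicates included, versus the distinct characters.
naive_chars = set("(the)(a)(an)")
distinct_chars = set("aehnt()")
same_set = naive_chars == distinct_chars

# Hypothetical test input containing letters and parentheses.
text = "(an) apple, the hat"
naive_matches = re.findall(r"[(the)(a)(an)]", text)
reduced_matches = re.findall(r"[aehnt()]", text)

print(same_set)                          # True
print(naive_matches == reduced_matches)  # True
```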

The right RegEx pattern to find all English articles is this: "\b((the)|(an)|(a))\b".
A test verifies this:

When I was at the zoo I saw a monkey and an elephant.
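
The same test can be sketched in code. This version uses Python's `re` module (an assumption; the article targets .NET, but grouping, alternation and "\b" work the same way here), and the names `ARTICLE` and `articles` are illustrative.

```python
import re

SENTENCE = "When I was at the zoo I saw a monkey and an elephant."

# Grouping plus alternation matches whole strings, not single characters.
ARTICLE = re.compile(r"\b((the)|(an)|(a))\b")

# finditer is used because findall would return one tuple of groups per match.
articles = [(m.start(), m.group(0)) for m in ARTICLE.finditer(SENTENCE)]

print(articles)  # [(14, 'the'), (28, 'a'), (41, 'an')]
```

The 'a' in 'at', 'was', 'saw' and 'and', and the 'an' in 'elephant', are all rejected by the word-boundary anchors.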

Another special character that is unexpectedly changed by the set operator is the any-character token ".". By itself it matches against any one input character, but in a set it becomes a literal and the pattern "[.]" matches against a dot.
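
A quick way to see the difference, sketched in Python on the assumption that it handles the dot the same way as .NET does here (the input "a.b" is made up):

```python
import re

text = "a.b"

any_char = re.findall(r".", text)       # '.' outside a set: any character
literal_dot = re.findall(r"[.]", text)  # '.' inside a set: a literal dot
escaped_dot = re.findall(r"\.", text)   # the escaped form is equivalent

print(any_char)     # ['a', '.', 'b']
print(literal_dot)  # ['.']
print(escaped_dot)  # ['.']
```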

To check whether you have understood this article, try to figure out the meaning of the following RegEx pattern. But no cheating! Think about it carefully first, before looking at the answer below.

((abc)|123|[xyz]|Q.[.])

And the answer is:

“ Select a text fragment that matches any one of the four following alternatives:

  • A numbered capture group that matches the string 'abc';
  • The string '123';
  • Any one of the characters in the three-character set that contains 'x', 'y' and 'z';
  • A string that starts with the character 'Q' followed by any one character followed by a full stop. ”
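
If you would rather check the answer than take my word for it, here is a small sketch in Python (assumed to behave like .NET for this pattern). The sample strings are made up, and `fullmatch` is used so that each sample must match in its entirety.

```python
import re

QUIZ = re.compile(r"((abc)|123|[xyz]|Q.[.])")

# Hypothetical sample inputs: the first five fit one of the four
# alternatives, the last three do not.
samples = ["abc", "123", "y", "Qa.", "Q..", "Qab", "(", "."]
results = {s: QUIZ.fullmatch(s) is not None for s in samples}

for sample, matched in results.items():
    print(repr(sample), matched)

# Only the first alternative has its own inner capture group.
inner_for_abc = QUIZ.fullmatch("abc").group(2)  # 'abc'
inner_for_123 = QUIZ.fullmatch("123").group(2)  # None
```

Note that 'Qab' fails because the final "[.]" demands a literal dot, while 'Q..' succeeds because the unbracketed "." accepts any character, including a dot.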

Did you get it right?
Yes? Well done, you are a step closer to being a proficient RegEx developer.
No? No worries. I must admit that when I first faced the above English article example, it took me quite a while to figure it out too. Keep practicing and check out some of the following material: