Are you greedy or lazy?

I wrote a "short" explanation of how greedy and non-greedy (aka lazy) work with regular expressions for a mailing list, but it turned into a couple of pages, so I thought I'd record it here.

******

The default regex behavior of something like ".*" is to match as many characters as possible. So, if you write something like:

>.*<

And match against a string with it, the engine will give the *longest* string it can for ".*" that still results in a match. Or, in other words, the "<" will match against the last "<" in the string.

This is pretty much backwards to the way most people expect things to work, but AFAIK came from the fact that if you have only one kind of behavior, it had better be greedy, since there are things you can't do if lazy is the only behavior you have.

After some time, lazy got added, so if you write:

>.*?<

It means "when you match, match the shortest possible string for .*".

The type of match you choose can have a big effect on the perf you get. An example is in order.

Assume we write:

>.*<

Which is a perfectly reasonable regular expression. When the engine goes to execute it, it goes from left to right. Assume this is the string we're matching against:

<start>abcd</start>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

So, the engine starts out by matching the ">". That's pretty easy to do, it just advances character by character, until it gets to the first ">".

It then starts to match the .*, which (of course) means that it matches as much as possible. In this case, it matches:

abcd</start>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

(as many characters as possible, right?) It will then try to match <, but it can't where it is, so it has to backtrack. Essentially it tells that the part that matched .* that the match fails and the length that .* matches is reduced by one. So, it changes the match to:

abcd</start>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

And we try to match < again, and we fail again. We continue backtracking until we get far enough back so that .* has only matched:

abcd

At which point the match for < succeeds, and we can look at whatever is after < in the pattern. There isn't anything, so the whole match succeeds. The regex works, but it can take a *long* time. If you have more than one use of .* in the expression and the strings are long, you can get regexes that would take hours or even days to run.

You get around this by either using non-greedy ".*?" or a negated character class "[^<].". Non-greedy is slightly more efficient, more flexible, and easier to read, so for me it's the pretty clear choice.

So, for short strings, non-greedy is better. In this case, it will fail on "", "a", "ab", and "abc" before succeeding on "abcd"

The converse is true for long strings - if what you're matching is near the end of the string and you use non-greedy, you'll need to keep backtracking to lengthen the string by one character each time, so greedy will be the better choice.