Are you greedy or lazy?


I wrote a “short” explanation of how greedy and non-greedy (aka lazy) work with regular expressions for a mailing list, but it turned into a couple of pages, so I thought I’d record it here.


******


The default regex behavior of something like “.*” is to match as many characters as possible. So, if you write something like:


>.*<


And match against a string with it, the engine will give the *longest* string it can for “.*” that still results in a match. Or, in other words, the “<” will match against the last “<” in the string.


This is pretty much backwards to the way most people expect things to work, but AFAIK came from the fact that if you have only one kind of behavior, it had better be greedy, since there are things you can’t do if lazy is the only behavior you have.


After some time, lazy got added, so if you write:


>.*?<


It means “when you match, match the shortest possible string for .*”.


The type of match you choose can have a big effect on the perf you get. An example is in order.


Assume we write:


>.*<


Which is a perfectly reasonable regular expression. When the engine goes to execute it, it goes from left to right. Assume this is the string we’re matching against:


<start>abcd</start>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx


So, the engine starts out by matching the “>”. That’s pretty easy to do, it just advances character by character, until it gets to the first “>”.


It then starts to match the .*, which (of course) means that it matches as much as possible. In this case, it matches:


abcd</start>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx


(as many characters as possible, right?) It will then try to match <, but it can’t where it is, so it has to backtrack. Essentially it tells that the part that matched .* that the match fails and the length that .* matches is reduced by one. So, it changes the match to:


abcd</start>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx


And we try to match < again, and we fail again. We continue backtracking until we get far enough back so that .* has only matched:


abcd


At which point the match for < succeeds, and we can look at whatever is after < in the pattern. There isn’t anything, so the whole match succeeds. The regex works, but it can take a *long* time. If you have more than one use of .* in the expression and the strings are long, you can get regexes that would take hours or even days to run.


You get around this by either using non-greedy “.*?” or a negated character class “[^<].”. Non-greedy is slightly more efficient, more flexible, and easier to read, so for me it’s the pretty clear choice.


So, for short strings, non-greedy is better. In this case, it will fail on “”, “a”, “ab”, and “abc” before succeeding on “abcd”


The converse is true for long strings – if what you’re matching is near the end of the string and you use non-greedy, you’ll need to keep backtracking to lengthen the string by one character each time, so greedy will be the better choice.


Comments (3)

  1. David Grant says:

    Good post… coincidentally my head has been swimming in these concepts for the last few days.

    aaaahhhhhhhh

  2. Honestly, I’m really glad you posted this. It makes me want to read a book on regexes.

  3. kenan says:

    Hi there,

    Could you help me to understand regex engine behavior?

    Regex is : [^;]*
    Pattern is : aaaa;bbbb; 1123 aaf
    Matched pattern is : aaaa

    My question why regex engine select up to first match of ;
    Why regex engine is not going further after first ; since there are matching characters over there
    Is it because at first, the engine select greedy behavior and then non-greedy behavior (zero match or emty string) or something else?

Skip to main content