Regular expressions and the dreaded *? operator


The regular expression *? operator means “Match as few characters as necessary to make this pattern succeed.” But look at what happens when you mix it up a bit:

".*?"

This pattern matches a quoted string containing no embedded quotes. This works because the first quotation mark starts the string, the .*? gobbles up everything in between, and then the second quotation mark eats the close-quote.

(Note how this differs from ".*", which uses a greedy match. This time, the .* operator is perfectly happy to gobble up quotation marks, as long as it leaves one to match the second quotation mark in the pattern.)

Okay, great, now let’s make a small change to the above pattern:

".*?">

All I did was stick a > at the end of the pattern. This would therefore match a quoted string (containing no quotation marks) followed by a > character, right?

Wrong.

There’s nothing in .*? that says “no quotation marks allowed”. It just says “Don’t match more than you need to.” But there are strings where it needs to match a quotation mark. Consider:

" hello"world " >
" .*? " >

Notice that here, the .*? pattern matched the inner quotation mark because that was the only way to make the pattern match successfully. (“I wouldn’t have done it, but you forced me!”)

Even smart people make this mistake.

If you really don’t want quotation marks to match the .*? then you need to say so.

"[^"]*">

This means “Match a quotation mark, then zero or more characters that aren’t quotation marks, then another quotation mark, and then a greater-than.”

[Raymond is currently on vacation; this message was pre-recorded.]

Comments (10)
  1. asdf says:

    That’s why I always use something like (?:".*?")>

  2. asdf says:

    Which of course didn’t apply to the thing you were talking about.

  3. Norman Diamond says:

    If *? really matched as few characters as necessary to make the pattern succeed, then the matched portion of

    "hello"world">

    would be

    "world">

    for either pattern

    ".*?">

    or

    "[^"]*">

    The inclusion of

    "hello

    was due to left-handed greediness.

    So ONLY smart people would be affected by this designer’s mistake. Dumb people wouldn’t be affected because we’ve used things like "[^"]*"> all our lives (or at least since the time Bell Labs started playing with a Honeywell computer) and we never learned about *?

  4. Ben Hutchings says:

    Matt: That’s what I was thinking, but everything after the "</div>" has to match as well, so the "(.*?)" could capture characters beyond the first "</div>". Some regex systems allow you to "commit" and disallow backtracking after matching up to some point in the regex, which you would want to do after the "</div>". In general you don’t have this option.

  5. Matt C. Wilson says:

    True. I guess I was thinking about just matching <div> contents tag by tag. But a nested div would blow that up. It would be really nice if there were some kind of recursive operation, that would allow you to say something like "give me everything inside <div></div>, with up to 2 nested <div></div> sequences" Is there such a beast, of which I am again unaware?

    Then of course you have to pray that the html source is well-formed :)

  6. Ben Hutchings says:

    That sounds like a job for a relaxed HTML parser (such as IE and Moz use when they don’t see a DTD) and DOM. I don’t know that it’s possible to use just the parser and DOM from these without showing a page though.

  7. foo says:

    I use things like "[^"]*"> all the time. I know about *?, but never found it useful. IIRC, it’s called "lazy quantifier" in the Perl world.

    Can somebody provide an interesting example when you would want to use *?, maybe because an equivalent regex with greedy quantifiers is more complex?

  8. Matt C. Wilson says:

    Count me in the ranks of the dumb then. I never knew *? existed. I always use [^"]* syntax.

    I can see this being really helpful parsing html, as in

    <div>(*?)</div> as a search target.

  9. Ben Hutchings says:

    I’ve been scraping data off web pages and using ".*?" to skip over uninteresting bits that may contain many tags. I’ve since stopped doing that and changed the regexes as I get around to it, because it’s quite dangerous.

    First, if you’re doing a global search to find multiple matches in a page, and the regex isn’t quite right, then "A.*?B" (where A and B are themselves regexes) can match the A part of one instance, fail to match the B part of that instance and so end up matching the B part of the next instance, so it captures a mixture of two records! (This is just a worse version of what Raymond pointed out.)

    Second, a search using a regex with multiple lazy parts can take a very long time to fail, at least in Python, due to backtracking. (This can happen with multiple greedy parts but seems not to do so in general; I don’t understand the theory well enough to see why this might be.)

Comments are closed.