Regex, HTML, and my sanity

The answer I came up with is at the bottom. But first, a brief digression.

There were several responses to my regex puzzle. They can be grouped into:

  1. Here's how you do it
  2. Here's how you do it without using regex
  3. Using regex on this problem will cause the Dow to drop and the end of Western civilization as we know it (Question: Would that mean that Eastern civilization would take over? Discuss this in your group, and be ready to present to the larger group when we get back together)
  4. How *do* you do that?

#1 was the kind of response I expected. My original idea was to highlight a regex technique that made this a lot easier and more robust than the code I had seen suggested.

#2 is interesting. Clearly, if you can find a good library - and it's not more effort to prove that it is good than to create your own (remembering that you always underestimate how much effort it is to do it yourself) - you should use it. But that really wasn't the point of my post. My question was "how would you do it using regex?".

Which brings us to #3. While I agree with Raymond that there are cases where regex is more trouble than it's worth - something like brace matching comes to mind - I'm not sure that I agree in this case. You can't use an XML approach because HTML isn't required to be well-formed, which means you're either using a library or writing custom code. I'm not convinced that custom code is going to be more robust than a well-written regex without a fair bit of testing, and I do know that it would be just as easy (perhaps easier) to write custom code that isn't robust as it would be to write a regex that isn't robust.

On to our solution. Note that I'm not claiming that this is a robust and tested solution - I'm more interested in showing off a regex technique. If you want to use it for real, be sure to test it well.

Conventional regex systems would require us to enumerate every tag that we want to replace. In that direction lies madness, as it's pretty likely you won't get it right. The code I saw, for example, didn't even replace "<script>". But .NET regex (and current Perl syntax, IIRC...) allows you to use zero-width assertions to specify what you don't want to match.

The first step is to create something that matches an XML-style tag. The simplest version is:

<.+?>

which works great if there are no embedded ">" characters inside the tag. To be able to handle a quoted attribute such as

 <button text="<Hello>">

we'll need to modify the regex to handle that case specifically. Here's the regex to do it:

( # group
[^"]+? # One or more non-" chars. Matches tag with no quotes. non-greedy
| # or
# match something like <fred a="<5>">
.+? # Everything up to ", non-greedy
" # literal "
.*? # zero or more characters inside the quotes, non-greedy
" # literal "
.*? # zero or more characters after quote, non-greedy
)
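
To see the difference between the two versions, here's a small sketch. It's in Python rather than .NET, purely as an assumption for illustration: Python's `re` module supports the same non-greedy quantifiers and has a comment mode (`re.VERBOSE`) much like the one used above. The names `SIMPLE_TAG` and `QUOTE_AWARE_TAG` are made up, and I've wrapped the group in `<` and `>` so it can be tried on its own.

```python
import re

# Simplest version: "<", then as few characters as possible, then ">".
SIMPLE_TAG = re.compile(r'<.+?>')

# The group above, wrapped in "<" and ">" (an addition for this sketch).
QUOTE_AWARE_TAG = re.compile(r'''
    <
    (
        [^"]+?      # tag with no quotes, non-greedy
      |             # or
        .+?         # everything up to the first quote
        "
        .*?         # contents of the quoted value
        "
        .*?         # anything after the closing quote
    )
    >
''', re.VERBOSE)

sample = '<button text="<Hello>">'

simple_match = SIMPLE_TAG.match(sample).group(0)
quote_aware_match = QUOTE_AWARE_TAG.match(sample).group(0)

print(simple_match)       # stops at the ">" inside the quoted value
print(quote_aware_match)  # runs to the real end of the tag
```

The simple pattern gives up at the first ">" it sees, which is the one inside the attribute value. The quote-aware version only accepts a ">" before any quote or after a complete quoted pair.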

Now that we have that, we have to tell it what tags not to match. We can do that with a negative lookahead:

(?!br|/br|p|/p)

The key to the lookaheads/lookbehinds is that they don't consume any characters. So, this says "it's okay to match at this point unless the next characters are one of 'br', '/br', 'p', or '/p'" (yes, you'd need to use a case-insensitive match to cover both upper and lowercase versions).
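
Here's a quick way to poke at the lookahead by itself. Again, this is a sketch in Python (an assumption, since the post is about .NET regex), and `NOT_ALLOWED_START` and `starts_disallowed_tag` are hypothetical names. The pattern is just a literal `<` followed by the negative lookahead, compiled case-insensitively.

```python
import re

# "<" followed by anything except one of the listed tag names.
# The lookahead only inspects the upcoming text; it consumes nothing.
NOT_ALLOWED_START = re.compile(r'<(?!br|/br|p|/p)', re.IGNORECASE)

def starts_disallowed_tag(text):
    """Return True if text starts with '<' and the lookahead does not reject it."""
    return NOT_ALLOWED_START.match(text) is not None

print(starts_disallowed_tag('<script>'))  # True
print(starts_disallowed_tag('<br>'))      # False
print(starts_disallowed_tag('</P>'))      # False (case-insensitive)
print(starts_disallowed_tag('<pre>'))     # False: "pre" begins with "p"

# Only the "<" was consumed, so the match ends at index 1.
print(NOT_ALLOWED_START.match('<script>').end())  # 1
```

Note the `<pre>` case: as written, the lookahead rejects anything that merely starts with one of the listed names. Adding a word boundary (`\b`) after the names is one possible way to tighten that, though it's untested here.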

Lookahead is a great feature to have if you're trying to do more than one thing in a regex. Here's the full regex.

< # opening < of the tag
(?!br|/br|p|/p) # negative lookahead. Match will fail if any of these are present
( # group
[^"]+? # One or more non-" chars. Matches tag with no quotes. non-greedy
| # or
# match something like <fred a="<5>">
.+? # Everything up to ", non-greedy
" # literal "
.*? # zero or more characters inside the quotes, non-greedy
" # literal "
.*? # zero or more characters after quote, non-greedy
)
> # close of tag
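
To try the whole thing end to end, here's a minimal sketch of the full regex used to strip tags. As before, Python is an assumption (the regex is meant for .NET, where you'd use the ignore-whitespace and ignore-case options), and `STRIP_TAGS`, `strip_disallowed_tags`, and the sample string are invented for illustration. It hasn't been tested beyond the input shown.

```python
import re

STRIP_TAGS = re.compile(r'''
    <                   # opening < of the tag
    (?!br|/br|p|/p)     # negative lookahead: fail if one of these follows
    (                   # group
        [^"]+?          # tag with no quotes, non-greedy
      |                 # or
        .+?             # everything up to the first quote
        "
        .*?             # contents of the quoted value
        "
        .*?             # anything after the closing quote
    )
    >                   # close of tag
''', re.VERBOSE | re.IGNORECASE)

def strip_disallowed_tags(html):
    """Remove every tag except the ones excluded by the lookahead."""
    return STRIP_TAGS.sub('', html)

sample = '<p>Hi <b>there</b><br><button text="<Hello>">Go</button></p>'
print(strip_disallowed_tags(sample))  # → <p>Hi there<br>Go</p>
```

The `<p>`, `<br>`, and `</p>` tags survive because the lookahead refuses to start a match on them; everything else, including the button with the embedded ">", gets removed.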