Regex, HTML, and my sanity

The answer I came up with is at the bottom. But first, a brief digression.

There were several responses to my regex puzzle. They can be grouped into:

  1. Here’s how you do it

  2. Here’s how you do it without using regex

  3. Using regex on this problem will cause the Dow to drop and the end of Western civilization as we know it (Question: Would that mean that eastern civilization would take over? Discuss this in your group, and be ready to present to the larger group when we get back together)

  4. How *do* you do that?

#1 was the kind of response I expected. My original idea was to highlight a regex technique that made this a lot easier and more robust than the code I had seen suggested.

#2 is interesting. Clearly, if you can find a good library – and it’s not more effort to prove that it is good than to create your own (remembering that you always underestimate how much effort it is to do it yourself), you should use it. But that really wasn’t the point of my post – my question was “how would you do it using regex?”

Which brings us to #3. While I agree with Raymond that there are cases where regex is more trouble than it’s worth – something like brace matching comes to mind – I’m not sure that I agree in this case. You can’t use an XML approach because HTML isn’t required to be well-formed, which means you’re either using a library or writing custom code. I’m not convinced that custom code is going to be more robust than a well-written regex without a fair bit of testing, and I do know that it would be just as easy (perhaps easier) to write custom code that isn’t robust as it would be to write a regex that isn’t robust.

On to our solution. Note that I’m not claiming that this is a robust and tested solution – I’m more interested in showing off a regex technique. If you want to use it for real, be sure to test it well.

Conventional regex systems would require us to enumerate every tag that we want to replace. In that direction lies madness, as it’s pretty likely you won’t get it right. The example I saw, for instance, didn’t even replace “<script>”. But .NET regex (and current Perl syntax, IIRC…) allows you to use zero-width assertions and specify what you don’t want to match.

The first step is to create something that matches an XML tag. The simplest version is:


which works great if there are no embedded “>” characters inside the tag. To be able to handle a quoted attribute such as

 <button text="<Hello>">

we’ll need to modify the regex to handle that case specifically. Here’s the regex to do it:

(                    # group
[^"]+?                 # One or more non-" chars. Matches tag with no quotes. non-greedy
|                      # or
                       # match something like <fred a="<5>">
.+?                     # Everything up to ", non-greedy
"                       # literal "
.*?                     # zero or more characters after opening quote, non-greedy
"                       # literal "
.*?                     # zero or more characters after closing quote, non-greedy
)                    # end group

Now that we have that, we have to tell it what tags not to match. We can do that with a negative lookahead:

(?!br|/br|p|/p)

The key to the lookaheads/lookbehinds is that they don’t consume any characters. So this says, “It’s okay to match at this point unless what follows is one of br, /br, p, or /p” (yes, you’d need to use a case-insensitive match to cover both upper- and lowercase versions).
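To make the zero-width behavior concrete, here is a minimal sketch in Python, whose `re` module supports the same `(?!...)` syntax. The sample strings are my own, and Python is standing in for .NET here, so treat it as an illustration rather than the exact code from the post.

```python
import re

# (?!...) is a zero-width negative lookahead: it inspects what follows "<"
# without consuming any of it.
pattern = re.compile(r"<(?!br|/br|p|/p)", re.IGNORECASE)

samples = ["<br>", "</P>", "<script>", "<b>", "<pre>"]
results = {s: bool(pattern.match(s)) for s in samples}
print(results)

# Because the lookahead consumes nothing, a successful match is only the "<".
m = pattern.match("<script>")
print(repr(m.group(0)), m.end())
```

Note that `<pre>` is rejected as well, because the lookahead only checks that the tag name starts with “p”; it does not check where the name ends.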

Lookahead is a great feature to have if you’re trying to do more than one thing in a regex. Here’s the full regex.

<                    # opening < of the tag
(?!br|/br|p|/p)      # negative lookahead. Match will fail if any of these are present
(                    # group
[^"]+?                 # One or more non-" chars. Matches tag with no quotes. non-greedy
|                      # or
                       # match something like <fred a="<5>">
.+?                     # Everything up to ", non-greedy
"                       # literal "
.*?                     # zero or more characters after opening quote, non-greedy
"                       # literal "
.*?                     # zero or more characters after closing quote, non-greedy
)                    # end group
>                   # close of tag
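Here is a rough sketch of how the full pattern could be used to strip tags, written in Python rather than .NET. It assumes the group closes just before the final `>` and that the two literal quotes sit between the non-greedy pieces; `re.VERBOSE` plays the role of a whitespace-and-comments mode, and the function name `strip_tags_except_br_p` is made up for illustration. As with the regex itself, this is not tested beyond a couple of simple inputs.

```python
import re

TAG_EXCEPT_BR_P = re.compile(
    r'''
    <                  # opening < of the tag
    (?!br|/br|p|/p)    # negative lookahead: fail if the tag name starts this way
    (                  # group
      [^"]+?           # tag with no double quotes, non-greedy
      |                # or
      .+?"             # everything up to the first double quote
      .*?"             # the quoted value and its closing quote
      .*?              # anything after the closing quote
    )                  # end group (assumed position)
    >                  # close of tag
    ''',
    re.VERBOSE | re.IGNORECASE,
)


def strip_tags_except_br_p(html: str) -> str:
    """Remove every tag the pattern matches, leaving br and p tags alone."""
    return TAG_EXCEPT_BR_P.sub("", html)


plain = strip_tags_except_br_p("<p>Hello <b>world</b><br><script>x()</script></p>")
quoted = strip_tags_except_br_p('<button text="<Hello>">Click</button>')
print(plain)
print(quoted)
```

The second input is the quoted-attribute case from earlier: the first alternative cannot get past the double quote, so the second alternative takes over and matches the whole tag.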

Comments (12)

  1. James Geurts says:

    Thanks for taking the time to detail how you came up with the solution.

  2. Raymond Chen says:

    I didn’t mean that regexps shouldn’t be involved at all. To me this was a job for using regexps partway and normal code the rest of the way.

    By the way I think you forgot a + after the grouping. Otherwise you can’t handle <A HREF="X" TARGET="Y">

    I would have used something like


    to match a tag, and then used code to reject p and br.

    Note that non-greedy matching doesn’t mean "never ever match a quote"; it will do so if it is forced to. So your code will match

    <tag x="a"b">

    since the non-greedy .*? between the quotes matches a"b.
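The point about non-greedy matching can be checked with a small Python sketch. The two patterns below are simplified stand-ins of my own, not the regex from the post: one uses a lazy `.*?` between the quotes, the other a negated character class that can never match a quote.

```python
import re

text = '<tag x="a"b">'

# Lazy quantifier: prefers a short match, but expands across a quote
# when that is the only way for the overall pattern to succeed.
lazy = re.fullmatch(r'<tag x="(.*?)">', text)

# Negated class: cannot match a quote under any circumstances.
strict = re.fullmatch(r'<tag x="([^"]*)">', text)

print(lazy.group(1))
print(strict)
```

The lazy version captures the stray quote along with the letters around it, while the negated-class version refuses to match at all.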

  3. I haven’t tried this (sorry!) but would your example handle "<br />"?

  4. Andrew says:

    I know you aren’t claiming to have a robust solution here, but off the top of my head I can think of several cases where your regex will fail

    safe failures * (no promises:)

    <BR> [using case-insensitive regex would solve this]

    <tag blah='>' > [HTML allows single quotes]

    <tag blah="\">" > [Escape sequences]

    <tag blah=">" blah2=">" > [Multiple quoted attributes]

    If you were going to be using this regex as an attempt to prevent script injection attacks (if only br and p are allowed (or other simple formatting), then cross site scripting attacks are prevented) you would find that it is easily circumvented.

    for example:

    <script blah='"' language='javascript'>

    alert("you were just hacked");

    </script blah='"'>
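The injection example above can be tried against a one-line form of the post's pattern. This is a sketch in Python, assuming the group closes just before the final `>`; the variable names are illustrative only.

```python
import re

# One-line form of the post's pattern (assumed reconstruction).
ARTICLE_PATTERN = re.compile(
    r'<(?!br|/br|p|/p)([^"]+?|.+?".*?".*?)>', re.IGNORECASE
)

payload = (
    "<script blah='\"' language='javascript'>\n"
    'alert("you were just hacked");\n'
    "</script blah='\"'>"
)

# Each script tag contains exactly one double quote on its line, so neither
# alternative can match: the first stops at the quote, and the second needs
# a second quote before the line ends ("." does not match a newline here).
cleaned = ARTICLE_PATTERN.sub("", payload)
print(cleaned == payload)
```

Under these assumptions neither script tag is matched, so the substitution leaves the payload untouched.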

  5. Raymond Chen says:

    Actually escape sequences are not allowed in HTML, so that’s a non-starter. But the other concerns are valid.

    If this were for preventing script injection attacks, I would just use a sledgehammer. Keep <P> and <BR> and change all other < and > to &lt; and &gt;.
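A minimal sketch of the sledgehammer approach in Python might look like the following. The function name `sledgehammer`, the decision to accept `<br />`, and the decision to escape `<p>` tags that carry attributes are my assumptions, not something the comment specifies.

```python
import re

# Either an allowed bare tag (<p>, </p>, <br>, <br />), captured in group 1,
# or any single angle bracket.
TOKEN = re.compile(r"(</?(?:p|br)\s*/?>)|[<>]", re.IGNORECASE)


def sledgehammer(html: str) -> str:
    """Keep bare p and br tags; escape every other angle bracket."""

    def replace(match: re.Match) -> str:
        if match.group(1):
            return match.group(1)  # allowed tag, keep as-is
        return "&lt;" if match.group(0) == "<" else "&gt;"

    return TOKEN.sub(replace, html)


safe = sledgehammer('<P>Hi<BR><script>alert("x")</script></P>')
print(safe)
print(sledgehammer("<pre>"))
```

Because the allowed tags are listed exactly instead of being excluded by a lookahead, `<pre>` is escaped like any other tag.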

  6. Take Outs: The Digital Doggy Bag of Blog Bits for 26 February 2004

  7. Nice challenge, I had to dust off my old Perl books 🙂

    I made a small correction to your regexp (I assume ignorecase is on)

    </? # match < and </

    \b # start group

    (?! # start negative lookahead

    \b(br|p)\b # group the words, prevents matching <pre>

    ) # end negative lookahead

    \b # end group

    [^>]+? # match everything except >

    > # end of tag

    with the comments removed, it looks like this:


    btw: If you want to match every tag, you should use <[^>]+>

    An excellent tool for testing this is Expresso (

    and you’ll find a lot of information at



  8. Eric TF Bat says:

    Here’s a question for you: my beard is too long; I need to shave; I have a chainsaw. How should I proceed?

  9. AvonWyss says:

    Note that your regex does not respect single quotes, which I believe are valid in HTML:

    <bla attr='>'>

    Will not be matched correctly.

  10. Need_to_know says:

    I have a typical case, wherein I am searching for different procedures (starting with p_ or sp_ or dbo.sp_ etc.). But I want to exclude the procedures under comments, i.e., all procs that fall between /* and */
