Regex 101 Discussion I5 – Remove unapproved HTML tags from a string


When accepting HTML input from a user, allow the following tags:


<b>


</b>


<a href=…>


</a>


<i>


</i>


<u>


</u>


and remove any others.


******


My first comment is that you should be very careful when you do this sort of thing, because if the user can slip script past your filter, that script will run in the browser of anyone who views the page. Which is bad. Some attacks exploit HTML escape characters so that what you see may not look like “<script>”.


So be forewarned, and forearmed. Or lower-backed, it doesn’t really matter which.


My approach is going to match all HTML tags, and then guard against the ones that I don’t want to match. So, I start with:


<.*?>


as my initial match. I’ll then refine it so that it won’t match the good tags, so I can use replace on the bad ones. I’ll start by not matching <b> and </b>:


<                        # opening <
  (?!                        # negative lookahead
   (b|/b)                      # b or /b
  )
  .*?                    # match between <>
>


What does that mean? Well, negative lookahead means “try to match the pattern at this point. If you do, the match fails”. It doesn’t eat any of the characters when it does this. In this case, it will try to match “b” or “/b” inside of the <>, and if it can, it will fail. If it can’t, it will succeed.
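
To make that concrete, here is a minimal sketch using Python's re module. The choice of Python is an assumption made for illustration (the post does not name an engine), and the sample strings are made up; the lookahead syntax itself is the same.

```python
import re

# The pattern from above, written on one line.
strip_tags = re.compile(r"<(?!(b|/b)).*?>")

# The lookahead is zero-width: once it succeeds, only "<" has been consumed.
probe = re.match(r"<(?!(b|/b))", "<script>")
consumed = probe.end()

# At <b> and </b> the lookahead pattern matches, so the overall match fails
# there and those tags are left alone; other tags are replaced with "".
cleaned = strip_tags.sub("", "<b>bold</b> <script>x</script>")
print(consumed, cleaned)
```

Here `consumed` is 1 and `cleaned` is `<b>bold</b> x`.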


It’s very much like the ^ and $ anchors – the match can only continue if a specific condition holds at that position, and no characters are consumed by the check. There are both positive and negative variants of lookahead and lookbehind.


Adding the other tags is pretty simple:


<                        # opening <
  (?!                        # negative lookahead
   (

    b|/b|                      # b or /b
    i|/i|                      # i or /i
    u|/u|                      # u or /u
    a\s+href.+?|/a             # a href= or /a
   )
  )
  .*?                    # match between <>
>


Doing the right thing with the string inside the “<a href=…>” is left as an exercise to the reader.
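
For readers who want to try the pattern, here is a sketch that feeds it to Python's re module in verbose mode, so the layout and comments above carry over unchanged. Python, the names `UNAPPROVED_TAG` and `strip_unapproved`, and the sample input are all assumptions for illustration.

```python
import re

# The pattern exactly as written above; re.VERBOSE ignores the whitespace
# and the # comments inside it.
UNAPPROVED_TAG = re.compile(
    r"""
    <                        # opening <
      (?!                    # negative lookahead
       (
        b|/b|                # b or /b
        i|/i|                # i or /i
        u|/u|                # u or /u
        a\s+href.+?|/a       # a href= or /a
       )
      )
      .*?                    # match between <>
    >
    """,
    re.VERBOSE,
)


def strip_unapproved(html):
    """Remove every tag the pattern matches; allowed tags never match."""
    return UNAPPROVED_TAG.sub("", html)


sample = '<b>hi</b> <a href="x">link</a> <script>alert(1)</script> <div>d</div>'
print(strip_unapproved(sample))
```

This prints `<b>hi</b> <a href="x">link</a> alert(1) d`. As the comments on this post point out, the lookahead only checks how a tag begins, so this version is not yet strict about what follows an allowed name.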

Comments (16)

  1. Maurits says:

    … so <a href="about:blank" onclick="while(true) { print(); }">gotcha</a> still goes through… or is that part of the exercise left to the reader? :)

  2. ericgu says:

    Yes, that is part of the exercise left to the reader.

  3. Maurits says:

    Sorry, but your regex doesn’t work.

    It correctly does not match <i>, <u>, etc.

    But it incorrectly does not match <img> and <iframe> and <ul> too…

    Here’s a fix.

    < # opening <

    (?! # negative lookahead

    (

    b|/b| # b or /b

    i|/i| # i or /i

    u|/u| # u or /u

    a\s+href.+?|/a # a href= or /a

    )

    # FIX

    > # let’s be strict

    # END FIX

    )

    .*? # match between <>

    >
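
A quick way to see what the extra > changes, sketched in Python (an assumption; the thread does not name an engine). The variable names and the sample string are made up for illustration.

```python
import re

alternatives = r"b|/b|i|/i|u|/u|a\s+href.+?|/a"

# Original: the lookahead only checks how the tag starts.
original = re.compile(r"<(?!(" + alternatives + r")).*?>")
# With the fix: an allowed name must be followed directly by ">".
fixed = re.compile(r"<(?!(" + alternatives + r")>).*?>")

sample = "<i>x</i><img><iframe><ul>"
print(original.sub("", sample))  # <img>, <iframe> and <ul> all survive
print(fixed.sub("", sample))     # only <i> and </i> survive
```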

  4. Maurits says:

    <img> is not as safe as it sounds because of <img dynsrc="…"> which is an implicit weakened <object> tag (shudder)

    http://msdn.microsoft.com/workshop/author/dhtml/reference/properties/dynsrc.asp

  5. Maurits says:

    I guess the moral that all of us (everyone that submitted a guess, and even the acclaimed author Eric himself) should take from this is…

    TEST YOUR CODE!

    Make a list of things that should get through:

    <a href="test">

    <u>

    </u>

    <b>

    </b>

    <i>

    </i>

    And a list of things that should be stripped:

    <img>

    <applet>

    <object>

    <script src="nasty-url">

    <a href="safe" onmouseover="nasty-script">test</a>

    And verify that the pattern gets all of them right.

    I can say this because I’m guilty of making the same mistake, several times over. :’-(
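
The two lists above translate directly into a small check. This is a sketch in Python (an assumption), run against the fixed pattern from the earlier comment; the helper name `failures` is hypothetical, and "stripped" is taken to mean that the replace removed at least one tag from the sample.

```python
import re

# Fixed pattern from the earlier comment (">" inside the lookahead).
pattern = re.compile(r"<(?!(b|/b|i|/i|u|/u|a\s+href.+?|/a)>).*?>")

should_pass = ['<a href="test">', "<u>", "</u>", "<b>", "</b>", "<i>", "</i>"]
should_strip = [
    "<img>",
    "<applet>",
    "<object>",
    '<script src="nasty-url">',
    '<a href="safe" onmouseover="nasty-script">test</a>',
]


def failures(regex):
    """Return the samples that the regex handles incorrectly."""
    wrong = []
    for sample in should_pass:
        if regex.sub("", sample) != sample:  # an allowed tag was removed
            wrong.append(sample)
    for sample in should_strip:
        if regex.sub("", sample) == sample:  # nothing was removed
            wrong.append(sample)
    return wrong


print(failures(pattern))
```

With this pattern the only sample reported is the onmouseover one, because `.+?` after href is free to swallow the extra attribute.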

  6. Maurits says:

    Here’s a version that allows multiple attributes (img src=… height=… alt=… etc.)

    It also catches nasty things like:

    <object



    >

    where the tag is spread over multiple lines… there is a RegexOptions value that will make the previous regex work for that, but this will work with or without the option (. matches \n)

    Permutation of attributes was the tricky bit

    Underscores added as a primitive indentation technique due to this blog collapsing whitespace…

    < # opening <

    (?! # negative lookahead

    __ (

    _____ b|/b| # b or /b

    _____ i|/i| # i or /i

    _____ u|/u| # u or /u

    _____ a # a allows attributes

    ________ ( # start of allowed attributes

    ___________ (\s|\r|\n)+ # attributes need leading space

    ___________ (href|target|title|rel) # allowed attributes

    ___________ (=( # attributes may take a value

    ______________ "[^"]*"| # which may be double-quoted

    ______________ '[^']*'| # or single-quoted

    ______________ [^'"<>]* # or bare

    ___________ )? # end of values

    ________ )* # end of allowed attributes

    _____ |/a| # /a allowed too, of course

    _____ img # img allows attributes

    ________ ( # start of allowed attributes

    ___________ (\s|\r|\n)+ # attributes need leading space

    ___________ (src|border|alt|height|width|title) # allowed attributes

    ___________ (=( # attributes may take a value

    ______________ "[^"]*"| # which may be double-quoted

    ______________ '[^']*'| # or single-quoted

    ______________ [^'"<>]* # or bare

    ___________ )? # end of values

    ________ )* # end of allowed attributes

    _____ # |/img # /img explicitly NOT allowed

    __ ) # end of tag/attribute options

    __ > # no tag prefixing or unauthorized attributes

    ) # end of negative lookahead

    (.|\r|\n)*? # match between <>

    > # closing >

    Note this is untested (despite my prior moralizing)

    Known issues… this does allow <a href> with no =… after the href.

  7. Maurits says:

    I see a bug in my code already… the

    )? # end of values

    lines should be

    ))? # end of values. What I wrote won’t even parse.

    There may be other bugs… I still haven’t tested it.

  8. Matthew W. Jackson says:

    I’m not at all convinced that this sort of checking belongs entirely in a regular expression. Bill Brown suggested using an expression to simply match any HTML elements and their attributes, and use code in the replace delegate to determine if they are valid.

    I would have a list of acceptable elements (I would probably discourage <b> and <i> in favor of <strong> and <em>, but that is just me), each with a list of acceptable attributes. In the case of <img>, a list of required attributes (src= and alt=) would be nice.

    But trying to cram all of that into a regular expression is an exercise for masochists.

    Or perhaps we just need a method somewhere that converts HTML tag soup into XHTML so we can use XML processing on this kind of thing, and then convert it back to HTML at some point.

    If I’m not mistaken, the *true* rules of HTML and SGML are a tad too complicated to parse with regular expressions, and each browser has its own interpretation of bad markup, which makes it possible to sneak through improper markup that does something unexpected when viewed in a browser.
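
A rough sketch of the "match any tag, decide in code" idea, in Python (an assumption; the comment talks about a replace delegate, and a replacement function passed to re.sub plays that role here). The allow-list contents and every name below are hypothetical, attribute values must be quoted in this sketch, and the href value itself is still not checked.

```python
import re

# Hypothetical allow-list: tag name -> attribute names permitted on it.
ALLOWED = {"b": set(), "i": set(), "u": set(), "a": {"href"}}

ANY_TAG = re.compile(r"<[^>]*>")
TAG_PARTS = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>")
ATTR = re.compile(r"""\s+([A-Za-z-]+)=(?:"[^"]*"|'[^']*')""")
ATTR_LIST = re.compile(r"""(?:\s+[A-Za-z-]+=(?:"[^"]*"|'[^']*'))*\s*""")


def keep_or_strip(match):
    """Replacement callback: return the tag unchanged if allowed, else ''."""
    tag = match.group(0)
    parts = TAG_PARTS.fullmatch(tag)
    if parts is None:
        return ""
    closing, name, rest = parts.groups()
    name = name.lower()
    if name not in ALLOWED:
        return ""
    if closing:
        # Closing tags may not carry anything after the name.
        return tag if not rest.strip() else ""
    if ATTR_LIST.fullmatch(rest) is None:
        return ""  # malformed or unquoted attributes: strip to be safe
    names = {m.group(1).lower() for m in ATTR.finditer(rest)}
    return tag if names <= ALLOWED[name] else ""


def sanitize(html):
    return ANY_TAG.sub(keep_or_strip, html)


print(sanitize('<b>x</b><a href="u" onclick="z">l</a><a href="u">m</a><img src="p">'))
```

The regex stays trivial and the policy lives in ordinary data and code, which is the trade-off this comment argues for.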

  9. kbiel says:

    Eric, you ignored my answer. The much simpler and stricter: (</?(?:u|i|b|a\s+href="[^">]*"|(?<=/)a)>)|</?[^>]*>

    A simple replace with "$1" strips out all HTML tags except those that were exactly specified.
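
For anyone who wants to try this, a sketch in Python (an assumption; "$1" is the replacement syntax used in the comment, and the callback below stands in for it). The function name and sample string are made up.

```python
import re

# The pattern from the comment: group 1 captures an exactly-allowed tag,
# the second alternative matches any other tag.
pattern = re.compile(r'(</?(?:u|i|b|a\s+href="[^">]*"|(?<=/)a)>)|</?[^>]*>')


def strip_unapproved(html):
    # Equivalent of replacing with "$1": keep group 1 when it matched,
    # otherwise substitute nothing.
    return pattern.sub(lambda m: m.group(1) or "", html)


sample = '<b>x</b> <bold>y</bold> <a href="u">l</a> <a href="u" onclick="z">m</a>'
print(strip_unapproved(sample))
```

This prints `<b>x</b> y <a href="u">l</a> m</a>`: the `<bold>` tags and the anchor with an extra attribute are removed, while the exact matches are kept.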

  10. Maurits says:

    I notice kbiel’s construct

    Replace("(a)|b", "$1")

    and eric’s construct

    Replace("(?!a)b", "")

    are very similar.

  11. kbiel says:

    They may seem similar, but there is one key difference. Eric’s construct will allow anything that begins with the cases he presents in the negative look-ahead assertion. So, <b> and <bold> and <bumble bee> and <balskdfj;sdkjfksda> will all fall through.

  12. Maurits says:

    Indeed… but only because he forgot to include the > in the (a) part. These two regexps are equivalent:

    (?!x)y style: (?!<(?:b|/b|i|/i|u|/u|a href=.*?|/a)>)<.*?>

    (x)|y style: (<(?:b|/b|i|/i|u|/u|a href=.*?|/a)>)|<.*?>

    except that Eric’s relies on negative lookahead and yours relies on $1.

    TIMTOWTDI, I guess :)

  13. kbiel says:

    Uh…Maurits, while you are right that the two styles you presented are equivalent, what you wrote for the negative look-ahead style is not what Eric constructed. He allowed fall-through because he placed his negative look-ahead within the tag markers (<>), while you moved the assertion outside of the tag markers and then included the tag markers in the assertion to make the match.

    Eric: <(?!(b|/b|i|/i|u|/u|a\s+href.+?|/a)).*?>

    Maurits: (?!<(?:b|/b|i|/i|u|/u|a href=.*?|/a)>)<.*?>

  14. Maurits says:

    I know, I fixed it 😉

    Eric (original, with bug:) <(?!(b|/b|i|/i|u|/u|a\s+href.+?|/a)).*?>

    Maurits (fixed v1, ugly but works:) <(?!(b|/b|i|/i|u|/u|a\s+href.+?|/a)>).*?>

    Maurits (fixed v2, pretty:) (?!<(?:b|/b|i|/i|u|/u|a href=.*?|/a)>)<.*?>

    I think my fix v1 is ugly because the <'s and >'s don’t balance. Petty, I know, but it’s who I am.