Regex 101 Exercise I5 – Remove unapproved HTML tags from a string


When accepting HTML input from a user, allow the following tags:


<b>
</b>
<a href=…>
</a>
<i>
</i>
<u>
</u>


and remove any others.

Comments (16)

  1. Maurits says:

    This one looks nice and challenging…

  2. Sheva says:

    It seems this can do the trick:

    <[^abiu>]+>

    Sheva

  3. Maurits says:

    But that lets through any tag that contains a, b, i, or u…

    <script> (has i)

    <p onload="…"> (has a)
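Maurits's objection can be checked directly. The sketch below uses Python's `re` module rather than the .NET `Regex` class that appears later in the thread; the helper name `strip_with_sheva` and the sample inputs are illustrative assumptions, not part of the original exercise.

```python
import re

# Sheva's pattern: matches a tag only if NO character between < and >
# is one of a, b, i, u.
SHEVA = re.compile(r"<[^abiu>]+>")

def strip_with_sheva(html: str) -> str:
    """Remove every tag that Sheva's pattern matches."""
    return SHEVA.sub("", html)

# <em> contains none of a, b, i, u, so it is stripped.
print(strip_with_sheva("<em>x</em>"))
# <script> contains an "i", so the pattern never matches it and it survives.
print(strip_with_sheva("<script>x</script>"))
```

The character class tests every character in the tag, not the tag name, which is why unapproved tags slip through whenever they happen to contain one of the four letters.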

  4. kbiel says:

    (</?(?:u|i|b|a\s+href="[^">]*")>)|</?[^>]*>

    Use:

    Regex.Replace(InputString, @"(</?(?:u|i|b|a\s+href=""[^"">]*"")>)|</?[^>]*>", "$1")

  5. Maurits says:

    kbiel, that’s close, but it strips </a> tags (as I read it)
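A rough Python equivalent of kbiel's "capture the approved tag, replace with group 1" idea, which also confirms Maurits's reading. This is a sketch using Python's `re` in place of .NET's `Regex.Replace`; the name `sanitize_v1` is hypothetical.

```python
import re

# kbiel's first pattern: approved tags are captured in group 1;
# any other tag is matched by the second alternative and captures nothing.
KBIEL_V1 = re.compile(r'(</?(?:u|i|b|a\s+href="[^">]*")>)|</?[^>]*>')

def sanitize_v1(html: str) -> str:
    """Write group 1 back for approved tags, nothing for the rest."""
    return KBIEL_V1.sub(lambda m: m.group(1) or "", html)

# The opening <a href="..."> is kept, but </a> has no href and so falls
# through to the second alternative and is removed.
print(sanitize_v1('<a href="x">link</a>'))
```

The closing tag is lost because the `a` alternative requires `\s+href="…"` in every case, including after the optional slash.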

  6. Ryan Heath says:

    How about making the href part optional?

    (</?(?:u|i|b|a(?:\s+href="[^">]*")?)>)|</?[^>]*>
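To see what making the href part optional changes, here is a small sketch in Python's `re` (not the .NET API used in the thread). The function name `sanitize_optional_href` and the inputs are assumptions made for illustration.

```python
import re

# Ryan Heath's variant: the href attribute on <a> is optional.
RYAN = re.compile(r'(</?(?:u|i|b|a(?:\s+href="[^">]*")?)>)|</?[^>]*>')

def sanitize_optional_href(html: str) -> str:
    """Keep tags captured by group 1, drop every other tag."""
    return RYAN.sub(lambda m: m.group(1) or "", html)

# </a> now survives, and so does a bare <a> without any href.
print(sanitize_optional_href('<a href="x">link</a>'))
print(sanitize_optional_href("<a>bare</a>"))
# An <a> carrying an extra attribute does not fit group 1 and is stripped,
# while its closing tag is kept.
print(sanitize_optional_href('<a href="x" onclick="y">t</a>'))
```

Note the side effects: a bare `<a>` is accepted, and an opening tag with extra attributes is removed while its `</a>` stays behind.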

  7. Jeno Laszlo says:

    Hi, the correct pattern is:

    </*[^b]{1}[^>]*>

  8. Jeno Laszlo says:

    Sorry, to keep the <b>, <a>, <i> and <u>, the pattern is </?[^abiu/]{1}[^>]?>

    🙂

  9. Ryan Heath says:

    @Jeno

    Character negation is less extensible.

    What if you want to extend it to tags such as <table>, <td>, <tr>, etc.?

    I think kbiel’s expression is the best so far…

  10. Jeno Laszlo says:

    Ok, the regex to keep <b>, <a>, <i>, <u>, <table>, <td> and <tr> is the following:

    </?(?!a|b|i|u|table|tr|td|/)[^>]*>

    This will keep all the properties like href, src, width, etc.

  11. Ryan Heath says:

    Close, but tags like <img> or <applet> are not excluded…
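Ryan's point is that the negative lookahead only inspects the first letters of the tag name, so any tag that merely starts with an approved name is left alone. A minimal check with Python's `re` (the thread itself targets .NET); `strip_with_lookahead` is a hypothetical helper name.

```python
import re

# Jeno's lookahead pattern: strip a tag unless its name STARTS WITH
# a, b, i, u, table, tr or td (or the lookahead sits on a slash).
JENO_LOOKAHEAD = re.compile(r"</?(?!a|b|i|u|table|tr|td|/)[^>]*>")

def strip_with_lookahead(html: str) -> str:
    """Remove every tag matched by the lookahead pattern."""
    return JENO_LOOKAHEAD.sub("", html)

# <img> starts with "i" and <applet> starts with "a", so both are kept;
# <script> is stripped as intended.
print(strip_with_lookahead('<img src="x"><applet></applet><script>y</script>'))
```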

  12. Maurits says:

    Eric, is part of the exercise to eliminate "approved" tags with "unapproved" properties?

    For example, should this be stripped?

    <a href="…" onclick="…">…</a>

  13. Bill Brown says:

    Here’s what I made:

    Regex anyTag = new Regex(@"<[/]{0,1}\s*(?<tag>\w*)\s*(?<attr>.*?=['""].*?[""'])*?\s*[/]{0,1}>");

    Then I use a MatchEvaluator that uses two string[] containing the acceptable tags and attributes.
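Bill's evaluator is not shown, so the following is only a guess at the general shape of the approach, written with Python's `re.sub` and a callback instead of a .NET `MatchEvaluator`. The simplified patterns, the `ALLOWED_TAGS` / `ALLOWED_ATTRS` tables, and the function names are all assumptions.

```python
import re

ALLOWED_TAGS = {"a", "b", "i", "u"}   # from the exercise statement
ALLOWED_ATTRS = {"a": {"href"}}       # assumption: only href, only on <a>

# Simplified stand-ins for Bill's pattern: any tag, then quoted attributes.
ANY_TAG = re.compile(r"<(?P<close>/?)\s*(?P<tag>\w*)(?P<attrs>[^>]*)>")
ATTR = re.compile(r"""(?P<name>[\w-]+)\s*=\s*(?P<value>"[^"]*"|'[^']*')""")

def _evaluate(match: re.Match) -> str:
    """Rebuild an approved tag from approved attributes, else drop it."""
    tag = match.group("tag").lower()
    if tag not in ALLOWED_TAGS:
        return ""
    if match.group("close"):
        return f"</{tag}>"
    kept = [
        f' {a.group("name")}={a.group("value")}'
        for a in ATTR.finditer(match.group("attrs"))
        if a.group("name").lower() in ALLOWED_ATTRS.get(tag, set())
    ]
    return f"<{tag}{''.join(kept)}>"

def sanitize(html: str) -> str:
    return ANY_TAG.sub(_evaluate, html)

# The onclick attribute is dropped, href is kept.
print(sanitize('<a href="x" onclick="y">t</a>'))
```

Moving the whitelist into data keeps the regex short and makes it easy to add tags later. This sketch does not inspect the `href` value itself, and a `>` inside an attribute value would end the tag match early.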

  14. Jeno Laszlo says:

    Hi, I modified my regex to include <img>, <applet> and other tricky tags too. The

    </?(((?!a|b|i|u|table|tr|td|/)[^>]*)|([abiu]\w{1,}))>

    is working fine for me, but if you are worried about XML data island and other funky stuff I suggest using the

    </?(((?!a|b|i|u|table|tr|td|/)[^>]*)|((a|b|i|u|table|tr|td)\w{1,}))>

    If anybody has any tips on how to make it shorter or knows tricks to fool it, please let me know. I tried creating a named group for the tag list and reusing it, but it did not work for me.

  15. Jeno Laszlo says:

    Sorry guys, I found a bug in my code, these are fixed versions:

    </?(((?!a|b|i|u|table|tr|td|/)[^>]*)|([abiu][^\s>]{1,}))*>

    and

    </?(((?!a|b|i|u|table|tr|td|/)[^>]*)|((a|b|i|u|table|tr|td)[^\s>]{1,}))*>
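The two fixed patterns can be compared side by side. The sketch below uses Python's `re` rather than .NET and restores the `\s` that the patterns need; the constant names, `strip_tags`, and the `<track>` example are assumptions chosen for illustration.

```python
import re

# Short form: second branch catches tags that start with a, b, i or u
# and continue with at least one more non-space character (e.g. <img>).
JENO_SHORT = re.compile(
    r"</?(((?!a|b|i|u|table|tr|td|/)[^>]*)|([abiu][^\s>]{1,}))*>"
)
# Full form: the same idea applied to every approved name, including tr/td.
JENO_FULL = re.compile(
    r"</?(((?!a|b|i|u|table|tr|td|/)[^>]*)|((a|b|i|u|table|tr|td)[^\s>]{1,}))*>"
)

def strip_tags(pattern: re.Pattern, html: str) -> str:
    """Remove every tag the given pattern matches."""
    return pattern.sub("", html)

sample = '<b>x</b><img src="y"><script>z</script>'
print(strip_tags(JENO_SHORT, sample))
print(strip_tags(JENO_FULL, sample))
# <track> starts with "tr": the short form leaves it in, the full form
# removes it.
print(strip_tags(JENO_SHORT, "<track>"), strip_tags(JENO_FULL, "<track>"))
```

Both forms still keep approved tags with any attributes, so `<a href="…" onclick="…">` passes through unchanged.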

  16. kbiel says:

    Good catch, Maurits. This will do it:

    (</?(?:u|i|b|a\s+href="[^">]*"|(?<=/)a)>)|</?[^>]*>
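A quick check of the lookbehind fix, again as a sketch in Python's `re` (which supports fixed-width lookbehind such as `(?<=/)`) rather than the .NET engine kbiel was using. The name `sanitize_v2` is hypothetical.

```python
import re

# kbiel's revised pattern: a bare "a" is accepted only when the character
# before it is a slash, i.e. only in the closing tag </a>.
KBIEL_V2 = re.compile(
    r'(</?(?:u|i|b|a\s+href="[^">]*"|(?<=/)a)>)|</?[^>]*>'
)

def sanitize_v2(html: str) -> str:
    """Keep tags captured by group 1, drop every other tag."""
    return KBIEL_V2.sub(lambda m: m.group(1) or "", html)

# Both halves of the link now survive.
print(sanitize_v2('<a href="x">link</a>'))
# A bare <a> without href is still stripped, though its </a> is kept.
print(sanitize_v2("<a>bare</a>"))
```

Unlike the optional-href variant, this version still insists on an `href` for the opening tag, and an `<a>` with any additional attribute is removed.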