Regex 101 Discussion I5 - Remove unapproved HTML tags from a string

When accepting HTML input from a user, allow the following tags:

<b>

</b>

<a href=…>

</a>

<i>

</i>

<u>

</u>

and remove any others.

******

My first comment is that you should be very careful when you do this sort of thing, because if the user can slip script past your filter, that script will run in the browser of everyone who views the page. Which is bad. Some attacks use HTML character escapes, so what you see may not look like "<script>".

So be forewarned, and forearmed. Or lower-backed, it doesn't really matter which.

My approach is going to match all HTML tags, and then guard against the ones that I don't want to match. So, I start with:

<.*?>

as my initial match. I'll then refine it so that it won't match the good tags, so I can use replace on the bad ones. I'll start by not matching <b> and </b>:

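< # opening <
(?! # negative lookahead
(b|/b) # b or /b
> # followed directly by the closing >
)
.*? # match between <>
>

What does that mean? Well, negative lookahead means "try to match the pattern at this point. If it matches, the overall match fails". It doesn't eat any of the characters when it does this. In this case, it will try to match "b>" or "/b>" right after the opening <, and if it can, the match fails. If it can't, the match succeeds. The > inside the lookahead matters: without it, any tag that merely starts with b, such as <br> or <body>, would also be skipped and left in the string.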
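To see the two steps side by side, here is a minimal sketch in Python, assuming the `re` module with `re.VERBOSE` so that the commented layout can be kept as-is. The sample string and the names `ALL_TAGS` and `NOT_BOLD` are made up for illustration, and the `>` inside the lookahead is part of this sketch so that `<br>` is not mistaken for `<b>`.

```python
import re

# Step 1: match every tag.
ALL_TAGS = re.compile(r"<.*?>")

# Step 2: the same match, guarded so that <b> and </b> are skipped.
NOT_BOLD = re.compile(
    r"""
    <            # opening <
    (?!          # negative lookahead
      (b|/b)>    # b or /b, followed directly by the closing >
    )
    .*?          # match between <>
    >
    """,
    re.VERBOSE,
)

sample = "<p>Some <b>bold</b> text<br></p>"  # hypothetical input

stripped_all = ALL_TAGS.sub("", sample)
stripped_guarded = NOT_BOLD.sub("", sample)

print(stripped_all)      # Some bold text
print(stripped_guarded)  # Some <b>bold</b> text
```

The guarded pattern still removes `<p>`, `<br>` and `</p>`, because the lookahead only fails when the tag is exactly `<b>` or `</b>`.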

It's very much like the ^ and $ anchors: it consumes no characters and only tests a condition at the current position. With a negative lookahead, the match can continue only if that condition is not met. There are both positive and negative variants of lookahead and lookbehind.
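The four variants can be tried out with a small sketch, again assuming Python's `re` module. The variable names and test strings are illustrative only. Each assertion tests a single character next to the current position and consumes nothing.

```python
import re

word = r"\w+"

# Negative lookahead: continue only if the next character is NOT "b".
neg_ahead = re.compile(r"(?!b)" + word)
# Positive lookahead: continue only if the next character IS "b".
pos_ahead = re.compile(r"(?=b)" + word)
# Positive lookbehind: continue only if the previous character is "/".
pos_behind = re.compile(r"(?<=/)" + word)
# Negative lookbehind: continue only if the previous character is not "/".
neg_behind = re.compile(r"(?<!/)" + word)

print(neg_ahead.match("bold"))             # None
print(neg_ahead.match("italic").group())   # italic
print(pos_ahead.match("bold").group())     # bold
print(pos_behind.search("</b>").group())   # b
print(neg_behind.search("</b>"))           # None
```

Note that `pos_ahead` returns the whole word "bold": the "b" was checked by the lookahead and then matched again by `\w+`, which shows that the lookahead did not eat it.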

Adding the other tags is pretty simple:

< # opening <
(?! # negative lookahead
(
b|/b| # b or /b
i|/i| # i or /i
u|/u| # u or /u
a\s+href=[^>]*|/a # a href=... or /a
)
> # the allowed tag must end here, so <br> or <img> is not let through
)
.*? # match between <>
>
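Put to work, the whole thing might look like the sketch below. It assumes Python's `re` module. The function name `strip_unapproved_tags`, the sample input, and the use of `re.IGNORECASE` (so that `<B>` is treated like `<b>`) are choices made for this example, as is the `>` that closes the lookahead.

```python
import re

UNAPPROVED_TAG = re.compile(
    r"""
    <                        # opening <
    (?!                      # negative lookahead
      (
        b|/b|                # b or /b
        i|/i|                # i or /i
        u|/u|                # u or /u
        a\s+href=[^>]*|/a    # a href=... or /a
      )
      >                      # the allowed tag must end here
    )
    .*?                      # match between <>
    >
    """,
    re.VERBOSE | re.IGNORECASE,
)


def strip_unapproved_tags(html):
    """Remove every tag except b, i, u and a href (opening and closing)."""
    return UNAPPROVED_TAG.sub("", html)


sample = (
    '<p><b>Hi</b> <i>there</i>, '
    '<a href="http://example.com">link</a>'
    '<script>alert(1)</script><br></p>'
)
cleaned = strip_unapproved_tags(sample)
print(cleaned)
# <b>Hi</b> <i>there</i>, <a href="http://example.com">link</a>alert(1)
```

Only the tags are removed, so the text that was inside `<script>` stays behind as plain text. The sketch also keeps whatever follows `href=` untouched, so it does nothing to validate the link itself.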

Doing the right thing with the string inside the "<a href=...>" is left as an exercise to the reader.