Regex puzzle

I was reading our newsgroups and came across a post where the user wanted to filter out all tags from HTML text except for <br>, </br>, <p>, and </p>.

What is the shortest .NET regex to do that?

Comments (13)

  1. haacked says:

    Assuming that the allowed tags may not have attributes, something like this:

    Search for this:


    Replace with "".

    If the allowed tags may have attributes, I’ll get back to you.

    (note, I haven’t tested this).
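The pattern from this comment did not survive in the page, so the following is only an illustrative sketch of the negative-lookahead idea for the no-attributes case, not the commenter's regex. It is written in Python rather than .NET; the lookahead syntax used here is the same in both engines. The names `DISALLOWED_TAG` and `strip_tags` are made up for the example.

```python
import re

# Hypothetical pattern: match any tag EXCEPT the bare forms
# <br>, </br>, <p>, </p>. The negative lookahead rejects those four;
# [^>]* then consumes everything up to the next '>'.
DISALLOWED_TAG = re.compile(r"<(?!/?(?:br|p)>)[^>]*>", re.IGNORECASE)


def strip_tags(html: str) -> str:
    """Delete every tag that is not one of the four allowed bare tags."""
    return DISALLOWED_TAG.sub("", html)


print(strip_tags("<div><p>Hi<br>there</p><b>!</b></div>"))
```

Because the lookahead requires `>` right after the tag name, `<pre>` and `<p class="x">` are both removed under this sketch.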

  2. haacked says:

    Whoops. My solution was too simple. For example, it would not strip out this properly:

    <span title=">">

    This is better (but much harder to understand).


  3. haacked says:

    Sorry again. I forgot that you can’t edit your posts on these things. Eric, could you delete my repost? I should also point out that the "w+" portion should be "w*".
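The improved pattern is also missing from the page, so here is a hedged Python sketch of the problem described above: a tag matcher that stops at the first `>` mishandles `<span title=">">`, while one that skips quoted attribute values as units does not. Both patterns are assumptions written for this illustration, not the commenter's originals.

```python
import re

# Naive: stops at the first '>' even when it sits inside a quoted value.
NAIVE_TAG = re.compile(r"<(?!/?(?:br|p)>)[^>]*>", re.IGNORECASE)

# Quote-aware: inside the tag, consume a double-quoted string, a
# single-quoted string, or any single character that is not a quote or '>'.
QUOTE_AWARE_TAG = re.compile(
    r"""<(?!/?(?:br|p)>)(?:"[^"]*"|'[^']*'|[^'">])*>""",
    re.IGNORECASE,
)

sample = '<span title=">">text</span>'
naive_result = NAIVE_TAG.sub("", sample)              # leaves '">' behind
quote_aware_result = QUOTE_AWARE_TAG.sub("", sample)  # removes the whole tag

print(repr(naive_result))
print(repr(quote_aware_result))
```

The three alternatives inside the quote-aware group each start with a different character class, so the engine never has two ways to match the same text.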

  4. Ricky Dhatt says:

    Simple? There is no such thing. In my experience doing this, there are too many things to deal with, like entities (&nbsp;) and XHTML (<br/>). I just use a full-fledged parser nowadays, but I have the luxury of not worrying about the overhead.
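As a rough illustration of the parser approach, here is a minimal sketch using Python's standard `html.parser` module (the comment does not say which parser was used). The class name `TagFilter`, the helper `filter_html`, and the choice to drop attributes on kept tags are all assumptions made for the example.

```python
from html.parser import HTMLParser

ALLOWED = {"p", "br"}  # assumption: only these tag names survive


class TagFilter(HTMLParser):
    """Re-emit text, entities, and allowed tags; drop every other tag."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; untouched
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED:
            self.out.append(f"<{tag}>")  # attributes are discarded

    def handle_startendtag(self, tag, attrs):
        if tag in ALLOWED:
            self.out.append(f"<{tag} />")  # XHTML-style <br/>

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def filter_html(html: str) -> str:
    parser = TagFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(filter_html('<div class="x"><p>a &amp; b<br/>c</p><span title=">">d</span></div>'))
```

The parser takes care of quoted attribute values and self-closing tags, which is exactly the bookkeeping the regex attempts struggle with.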

  5. Take a look at Html Agility Pack:

    There’s a couple of examples at the bottom of that page which demonstrate the syntax for using it.

  6. Jake says:

    Would it be easier to load the HTML into an XML reader and parse it that way? The only problem is what it would do with the content (I already know it would handle the tags properly).
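One caveat with the XML route is that a strict XML parser rejects markup that is valid HTML but not well-formed XML. The sketch below uses Python's `xml.etree.ElementTree` as a stand-in for a .NET XML reader, which is an assumption; `parses_as_xml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml("<p>one<br/>two</p>"))  # well-formed XHTML
print(parses_as_xml("<p>one<br>two</p>"))   # unclosed <br>
print(parses_as_xml("<p>a&nbsp;b</p>"))     # entity not defined in plain XML
```

This is why a tidy-up step that converts HTML to XHTML usually has to come first.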


  7. Paschal says:

    Eric, the user you’re talking about isn’t me by any chance? 🙂 I asked this question a few months ago and I’m still searching for a solution. It really has me puzzled.

    I need it to clean some HTML documents, but I want to keep the line breaks and paragraphs.

    Most of the regexes I saw stripped all the tags.

  8. Raymond Chen says:

    A wise man once said,

    Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

    I think this is one of the cases where trying a pure regex solution creates its own problems.

  9. Eric TF Bat says:

    Step one: look for all character #255s (if any) and double them: #255#255 (you’ll see why in a sec).

    Step two: look for all <p>, </p>, <br> and <br/> (note typo in original question; </br> is meaningless!) and replace with #255p, #255/p and #255br. Use regexps to allow for spaces between the br and the /, if you like.

    Step three: convert all & to &amp;

    Step four: convert all < and > to &lt; and &gt;

    Step five: convert all #255p to <p>, #255/p to </p> and #255br to <br /> (note space: required for old, dumb browsers)

    Step six: convert all remaining #255#255 to #255, if you care.

    Don’t use regexps to handle HTML. Raymond is right; that way lies insanity.

    Incidentally, it’s impolite to randomly delete people’s html, which is why I convert the < and > instead of rudely deleting it and giving them a nasty surprise. I just hope this blog’s commenting system doesn’t delete them, cos this message will be (even more) incomprehensible…
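The six steps above can be sketched in Python as follows. The function name `escape_except_p_br` is hypothetical. One deviation, made deliberately: steps five and six are done in a single left-to-right pass, because replacing "#255p" on its own could wrongly pair the second half of a doubled #255 with a following "p".

```python
import re

MARK = "\xff"  # character #255, used as a temporary marker


def escape_except_p_br(html: str) -> str:
    # Step 1: double any marker characters already in the text.
    text = html.replace(MARK, MARK + MARK)
    # Step 2: swap the allowed tags for marker tokens.
    text = re.sub(r"<p>", MARK + "p", text, flags=re.IGNORECASE)
    text = re.sub(r"</p>", MARK + "/p", text, flags=re.IGNORECASE)
    text = re.sub(r"<br\s*/?>", MARK + "br", text, flags=re.IGNORECASE)
    # Step 3: escape ampersands.
    text = text.replace("&", "&amp;")
    # Step 4: escape every remaining angle bracket.
    text = text.replace("<", "&lt;").replace(">", "&gt;")
    # Steps 5 and 6 in one pass: restore tags and un-double the markers.
    restore = {MARK: MARK, "p": "<p>", "/p": "</p>", "br": "<br />"}
    return re.sub(
        MARK + "(" + MARK + "|/p|br|p)",
        lambda m: restore[m.group(1)],
        text,
    )


print(escape_except_p_br("<p>Tom & <b>Jerry</b><br>ok</p>"))
```

Nothing is deleted: disallowed tags come out as visible `&lt;b&gt;` text, which matches the point about not silently removing people's HTML.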

  10. In the bag tonight: Less bitch’n and whin’n. Counts:Blogging: 8; Dev: 22; Otherwise: 8; SQL: 5; WILY: 8. Line of the night:

  11. AvonWyss says:

    What’s wrong with this?


    (I’d suggest using this regex with the ExplicitCapture, IgnoreCase and Singleline options enabled.)

    For some reason that I don’t know, many neat features of the .NET regex engine (not exclusive to it, though) are rarely used, like assertion groups, backreferences, named groups, Unicode char groups, and match evaluators for replacement patterns…

    Anyways, to specifically clean out HTML code, I’d also rather use some HTML to XHTML converter (like HTML Tidy) and then use some code that works on the XML, or maybe just some XSLT, to get the wanted result.
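The regex from this comment is missing from the page, so the sketch below only illustrates two of the features it mentions: a named group and a replacement callback, which is Python's rough counterpart to a .NET match evaluator. The pattern and the names `TAG`, `keep_or_drop`, and `filter_tags` are assumptions for the example. Python has no equivalent of ExplicitCapture, so that option is not shown.

```python
import re

ALLOWED = {"p", "br"}  # assumption: tag names to keep

# (?P<name>...) is a named group holding the tag name. Quoted attribute
# values are consumed as units so a '>' inside them does not end the tag.
TAG = re.compile(
    r"""</?(?P<name>[a-z][a-z0-9]*)(?:"[^"]*"|'[^']*'|[^'">])*>""",
    re.IGNORECASE,
)


def keep_or_drop(match: re.Match) -> str:
    """Replacement callback: keep allowed tags verbatim, drop the rest."""
    name = match.group("name").lower()
    return match.group(0) if name in ALLOWED else ""


def filter_tags(html: str) -> str:
    return TAG.sub(keep_or_drop, html)


print(filter_tags('<DIV><P class="a">x<br/>y</P><span title=">">z</span></DIV>'))
```

Moving the keep-or-drop decision into a callback keeps the pattern itself simple, and in this version allowed tags keep their attributes.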

  12. Lost_In_JavaScript_Land says:

    This works for me (in ASP/VBScript). It keeps <p>, </p>, and <br> (upper and lower case). It also compensates for attributes.

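    Function FilterHTML(tempStr)
        Dim re, tempStr2
        Set re = New RegExp
        re.IgnoreCase = True
        re.Global = True   ' replace every match, not only the first

        ' Pass 1: remove whole elements (tags and contents) other than P and BR.
        ' \1 refers back to the captured tag name.
        re.Pattern = "<((?!(?:P|BR)\b)\w+)[^>]*>.*?</\1>"
        tempStr2 = re.Replace(tempStr, "")

        ' Pass 2: remove any leftover tag that is not P or BR (opening or closing).
        re.Pattern = "<(?!/?(?:P|BR)\b)[^>]*>"
        FilterHTML = re.Replace(tempStr2, "")

        Set re = Nothing
    End Function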

  13. Lost_In_JavaScript_Land says:

    Stupid me…just realized this is a C# blog…but converting the code shouldn’t be too nasty 😉