.NET Html Agility Pack: How to use malformed HTML just like it was well-formed XML…



!! Update 06/08/18 !! Html Agility Pack has a new home on CodePlex! Available here. CodePlex is great 🙂


!! Update 05/05/05 !! Visual Studio 2005 Beta2 version is available here


!! Update 05/23/05 !! This blog will be discontinued. A new blog where comments will be available has been created here.


Here is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT. It is an assembly that allows you to parse “out of the web” HTML files. The parser is very tolerant of “real world” malformed HTML. The object model is very similar to what System.Xml offers, but for HTML documents (or streams).


Sample applications:
* Page fixing or generation. You can fix a page the way you want, modify the DOM, add nodes, copy nodes, you name it.
* Web scanners. You can easily get to img/src or a/href values with a handful of XPath queries.
* Web scrapers. You can easily scrape any existing web page into an RSS feed, for example, with just an XSLT file serving as the binding. An example of this is provided.


There is no dependency on anything other than .NET’s XPath implementation. There is no dependency on Internet Explorer’s DLLs, on Tidy, or on anything like that. There is also no adherence to XHTML or XML, although you can actually produce XML using the tool.


For example, here is how you would fix all hrefs in an HTML file:

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[@href]"))
{
   HtmlAttribute att = link["href"];
   att.Value = FixLink(att);
}
doc.Save("file.htm");

You can download it here (link updated 12/12/04), full source code and documentation included!

Comments (95)

  1. David Stone says:

    Thanks! Definitely going into my toolkit!

  2. Robert Cannon says:

    I have run across an issue with HtmlAgilityPack. I am trying to scrape a site that has some HTML added to the end of the document by the ISP that is hosting the site.

    It is something like this:

    <HTML>



    </HTML>

    <!-- text below generated by server. PLEASE REMOVE --><!-- Counter/Statistics data collection code --><!-- JS Banner blocked -->

    <script>



    </script>

    HtmlAgilityPack will parse this and then wrap the whole thing in a <span> to give the document a single root, which is the <span> node rather than the <HTML> node.

    Is there an option to either 1) ignore the extra markup, or 2) force the extra markup into the <HTML> node?

  3. Simon Mourier says:

    I think it does so because you are setting OptionOutputAsXml to True. In XML, you need a root node without siblings. HtmlAgilityPack creates this fake root node to build valid XML. Just don’t use this OptionOutputAsXml.

    Does this answer your question and solve your problem?

  4. Robert Cannon says:

    Yes, I have OptionOutputAsXml set to true. I am trying to produce an XHTML file so that I can apply an XSLT to and get an RSS feed. I guess I could make my XSLT to expect the root node to be a <span> instead of an <html>, but it just seems wrong. I was looking for alternatives.

    And I have the source code, so I can tackle it myself, but I just wanted to see if there was already a workaround.

  5. Simon Mourier says:

    You do not need to produce an XHTML file to apply an XSLT to the document (and you should not). The HtmlDocument class supports IXPathNavigable natively for this kind of purpose, so you can just do:

    HtmlDocument doc = …

    XslTransform xslt = new XslTransform();

    xslt.Load("myXslt.xsl");

    xslt.Transform(doc, null, writer);

  6. Rossella says:

    I downloaded the Html Agility Pack source and compiled it, but it is not executable. When I click "Go" in the "Debug" menu, a dialog window appears. How do I run this project?

    How do I point it at a page to parse?

    Thank you.

    E-mail: billarosa@hotmail.com

  7. Eric Newton says:

    There’s another, the SgmlReader, which is a more structured approach to taming the HTML beast for scraping processes.

    http://www.gotdotnet.com/Community/UserSamples/Details.aspx?SampleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC

  8. Absolutely, but as you say, it uses a more structured approach, and thus modifies "real world" html, which I think is a big problem for many scenarios.

    Do this test:

    1) go to http://www.microsoft.com, do a view source and save the file as mshome.htm (don’t bother with images, .js and all satellite files)

    2) run commandlinesgmlreader.exe mshome.htm mshome2.htm

    3) open an IE on mshome.htm and another on mshome2.htm and you will see they are not rendered the same (fonts, tables, etc…)

    HtmlAgilityPack does not change original html, even if it’s malformed.

    Simon.

  9. Mike says:

    Awesome, the SgmlReader was great, but this is even better! Way to code up the right tool!

  10. Taylor Monacelli says:

    I’m curious what the difference is between Html Agility Pack and MSHTML. I’m assuming that the Agility Pack was written to fix the problems in MSHTML. Is this true? If not, then what does the Agility Pack have to offer that MSHTML doesn’t?

  11. They are quite different libraries, not really comparable in my opinion.

    MSHTML is a COM dll, not a .NET assembly (although you can interop with it), with everything that implies in terms of deployment.

    MSHTML has many, many dependencies on other DLLs, while Html Agility Pack has absolutely none (in either technical terms or standard ISO terms). MSHTML is client-side oriented, has a lot to do with UI, and is therefore not suited (at all) for server-side operations. It is also somewhat strict about HTML code, while Html Agility Pack really is not. This is very useful when you’re dealing with real-world HTML (read: buggy HTML).

    Html Agility Pack’s purpose is less ambitious: it basically just parses an HTML fragment (file or stream), builds a DOM out of it, and allows you to modify it and save it back. It has, however, a killer feature that MSHTML does not have: support for XPath and XSL transforms on plain old buggy malformed HTML code…

    Hope this clarifies.

  12. Sudhir Ramdasi says:

    This is just great! It serves my purpose.

    Thanks a lot.

    -Sudhir

  13. Yanhao Zhu says:

    What a wonderful tool! Thanks a lot!

  14. Crumpy says:

    I’m kind of new to XHTML and XSLT. I wonder if this tool can help me with a dilemma, though. I don’t want to use MSHTML and COM interop with my already speedy C# app that crawls and extracts information from the web (the performance would be affected too greatly). I’ve recently run into problems, though, trying to follow links that call JavaScript to build dynamic links or set hidden variables before submitting a form, etc.

    I want to be able to somehow convert those references into "synthetic hyperlinks" similar to the process described in IBM’s description found at http://www10.org/cdrom/papers/102/ . They seem to be using XHTML and XSLT to do this somehow. Can I somehow execute scripts found in the XHTML using XSLT? By some other means? I’m really at a loss here.

    Thanks for the awesome tool by the way!

  15. Charlie says:

    Wow, hard to believe it took so long for somebody to write and give away such an awesome tool.

    Thanks!

  16. The Html Agility Pack allows you to use XSLT on the HTML documents it loads. Note, however, that it does not rely on the XHTML format at all. HTML documents do not need to conform to anything but HTML "as we know it in the real world" 🙂

    So, yes, I believe you can use the method described http://www10.org/cdrom/papers/102/ to determine dynamic hyperlinks.

  17. Crumpy says:

    Simon, I can append and prepend new HtmlNodes into the HtmlDocument just as I do using XmlElements in an XmlDocument. However, when I save the modified HtmlDocument object, there are no line breaks between the new HtmlNodes I inserted. If I enter 10 new nodes, they’re all on the same line in the saved document.

    Just thought I’d let you know. I’ll browse the code.

  18. Hi crumpy, this is all by design 🙂 Nothing is inserted automatically by the Html Agility Pack.

    Here is a code snippet that shows you how to insert a line break before and after a node (there are many ways to do it actually…)

    static void Main(string[] args)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.DocumentNode.AppendChild(HtmlNode.CreateNode("<html></html>"));
        HtmlNode bodyNode = doc.CreateElement("body");
        doc.DocumentNode.FirstChild.AppendChild(bodyNode);
        AddOuterLineBreaks(bodyNode);
        doc.Save(Console.Out);
    }

    // Inserts a text node containing a line break before and after the given node.
    static void AddOuterLineBreaks(HtmlNode node)
    {
        if (node.ParentNode == null)
            return;
        node.ParentNode.InsertBefore(HtmlNode.CreateNode("\r\n"), node);
        node.ParentNode.InsertAfter(HtmlNode.CreateNode("\r\n"), node);
    }

    Simon.

  19. Crumpy says:

    Sorry to keep bothering you Simon, thanks for your help.

    This is the first time I’ve used XPath to navigate anything and I’m using it to navigate HtmlNodes using the HtmlNode.SelectNodes() function.

    I’m having a problem with the current context, for example. I’ve created and filled an HtmlDocument which contains forms. I then obtain a HtmlNodeCollection of the form nodes, then for each form node I attempt to obtain a collection of input nodes that are a descendent of that form node:

    HtmlNodeCollection forms = doc.DocumentNode.SelectNodes("//form");

    foreach( HtmlNode formNode in forms )

    {

    HtmlNodeCollection inputControls = formNode.SelectNodes(".//input");

    foreach( HtmlNode inputControl in inputControls )

    {



    }

    }

    The XPath expression ".//input" should return an HtmlNodeCollection containing any input nodes within the form (the ‘.’ specifying the current context, or the current selected node – from what I understand). But I always get back null.

    If I change the expression to "//input" (which should return all input nodes, beginning the search from the root node of the document), it returns all of the input nodes found in the document (which is correct).

    However, I specifically need just the input nodes within the current form node.

    What am I doing wrong?

    I’ve been testing this against https://recruitmax.alltel.com/recruitmax/candidates/jobopps.cfm which happens to have 2 forms.

    Thanks again!

  20. Hi Crumpy, you really are the "out of luck" guy 🙂 let me explain why. The <form> element deserves, by default, a special treatment by Html Agility Pack: it can overlap. It means you can have HTML like this: <form><b></form></b>, and Html Agility Pack will not report any error and will save it just like that. But it is more a trick than anything else because the <form> node in the DOM does not contain any node, it is declared as empty, and the </form> is declared as a text node with a value of "</form>"… This is why you find nothing inside the <form> element.

    You can change the parsing behavior of the Html Agility Pack using the HtmlNode static property called ElementsFlags: just add the following code before you parse your text:

    HtmlNode.ElementsFlags.Remove("form");

    and you should see the <input> elements inside the <form> elements, just like you expected. Note, however, that <form> elements will not be able to overlap any more if you do this. Without adding this code, you could also write a more complex XPath expression that finds the inputs among the siblings of the form node.

    Simon.
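
    A minimal sketch of that workaround, assuming a recent HtmlAgilityPack build in which the static property is named ElementsFlags (as in the source line quoted in comment 23). The FormInputs class and CountInputsInFirstForm method are made-up names for illustration only, and the null check is there because SelectNodes returns null rather than an empty collection when nothing matches:

```csharp
using HtmlAgilityPack;

public static class FormInputs
{
    // Hypothetical helper: counts the <input> elements nested in the first <form>.
    public static int CountInputsInFirstForm(string html)
    {
        // Drop the special-case flags for <form> so it is parsed as a normal container.
        // This is a static, process-wide setting and must run before parsing.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode form = doc.DocumentNode.SelectSingleNode("//form");
        if (form == null)
            return 0;

        // ".//input" is evaluated relative to the form node.
        HtmlNodeCollection inputs = form.SelectNodes(".//input");
        return inputs == null ? 0 : inputs.Count;
    }
}
```

    Because the flag table is shared by every document parsed in the process, it is probably safest to change it once at startup rather than per request.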

  21. m_prog says:

    I ran the HtmlToRss example and got the error "File was not found at cache path…". The cache does not contain the required file. How do I deal with this? Which file should be there, and where do I specify its name?

  22. In html2rss.cs, you find this:

    // set the following to true, if you don’t want to use the Internet at all and if you are sure something is available in the cache (for testing purposes for example).

    hw.CacheOnly = true;

    It means we really look for a file in the cache directory. if it’s not there, an exception is thrown.

    Just set CacheOnly to false (at least the 1st time you run html2rss.exe) and recompile.

    Simon.

  23. Gerard Kappen says:

    Hi Simon, there is a slight mistake in the HtmlNode constructor, where a "form" tag gets an HtmlElementFlag.Empty. So instead of my inputs being children of the form, they are parsed as being siblings. The fix is easy, of course: just change <code>ElementsFlags.Add("form", HtmlElementFlag.CanOverlap | HtmlElementFlag.Empty);</code> to <code>ElementsFlags.Add("form", HtmlElementFlag.CanOverlap);</code>

  24. Charlie says:

    Hey Simon,

    I’m wondering if you’ve thought about creating a slimmed down version of this toolkit. Maybe making the dom forward only, or being able to conditionally turn off some of the internal variables like _line and _lineposition.

    Anyway, just a thought for future improvements to this great toolkit.

    -Charlie

  25. Duh, I have not thought about that yet… Maybe if it becomes a commercial package 🙂 I have too much work to do right now.

    BTW, I am not sure line and lineposition are the ones that really eat memory. Strings (names, values, …) are probably more important in that area, even though they are only lazily allocated (only when requested).

    I suppose if I had time to rewrite it, I would probably focus on string handling. For example, use NameTable and compare references rather than values (just like the Xml parser does with XmlNameTable).

    Simon.

  26. monosodiumg says:

    In the chm, the description is:

    Gets or Sets the text between the start and end tags of the object.

    The declaration on that page is:

    public virtual string InnerText {get;}

    The observed behaviour is as per declaration.

    Why is InnerText not settable?

  27. Sam V says:

    Hi there, just a simple question. I apologize if I’m being a little dense, but how could you strip all the comment nodes? I tried to go over all nodes and look for a node of type ‘comment’, but I can’t seem to be able to do this.

    Thanks!

    -Sam

  28. This is a sample code to remove comments:

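    static void Main(string[] args)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load("filewithcomments.htm");
        doc.Save(Console.Out); // show before
        RemoveComments(doc.DocumentNode);
        doc.Save(Console.Out); // show after
    }

    // Recursively removes every comment node below (and including) the given node.
    static void RemoveComments(HtmlNode node)
    {
        if (node.NodeType == HtmlNodeType.Comment)
        {
            node.ParentNode.RemoveChild(node);
            return;
        }

        if (!node.HasChildNodes)
            return;

        // Walk the children backwards: removing a child while enumerating
        // the collection forwards would skip nodes or invalidate the enumerator.
        for (int i = node.ChildNodes.Count - 1; i >= 0; i--)
        {
            RemoveComments(node.ChildNodes[i]);
        }
    }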

  29. You cannot set InnerText, by design, because it is computed; the doc is wrong, as you noticed.

    You can set InnerHtml.

    Simon.
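
    A small sketch of the difference, assuming a current HtmlAgilityPack release; the InnerHtmlDemo class and ReplaceDivContent method are illustrative names, not part of the library. The markup is assigned through InnerHtml and the computed text is then read back through InnerText:

```csharp
using HtmlAgilityPack;

public static class InnerHtmlDemo
{
    // Hypothetical helper: replaces the content of the first <div>
    // and returns the text that InnerText computes afterwards.
    public static string ReplaceDivContent(string html, string newInnerHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div");
        if (div == null)
            return null;

        // InnerHtml is settable: the string is parsed and replaces the children.
        div.InnerHtml = newInnerHtml;

        // InnerText is read-only and is derived from the child nodes.
        return div.InnerText;
    }
}
```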

  30. Sam V says:

    I seem to have answered my own question!

    Here is source, in case anyone wants it.

    Thanks

    -Sam

    Dim myNodes As HtmlAgilityPack.HtmlNodeCollection = myDoc.DocumentNode.SelectNodes("//comment()")

    Dim node As HtmlAgilityPack.HtmlNode

    For Each node In myNodes

    Console.Write(node.NodeType)

    If node.NodeType = HtmlAgilityPack.HtmlNodeType.Comment Then

    node.ParentNode.RemoveChild(node)

    End If

    Next

  31. Mark says:

    Hey Simon,

    Just thought I’d pass on a tweak I made in case you or anyone else thought it was a useful mod.

    I added the following to the HtmlNode class so as I’m doing whatever to the nodes I find, I can optionally hang any object off the nodes for re-use later.

    -Mark

    ——————-

    internal object _externalobject = null;

    /// <summary>

    /// Gets or Sets the external object associated with the node.

    /// </summary>

    public object ExternalObject {

    get {

    return _externalobject;

    }

    set {

    _externalobject = value;

    }

    }

  32. benles@bldigital.com says:

    It calls HtmlEncode on Html text, thus encoding twice, producing output like

    &amp;nbsp;

  33. Mark says:

    Hi…

    I just started playing with HtmlAgility today and I noticed a couple of odd things – most significant was with the results of some xpath queries.

    I was using the XPath query "//base/@href" (i.e. intending to select an attribute value from the <base> tag if found). What I got back was an odd HtmlNodeNavigator that had LocalName set to "href" and Name set to "base" (i.e. kind of an odd mashing of the parent node with the attribute node). When I get .Current, I get the parent <base> node.

    I don’t know how easy it would be, but perhaps HtmlAttribute could be recoded to derive from HtmlNode? Seems like it would be easier to emulate XML behavior if they were interchangeable…

    Thanks

    -mark

  34. Hi Mark. You are absolutely right. This is a design error: you cannot use attributes in path selection. You can still use them in filters though, like //base[@href]. Fixing it would require changing the HtmlNodeNavigator.cs file … and I have no time to do that right now 🙂
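
    One way to work within that limitation, sketched under the assumption of a current HtmlAgilityPack release: select the element with the attribute filter, then read the attribute from the node itself. BaseHref and GetBaseHref are made-up names for illustration:

```csharp
using HtmlAgilityPack;

public static class BaseHref
{
    // Hypothetical helper: returns the href of the first <base> element
    // that has one, or an empty string if there is none.
    public static string GetBaseHref(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The attribute appears only in the filter, never as the selected item.
        HtmlNode baseNode = doc.DocumentNode.SelectSingleNode("//base[@href]");
        if (baseNode == null)
            return string.Empty;

        // Read the value from the element instead of selecting "@href" directly.
        return baseNode.GetAttributeValue("href", string.Empty);
    }
}
```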

  35. Ian says:

    Simon, thanks for this great utility.

    I haven’t seen a way to POST data to a site and create a document, am I missing something?

    Also is that a typo in the download link or is it deliberate?

    Ian

  36. If you talk about the HtmlWeb class, you can pass a method (POST or anything) to the LoadUrl. You can also hook the HttpRequest that will be used if you connect to the PreRequest event. You can tweak the method (or anything else) here as well.

    Simon.

  37. Ian says:

    Ahh I see the light! I saw the method arguments but not the event handlers. Thank you. Ian.

  38. Mark says:

    Hey Simon,

    I ran into an issue with HTML comments today. I’m trying to insert an HTML comment in the document and it is requiring me to put the "<!--" and "-->" wrappers on the value I set for the HtmlCommentNode. Debugging through the code, it appears that the nodes generated by the parse routine incorrectly include those wrapper tags in the value of the node… causing node.OuterHtml to return

    "<!-- <!-- value --> -->" and node.InnerHtml to return "<!-- value -->".

    -Mark

  39. Hi Mark.

    Can you show a sample of your code?

    Simon.

  40. magnum says:

    Based on the examples for removing tags with Html Agility Pack, how can I remove tags like:

    <o:p>

    I am trying RemoveTag(doc, "o\\:p"); but it throws "System.Xml.XPath.XPathException:"

    private static void RemoveTag(HtmlDocument doc, string tags)

    {

    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//" + tags);

    if (nodes == null)

    return;

    foreach(HtmlNode node in nodes){

    if (node.ParentNode != null)

    node.ParentNode.RemoveChild(node);

    }

    }

  41. Hi.

    Unfortunately, the support for namespaces is limited in the Html Agility Pack. It does not really know what a namespace is and understands names (prefix ‘:’ localname) as a whole. I agree this is quite confusing 🙂 but most of the time, you can work around it. In your case, this is how you would do it.

    HtmlNodeCollection coll = doc.DocumentNode.SelectNodes("//*[name()='o:p']");

    // SelectNodes returns null when nothing matches.
    if (coll != null)
    {
        foreach(HtmlNode node in coll)
        {
            node.ParentNode.RemoveChild(node);
        }
    }

    Simon.

  42. raoul ellias says:

    My parsed file ends up having attributes like nowrap set to nowrap="" or checked set to checked="". Is there something I’m missing?

    Thanks a lot, and this is an awesome tool.

  43. Why is this an issue? (originally this was due to plans for XHTML compatibility, and it also helps for XML output)

    Browsers should not choke on it?

    Simon.

  44. raoul says:

    Thanks, yes, the browser does not choke; I’m just a little anal.

    I’ve been using this on a huge project to insert .NET IDs and validators for input fields based on database types. It’s a lot of HTML, and I estimate I’m saving at least half an hour per page.

    Thanks!!

  45. Markus says:

    Hi,

    it seems your sample at the beginning doesn’t work anymore:

    HtmlDocument doc = new HtmlDocument();

    doc.Load("file.htm");

    foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[@href]"))

    {

    HtmlAttribute att = link["href"];

    att.Value = FixLink(att);

    }

    doc.Save("file.htm");

    1. DocumentElement has been replaced by DocumentNode.

    2. HtmlNode is not indexable anymore (link["href"] won’t work).

    3. I tried HtmlNode.GetAttributeValue() and HtmlNode.SetAttributeValue(), but after saving the document with doc.Save() there weren’t any changes.

    Here is my Code i used:

    HtmlDocument doc = hw.Load ("f1.htm");

    HtmlNode hn = doc.DocumentNode.SelectSingleNode ("//body");

    hn.SetAttributeValue ("new","value");

    doc.Save ( "f2.htm");

    Please correct me if I’m wrong.

    greetings

    Markus

  46. Hi Markus, you’re absolutely right. The sample (which was meant for illustration purpose only) is wrong, and it has always been. You’re the first one to really try it I suppose 🙂

    The samples in the .zip file are hopefully ok, though.

    Simon.
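
    For reference, a sketch of how the same loop might look against later releases, assuming DocumentNode, GetAttributeValue and SetAttributeValue behave as in current HtmlAgilityPack builds. FixLink is not defined anywhere in the post, so the version here (it only trims whitespace) is a stand-in, and LinkFixer and FixAllLinks are illustrative names:

```csharp
using HtmlAgilityPack;

public static class LinkFixer
{
    // Stand-in for the undefined FixLink from the post: trims whitespace only.
    public static string FixLink(string href)
    {
        return href.Trim();
    }

    // Hypothetical helper: rewrites every href in the markup and returns the result.
    public static string FixAllLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                string href = link.GetAttributeValue("href", string.Empty);
                link.SetAttributeValue("href", FixLink(href));
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

    To work on files as in the original sample, doc.Load("file.htm") and doc.Save("file.htm") can take the place of LoadHtml and OuterHtml.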

  47. Salman says:

    Very nifty tool. Thanks!

  48. Bobstar says:

    Hello…

    I’m getting an error when loading the solution. It’s missing:

    ..\HtmlDomView\HtmlDomView.csproj

    and

    Samples\GetBinaryRemainder\GetBinaryRemainder.csproj

    It appears they are not in the zip file :O(

    Please help

  49. Mike's Blog says:

    Processing loosely-defined text must rank as the one of

    the worst kinds of pro

  50. Processing loosely-defined text must rank as one of

    the worst kinds of programming tasks. HTML and CSV parsing

    are about as much fun as cleaning the toilet in a bus

    station—who knows what you’re going to find.

  52. I’ve seen this around before, and this post was from June 2003, but it is worth mentioning again!

  53. zc0000 says:

    Avoid (403) Forbidden errors when using HttpWebRequest I had an error when tried to open the page http

  54. 过世许久 says:

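    Reposted from: http://www.cnblogs.com/dragon/archive/2005/06/15/174946.html

    Sample download

    A friend asked me about the following problem, which requires implementing the following functionality:

    1.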