HTML Agility Pack


I’ve seen this around before, and this post was from June 2003, but it is worth mentioning again!

.NET Html Agility Pack: How to use malformed HTML just like it was well-formed XML…
Here is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT. It is an assembly that allows you to parse “out of the web” HTML files. The parser is very tolerant of “real world” malformed HTML. The object model is very similar to what [is provided by] System.Xml, but for HTML documents (or streams).

Sample applications:

  • Page fixing or generation. You can fix a page the way you want, modify the DOM, add nodes, copy nodes, you name it.
  • Web scanners. You can easily get to img/src or a/href values with a handful of XPath queries.
  • Web scrapers. You can easily scrape any existing web page into an RSS feed, for example, with just an XSLT file serving as the binding. An example of this is provided.
  • There is no dependency on anything other than .NET’s XPath implementation. There is no dependency on Internet Explorer’s DLLs, Tidy, or anything like that. There is also no adherence to XHTML or XML, although you can actually produce XML using the tool.
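
To make the "web scanner" case concrete, here is a minimal sketch of pulling a/href and img/src values out of sloppy markup. It assumes the HtmlAgilityPack assembly is referenced; the class name LinkScanner and the sample markup are made up for illustration and are not part of the library or the original post.

```csharp
using System;
using HtmlAgilityPack; // assumption: the Html Agility Pack assembly is referenced

// Hypothetical example class, not part of the library.
class LinkScanner
{
    static void Main()
    {
        // Deliberately sloppy markup: unquoted attribute values,
        // unclosed <p> and <a> elements, no closing </html>.
        string html = "<html><body><p>See <a href=page1.html>one<p>"
                    + "<img src=\"logo.png\"><a href='page2.html'>two</a></body>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        PrintAttribute(doc, "//a[@href]", "href");
        PrintAttribute(doc, "//img[@src]", "src");
    }

    // Runs an XPath query and prints one attribute from every matching node.
    static void PrintAttribute(HtmlDocument doc, string xpath, string attribute)
    {
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return;
        }

        foreach (HtmlNode node in nodes)
        {
            Console.WriteLine(attribute + ": " + node.GetAttributeValue(attribute, ""));
        }
    }
}
```

The null check matters in practice: a query with no matches is the usual source of a NullReferenceException when looping over the result directly.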

    Comments (8)

    1. Right now my first idea is to use this package to build valid RSS feeds based on potentially malformed HTML documents… I have to give it a try.

    2. Laslo Tallodi says:

      Hi,

      Your tool is very useful.

      I found a malformed html that failed to parse:

      <html><body><form href="something"><input type ="button" value="something" />wrong text</form></body> </html>

      Result was:

      <html><body><form href="something"/><input type ="button" value="something" />wrong text&lt;/form&gt;</body> </html>

      Is there an option to avoid this error?

      Thanks.

      tallodi@chello.hu

    3. First of all, if it does this, it’s because you are outputting as XML (you have set HtmlDocument’s OptionOutputAsXml property to true). If you do not output as XML, it will not do this.

      Now, why is it doing this anyway? It’s because internally, the <FORM> tag is declared as "OK to overlap". Most of the time, FORM is actually overlapped in real-world HTML, especially with ASP.NET pages, where you can have only one FORM per page.

      So, in terms of DOM, HAP will create a closed and empty FORM node, and a text node with "</FORM>" as the text.
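
      One way to check this on your own build is to dump the children of <body> after parsing. This is only a sketch: the class name FormOverlapDemo is made up, and the exact node list you get may differ between Html Agility Pack versions, so no expected output is shown.

      ```csharp
      using System;
      using HtmlAgilityPack; // assumption: the Html Agility Pack assembly is referenced

      // Hypothetical example class, not part of the library.
      class FormOverlapDemo
      {
          static void Main()
          {
              string html = "<html><body><form href=\"something\">"
                          + "<input type=\"button\" value=\"something\" />wrong text</form></body></html>";

              var doc = new HtmlDocument();
              doc.LoadHtml(html);

              HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
              if (body == null)
              {
                  return; // no <body> found; nothing to inspect
              }

              // Print each direct child of <body> with its node type, so you can see
              // whether the form wraps its content or is closed immediately.
              foreach (HtmlNode child in body.ChildNodes)
              {
                  Console.WriteLine(child.NodeType + ": " + child.OuterHtml);
              }
          }
      }
      ```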

      If you do not want this behavior, you can change any element’s behavior through the static HtmlNode.ElementsFlags collection, before loading the document, like this:

      string s = "<html><body><form href=\"something\"><input type =\"button\" value=\"something\" />wrong text</form></body> </html>";

      // Remove the special "OK to overlap" handling for <form>.
      // This is a static setting, so it affects every document parsed afterwards.
      HtmlNode.ElementsFlags.Remove("form");

      HtmlDocument doc = new HtmlDocument();
      doc.OptionOutputAsXml = true;
      doc.LoadHtml(s);
      doc.Save(Console.Out);

      Have a look at HtmlNode.cs, you will see how this is declared.

      Simon.

    4. Laslo Tallodi says:

      Thank you for your answer. Before, I used Tidy, and it could manage this kind of HTML, although your solution is more precise.

      Best regards,

      Laslo

    5. This should help with data mining malformed html documents like the problem this user has at http://www.visual-basic-data-mining.net/Forum/ShowPost.aspx?PostID=1466