.NET Html Agility Pack: How to use malformed HTML just like it was well-formed XML...

06/04/2003

!! Update 06/08/18 !! Html Agility Pack has a new home on CodePlex! Available here. CodePlex is great :)

!! Update 05/05/05 !! Visual Studio 2005 Beta2 version is available here

!! Update 05/23/05 !! This blog will be discontinued. A new blog where comments will be available has been created here.

Here is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT. It is an assembly that lets you parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml offers, but for HTML documents (or streams).

Sample applications:
* Page fixing or generation. You can fix a page the way you want, modify the DOM, add nodes, copy nodes, you name it.
* Web scanners. You can easily get to img/src or a/href values with a handful of XPath queries.
* Web scrapers. You can easily scrape any existing web page into an RSS feed, for example, with just an XSLT file serving as the binding. An example of this is provided.
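
As a rough sketch of the web scanner case above, the following collects every a/href and img/src value from a chunk of markup. It assumes the HtmlAgilityPack package as currently published on NuGet, where the document root is exposed as `DocumentNode` (the sample further down in this post uses `DocumentElement`). `LinkScanner`, `CollectUrls` and `AddValues` are hypothetical names made up for this illustration; they are not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper class, written only for this illustration.
public static class LinkScanner
{
    // Returns all a/@href values followed by all img/@src values.
    public static List<string> CollectUrls(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant parse: unclosed tags are accepted

        var urls = new List<string>();
        AddValues(doc, "//a[@href]", "href", urls);
        AddValues(doc, "//img[@src]", "src", urls);
        return urls;
    }

    private static void AddValues(HtmlDocument doc, string xpath,
                                  string attribute, List<string> urls)
    {
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null) return;

        foreach (HtmlNode node in nodes)
        {
            urls.Add(node.GetAttributeValue(attribute, string.Empty));
        }
    }
}
```

Calling `LinkScanner.CollectUrls(html)` on a page that has no links or images simply returns an empty list, because of the null check on `SelectNodes`.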

There is no dependency on anything other than .NET's XPath implementation. There is no dependency on Internet Explorer's DLLs, Tidy, or anything like that. There is also no adherence to XHTML or XML, although you can actually produce XML using the tool.

For example, here is how you would fix all hrefs in an HTML file:

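 HtmlDocument doc = new HtmlDocument();
 doc.Load("file.htm");

 // Select every <a> element that has an href attribute.
 foreach (HtmlNode link in doc.DocumentElement.SelectNodes("//a[@href]"))
 {
     HtmlAttribute att = link["href"];
     // FixLink is your own helper that returns the corrected URL.
     att.Value = FixLink(att);
 }

 doc.Save("file.htm");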

You can download it here (link updated 12/12/04), full source code and documentation included!
