Better HTML parsing and validation with HtmlAgilityPack

12/10/2006

Let's face it: sometimes the Microsoft.VisualStudio.TestTools.WebTesting.HtmlDocument class just doesn't cut it when you're writing custom extraction and validation code. HtmlDocument was originally designed as an internal class for efficiently parsing URLs of dependent requests (such as images) out of HTML response bodies. Before VS 2005 RTM, we made HtmlDocument part of the public WebTestFramework API, but scheduling and resource constraints prevented us from adding more general-purpose DOM features like InnerHtml, InnerText, and GetElementById. You could always parse the HTML string yourself, but fortunately there's a better option: HtmlAgilityPack.

HtmlAgilityPack is an open source project on CodePlex. It provides standard DOM APIs and XPath navigation -- even when the HTML is not well-formed!
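
As a rough sketch of what that looks like in practice, the console snippet below loads a fragment that is missing several closing tags and then queries it by id and by XPath. The markup string, the NavProbe class, and the FirstNavItemText helper are made up for illustration; the HtmlAgilityPack calls (LoadHtml, GetElementbyId, SelectSingleNode, InnerText) are the same ones the web test in this post relies on.

```csharp
using System;
using HtmlAgilityPack;

public static class NavProbe
{
    // Hypothetical helper: returns the text of the first <li> under the element
    // with the given id, or null when either lookup finds nothing.
    public static string FirstNavItemText(string html, string id)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // DOM-style lookup by id (note the lowercase "b" in GetElementbyId).
        HtmlNode container = doc.GetElementbyId(id);
        if (container == null)
        {
            return null;
        }

        // XPath lookup relative to the container node.
        HtmlNode firstItem = container.SelectSingleNode(".//li");
        return firstItem == null ? null : firstItem.InnerText.Trim();
    }

    public static void Main()
    {
        // Illustrative markup: the <ul>, <div>, <body>, and <html> are never closed.
        string html = "<html><body><div id=\"Nav\"><ul><li>Windows</li><li>Office</li>";
        Console.WriteLine(FirstNavItemText(html, "Nav") ?? "(not found)");
    }
}
```

Returning null instead of throwing keeps the helper easy to use from validation code, where a missing element usually means "the check failed" rather than "the test crashed".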

Here's a sample web test that uses the HtmlAgilityPack.HtmlDocument instead of the one in WebTestFramework. It simply validates that Microsoft's home page lists Windows as the first item in the navigation sidebar. Download HtmlAgilityPack and add a reference to it from your test project to try out this coded web test.

using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.VisualStudio.TestTools.WebTesting;
using HtmlAgilityPack;

public class WebTest1Coded : WebTest
{
    public override IEnumerator<WebTestRequest> GetRequestEnumerator()
    {
        WebTestRequest request1 = new WebTestRequest("https://www.microsoft.com/");
        request1.ValidateResponse += new EventHandler<ValidationEventArgs>(request1_ValidateResponse);
        yield return request1;
    }

    void request1_ValidateResponse(object sender, ValidationEventArgs e)
    {
        //load the response body string as an HtmlAgilityPack.HtmlDocument
        //(fully qualified, because WebTesting also defines an HtmlDocument)
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(e.Response.BodyString);

        //locate the "Nav" element; GetElementbyId returns null if it is missing
        HtmlNode navNode = doc.GetElementbyId("Nav");

        //pick the first <li> element below the Nav element
        HtmlNode firstNavItemNode = (navNode == null) ? null : navNode.SelectSingleNode(".//li");

        //validate the first list item in the Nav element says "Windows";
        //a missing element fails validation instead of throwing
        e.IsValid = firstNavItemNode != null && firstNavItemNode.InnerText == "Windows";
    }
}

Updated: Fixed the XPath query thanks to Oleg's comment. Also fixed the indentation of the code.
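
The note doesn't say what the original query looked like, so the following is only a guess at the kind of mistake involved. In XPath, a path that starts with // is evaluated from the document root even when it is called on a child node, while .// stays inside the current node. A small sketch with made-up markup (the XPathScopeDemo class and its QueryFromNav helper are hypothetical) shows the difference:

```csharp
using System;
using HtmlAgilityPack;

public static class XPathScopeDemo
{
    // Illustrative markup: two lists, only the second one is the "Nav" element.
    const string Html =
        "<ul id=\"Header\"><li>Sign in</li></ul>" +
        "<ul id=\"Nav\"><li>Windows</li></ul>";

    // Runs the XPath query with the Nav element as the context node and
    // returns the text of the first match, or null if nothing matches.
    public static string QueryFromNav(string xpath)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(Html);

        HtmlNode nav = doc.GetElementbyId("Nav");
        HtmlNode match = nav.SelectSingleNode(xpath);
        return match == null ? null : match.InnerText;
    }

    public static void Main()
    {
        // Absolute path: searched from the document root, so the match can
        // come from outside the Nav element.
        Console.WriteLine("//li  -> " + QueryFromNav("//li"));

        // Relative path: only descendants of the Nav element are considered.
        Console.WriteLine(".//li -> " + QueryFromNav(".//li"));
    }
}
```

With this markup, the absolute query should pick up the "Sign in" item from the header list, while the relative one should return "Windows", which is why the web test above uses ".//li".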
