Better HTML parsing and validation with HtmlAgilityPack


Let’s face it: sometimes the Microsoft.VisualStudio.TestTools.WebTesting.HtmlDocument class just doesn’t cut it when you’re writing custom extraction and validation code.  HtmlDocument was originally designed as an internal class to parse URLs for dependent requests (such as images) out of HTML response bodies very efficiently.  Before VS 2005 RTM, we made HtmlDocument part of the public WebTestFramework API, but scheduling and resource constraints prevented us from adding more general-purpose DOM features like InnerHtml, InnerText, and GetElementById.  You could always parse the HTML string yourself, but fortunately there’s a better option: HtmlAgilityPack.

HtmlAgilityPack is an open source project on CodePlex.  It provides standard DOM APIs and XPath navigation — even when the HTML is not well-formed!
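As a rough illustration of that claim, here is a minimal sketch that feeds HtmlAgilityPack a deliberately sloppy fragment and then uses both a DOM-style lookup and an XPath query on it. The markup, the class name MalformedHtmlSketch, and the list items are made up for this example; only LoadHtml, GetElementbyId, DocumentNode, SelectNodes, Name, and InnerText come from the library.

```csharp
using System;
using HtmlAgilityPack;

public static class MalformedHtmlSketch
{
    // Hypothetical fragment: the id attribute value is unquoted, and the
    // <div>, <body>, and <html> elements are never closed.
    public const string Html =
        "<html><body><div id=Nav><ul><li>Windows</li><li>Office</li></ul>";

    public static void Main()
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(Html);

        // DOM-style lookup by id; returns null when no element has that id.
        HtmlNode nav = doc.GetElementbyId("Nav");
        Console.WriteLine(nav != null ? nav.Name : "(no element with id Nav)");

        // XPath query from the document root; SelectNodes returns null
        // (not an empty collection) when nothing matches.
        HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//li");
        if (items != null)
        {
            foreach (HtmlNode item in items)
            {
                Console.WriteLine(item.InnerText);
            }
        }
    }
}
```

Both lookups can return null, so checking for null before dereferencing is worth the extra line in validation code.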

Here’s a sample web test that uses the HtmlAgilityPack.HtmlDocument instead of the one in WebTestFramework.  It simply validates that Microsoft’s home page lists Windows as the first item in the navigation sidebar.  Download HtmlAgilityPack and add a reference to it from your test project to try out this coded web test.


using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.VisualStudio.TestTools.WebTesting;
using HtmlAgilityPack;

public class WebTest1Coded : WebTest
{
    public override IEnumerator<WebTestRequest> GetRequestEnumerator()
    {
        WebTestRequest request1 = new WebTestRequest("http://www.microsoft.com/");
        request1.ValidateResponse += new EventHandler<ValidationEventArgs>(request1_ValidateResponse);
        yield return request1;
    }

    void request1_ValidateResponse(object sender, ValidationEventArgs e)
    {
        // Load the response body string as an HtmlAgilityPack.HtmlDocument
        // (fully qualified because WebTesting also defines an HtmlDocument)
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(e.Response.BodyString);

        // Locate the "Nav" element (null if no element has that id)
        HtmlNode navNode = doc.GetElementbyId("Nav");

        // Pick the first <li> element under Nav; ".//li" is relative to navNode
        HtmlNode firstNavItemNode = navNode == null ? null : navNode.SelectSingleNode(".//li");

        // Validate that the first list item in the Nav element says "Windows";
        // a missing Nav element or list item fails validation instead of throwing
        e.IsValid = firstNavItemNode != null && firstNavItemNode.InnerText == "Windows";
    }
}



Updated: Fixed the XPath query thanks to Oleg’s comment.  Also fixed the indentation of the code.
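The XPath fix comes down to where the query starts: "//li" is an absolute path that searches from the document root even when it is called on a node, while ".//li" and "descendant::li" search only below the current node. The sketch below shows the difference. To keep it runnable without extra references it uses System.Xml's XmlDocument on well-formed markup rather than HtmlAgilityPack; the same XPath rule is what Oleg's comment describes for HtmlNode.SelectSingleNode. The sample markup, the class name XPathScopeSketch, and the helper FirstFromNav are hypothetical.

```csharp
using System;
using System.Xml;

public static class XPathScopeSketch
{
    // Hypothetical markup: a header list appears before the list inside Nav.
    public const string Sample =
        "<body>" +
        "<ul id='Header'><li>Sign in</li></ul>" +
        "<div id='Nav'><ul><li>Windows</li><li>Office</li></ul></div>" +
        "</body>";

    // Evaluates xpath with the Nav element as the context node and
    // returns the text of the first match, or null if nothing matches.
    public static string FirstFromNav(string xpath)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(Sample);
        XmlNode nav = doc.SelectSingleNode("//*[@id='Nav']");
        XmlNode match = nav.SelectSingleNode(xpath);
        return match == null ? null : match.InnerText;
    }

    public static void Main()
    {
        Console.WriteLine(FirstFromNav("//li"));            // Sign in (searched the whole document)
        Console.WriteLine(FirstFromNav(".//li"));           // Windows (searched under Nav only)
        Console.WriteLine(FirstFromNav("descendant::li"));  // Windows
    }
}
```

An absolute query can appear to work when the element you want happens to be the first match in the whole page, which is why this kind of bug tends to go unnoticed until the page layout changes.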


Comments (7)

  1. Now, this is cool if you do a lot of html parsing! You can tell I was drawn to it by the word "Agile"

  2. What’s wrong with SgmlReader?

  3. Josh, your sample is broken. //li is absolute XPath selection. So navNode.SelectSingleNode("//li") returns first <li> in the document, not under navNode. If you need to select <li> descendant of navNode you need

    navNode.SelectSingleNode(".//li") or

    navNode.SelectSingleNode("descendant::li");

  4. JoshCh says:

    Thanks Oleg, I thought something wasn’t right with that XPath, but it worked so I left it alone 🙂  I’ll update the code.

    I haven’t used SgmlReader myself, but I’ve read multiple posts saying HtmlAgilityPack works much better for malformed HTML.

    Josh

  5. Jeff Beehler on Sam’s Credo. Josh Christie on Better HTML parsing and validation with HtmlAgilityPack….

  6. Visual Studio Team System for Testers Content Index for Web Tests and Load Tests Getting Started Online

  7. chenming says:

    Let’s face it: sometimes the Microsoft.VisualStudio.TestTools.WebTesting.HtmlDocument class doesn’t cut it when you are writing custom extraction and validation rules. HtmlDoc…