Broken links in the SDK

My friend Ken Milne has posted on our team blog about a subject near and dear to my heart: broken links in the SDK. If you care about the subject, go check out his post. Ken does a great job summarizing our successes and failures when it comes to links, and correctly points out that even though only 1% of all links in the SDK are broken, that still represents 24,000 broken links, which is tremendously ugly. Still, having that many broken links presents a conundrum for the SDK team: how do we resolve them?

When I worked on the MSDN Online Library, this was a constant problem. We published somewhere in the neighborhood of 500,000 pages in the Library, of which around a quarter refreshed every three months. With that much churn in the content, it was impossible to do a full sanity pass, especially given the resources we had. (It would amaze you how few people actually worked on keeping that content up to date and published.) We often asked product teams to scan their content, but the sorry fact is that while I was there, MSDN never had a broken link checker that could be run systematically.

However, we did make a push to clean up broken links to the extent possible. One of the team's devs was tasked with creating a broken link checker, so he built one in an ad hoc manner, essentially reusing some of the back-end code of the HavMatLite tool that Ken describes in his post. We found that roughly 5% of all links online were broken. The obvious next problem was how to fix those links. And that was the hard part.
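Ken's post doesn't go into the checker's internals, and I don't remember them in detail, but the core of any such tool is the same: pull the hyperlinks out of each page and test each target. Here's a minimal sketch of that idea in Python; the seed URL is just an illustration, and our actual tool was built on HavMatLite's back end, not on anything like this code.

```python
import urllib.error
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def broken_links(page_url):
    """Fetch a page, extract its links, and return the ones that fail."""
    with urllib.request.urlopen(page_url, timeout=10) as resp:
        html = resp.read().decode("utf-8", "replace")
    parser = LinkExtractor()
    parser.feed(html)
    broken = []
    for href in parser.links:
        target = urljoin(page_url, href)  # resolve relative links
        try:
            with urllib.request.urlopen(target, timeout=10):
                pass  # the fetch succeeded, so the link is alive
        except (urllib.error.URLError, ValueError):
            broken.append(target)
    return broken

# Hypothetical seed page; a real checker would walk the whole Library.
for url in broken_links("https://msdn.microsoft.com/library/default.aspx"):
    print("broken:", url)
```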

Extrapolating from Ken's estimate of 2.3 million hyperlinks in the Windows SDK, there are probably about 4 million links in the online MSDN Library. If 1% of those links are broken, that represents 40,000 broken links. If our estimate of 5% holds, that represents an astonishing 200,000 broken links.

Add a few other complications to that. The first is that many of the reported broken links were false positives. The tool, having been slapped together quickly, had trouble with redirected links, even when the redirection was by design. It also had trouble with links that used different variations of the Library's URLs, and occasionally with links that weren't authored in a standard way. As much as half the links reported as broken weren't actually broken.
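Both of those failure modes have cheap mitigations: follow redirects and judge the link by the status of the final response, and normalize URL variants before checking them. Here's a sketch of both ideas; the normalization rules are illustrative, not the Library's actual URL conventions.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Collapse trivial URL variants so the same page isn't flagged
    under several spellings: lowercase the scheme and host, drop the
    fragment, strip a trailing slash. (Illustrative rules only.)"""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

def final_status(url):
    """urlopen follows redirects by default, so a deliberate redirect
    is judged by the page it lands on, not reported as broken."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status          # status of the final response
    except urllib.error.HTTPError as err:
        return err.code                 # 404, 410, and friends
    except (urllib.error.URLError, ValueError):
        return None                     # DNS failure, bad URL, ...

def is_broken(url):
    status = final_status(normalize(url))
    return status is None or status >= 400
```

Deduplicating on the normalized form also keeps the checker from hammering the same page dozens of times.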

The second major complication is the old story of who is responsible for fixing the links. The obvious place for a fix to live would be with the contributors, since MSDN proper owns only a very small percentage of the content that appears on its site. Here, though, we run into the ever-present problem of resources. It's an unfortunate fact that User Education tends to be one of the last things a team staffs. Some teams - Office is a good example - make UE a central part of the process, but even there, people are tasked to work on the VNext release of a product rather than on supporting an older release. To some extent this is a classic Microsoft problem: how do we pay attention to supporting our existing user base while still striving to innovate and document our cool new tools?

And the third major complication is the large number of diverse files involved. Some breaks lent themselves to a global fix: there were thousands of files with bad URLs in their footers or in links shared globally across the docs. Those could be knocked out with a global search-and-replace, and took very little time.
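For what it's worth, that kind of fix really is just a few lines of script. Here's a sketch; the URLs in the mapping are placeholders, and a real pass would be driven by a table of retired addresses.

```python
import pathlib

# Placeholder mapping; the real fix-up covered many retired addresses.
REWRITES = {
    'href="http://msdn.microsoft.com/old/footer.asp"':
    'href="http://msdn.microsoft.com/new/footer.aspx"',
}

def fix_tree(root):
    """Apply every old -> new rewrite to every HTML file in a doc tree."""
    changed = 0
    for path in pathlib.Path(root).rglob("*.htm*"):
        text = original = path.read_text(encoding="utf-8", errors="replace")
        for old, new in REWRITES.items():
            text = text.replace(old, new)
        if text != original:
            path.write_text(text, encoding="utf-8")
            changed += 1
    return changed

print(fix_tree("docs"), "files updated")
```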

However, the pain really got heavy when it came to individual broken links in files. We had many docsets where the broken links were scattered all over the place. Pages linked to a tremendous variety of targets, and it would have taken a great deal of intensive work to fix the content. We had to think about bang for the buck, along with the fact that much of the content churned in relatively short order. What to do? Get contractors? Work long hours on basically throwaway work? Push hard on the contributor teams to rev their content? Comment out the oldest broken links?

We tried all of those solutions, to be honest, and we probably didn't do any of them well. I was especially frustrated with the way we band-aided the links, fixing just enough to hit the numbers dictated by our management, only to see those numbers jump right back when the docs rev'd again.

I'm curious to hear what you guys think is the best way to attack the problem of broken links. I've been away from MSDN for ten months now, so this is all past history (and they've been working on a URL schema that should dramatically reduce the number of broken links on MSDN). But we still have to face our own variation of this problem in the Windows SDK. It's not an easy problem, and we could use any suggestions to help us make the right decisions on this.