In Search Of The Missing Link

Every now and then I get to write actual code rather than just documentation. Usually there's either a crowd watching in amazement that I can actually find Visual Studio, never mind knowing some of the magic keywords that make it all work when you press the green arrow button, or else everyone is cowering behind their desks in case my computer can't cope with the culture shock and explodes. Isn't it wonderful when everyone has so much faith in your capabilities - after all, I've read the .NET Architecture Guide (endlessly, as I've been working on it for the last year), so I ought to know a bit about this stuff.

Unfortunately, as I've rambled about in previous posts (such as "How p&p Makes Cheese Sandwiches"), my programming tasks tend to involve building kludges and "temporary fix" tools to solve problems that are either too esoteric to be of any use to people outside our small documentation team, or are a stop-gap until the proper tool gets upgraded next time. OK, so I used to be a consultant and I wrote a few Web apps for customers, but I'd hesitate to publish "best practice compliance" figures for those. Especially as they were usually kludges too, needed to make software the company had paid big money for work the way they wanted it to.

So, anyway, last week I decided to put together a rough tool to help us find broken and incorrect links in a documentation set that builds to create a CHM or HxS file. The issue is that, although the authoring tools we use can create links between topics and within topics, you can't easily (or, in some cases, at all) check whether these links are valid. They may point to a topic that you removed, or the target topic may have changed (and so has a different auto-generated topic filename), or they may just point to the wrong topic. We do use a link checker utility to find broken links, but it can't find links that go to the wrong topic, or links that point to an anchor (bookmark) that does not exist in the same page. The only way we can verify these is through manual testing ("click and read").

In theory, the process for automatically checking the links that the link checker cannot verify sounds relatively simple. Every topic page is an HTML file generated by the documentation tools from the source Word documents. The text of a link that points to a separate topic should match the title of that topic. And a topic containing an in-page link should contain an anchor that matches the bit after the "#" in the link href. So it's just a matter of applying some processing to each HTML file to verify these rules. I could do it by reading the pages one by one using MSXML, or by just opening them as text files and reading the source directly.
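To make that concrete, the two rules boil down to something like this - a minimal sketch in C#, where the class and method names are mine rather than anything from the real tool:

    using System.Text.RegularExpressions;

    static class LinkChecks
    {
        // Rule 1: the text of an inter-topic link should match the title
        // of the topic it points to.
        public static bool LinkTextMatchesTitle(string linkText, string topicTitle)
        {
            return linkText == topicTitle;
        }

        // Rule 2: a link such as href="#details" should have a matching anchor
        // (<a name="details" ...> or id="details") somewhere in the same page.
        public static bool AnchorExists(string pageHtml, string anchorName)
        {
            return Regex.IsMatch(pageHtml,
                @"(?:name|id)\s*=\s*[""']" + Regex.Escape(anchorName) + @"[""']",
                RegexOptions.IgnoreCase);
        }
    }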

I chose the second approach for no better reason than that it seemed easier and quicker to build, and because I already had most of the code from another tool that did similar stuff to update various sections of the source files (such as inserting feedback links and index entries). All it involves is some judicious string handling, searching, and text comparisons. Of course, it got more complex as time went on because I found I needed to allow for optional settings, such as allowing the case of titles to differ and ignoring leading and trailing spaces in topic titles. But it was fairly easy to stir in a mixed selection of semi-appropriate keywords and variables, and bring the whole lot slowly to the boil.
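Once those options are stirred in, the title comparison ends up looking something like this (again just a sketch, with invented names for the settings):

    using System;

    static class TitleComparer
    {
        // ignoreCase lets the case of the titles differ; trimTitles ignores
        // leading and trailing spaces in topic titles.
        public static bool TitlesMatch(string linkText, string topicTitle,
            bool ignoreCase, bool trimTitles)
        {
            if (trimTitles)
            {
                linkText = linkText.Trim();
                topicTitle = topicTitle.Trim();
            }
            return string.Equals(linkText, topicTitle,
                ignoreCase ? StringComparison.OrdinalIgnoreCase : StringComparison.Ordinal);
        }
    }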

Did I use agile development methods, as I know I should (especially after working with the p&p dev teams for so long)? Well, if you can call throwing it together and fixing broken bits afterwards "agile", then yes. But, in reality, no. I should have started by writing tests, but the tool simply reads the files and dumps the results out as a log file, so how do I write tests for that? Probably I should have divided the task into a series of separate routines for each minor action, and written tests for each one, but that seemed overkill for such a simple project. Instead, I did all the testing as I went along by adding errors to a set of existing source docs and then checking that each one was detected as I added features to the code.

After it all seemed to be working, confirmed by fairly comprehensive testing by my advisory panel and beta test team (Hi, Nelly), I decided I should do the agile "refactor" thing. Except there was just one chunk of code that got used more than once (twice in fact), and it was only three lines. Perhaps I could write some generalized code routine with half a dozen parameters for some of the functionality, and call it from the three or four places in the code that seemed appropriate. Would that be more efficient than repeating half a dozen lines of code in four places? It would be easier to maintain, but probably take time to code, fix, and retest everything again.

Next, I did the "make it more efficient" thing. One thing the code needs to do for each inter-topic link it finds in every page is extract the title from the linked page (if it exists) to see if it matches the text in the link. First time round, I had a routine that opened the file, located the <title> element, and returned the contents. But this meant I was opening every topic multiple times; for example, every page contains a link to the contents topic, so that file gets opened and read once for every page in the doc set.
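In essence, that first version was little more than this (a simplified sketch of what the routine did, not the actual code):

    using System.IO;
    using System.Text.RegularExpressions;

    static class TopicReader
    {
        // First attempt: open the linked page and pull out the contents of its
        // <title> element. Every call hits the disk, even for a topic that has
        // already been read.
        public static string GetTopicTitle(string topicFilePath)
        {
            string html = File.ReadAllText(topicFilePath);
            Match match = Regex.Match(html, @"<title>(.*?)</title>",
                RegexOptions.IgnoreCase | RegexOptions.Singleline);
            return match.Success ? match.Groups[1].Value.Trim() : null;
        }
    }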

Obviously this is hugely inefficient, even if .NET caches the file contents. So, I simply added code to save each title it found in a Dictionary using the file name as the key, and to look in the Dictionary before reading the disk file. That way the code only opens and reads each topic page once. Makes sense, but when I ran the code again it made almost no difference. I did some comparative tests, and the reduction in the time taken to process the doc set averaged somewhere around 5%. And as the tool, running on a fairly average desktop machine, can process a doc set containing 465 HTML topic files in less than 5 seconds, how much time and effort should I put into fine-tuning the code? Especially as you only run the tool a few times at the end of a project, after the docs are complete and you are ready to build them into a CHM or HxS...
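For the record, the whole "optimization" amounts to little more than this (again a sketch, reusing the hypothetical title-reading code from above):

    using System.Collections.Generic;
    using System.IO;
    using System.Text.RegularExpressions;

    static class CachedTopicReader
    {
        // Titles already extracted, keyed by file name, so each topic page
        // is opened and read from disk only once.
        private static readonly Dictionary<string, string> titleCache =
            new Dictionary<string, string>();

        public static string GetTopicTitle(string topicFilePath)
        {
            string title;
            if (!titleCache.TryGetValue(topicFilePath, out title))
            {
                string html = File.ReadAllText(topicFilePath);
                Match match = Regex.Match(html, @"<title>(.*?)</title>",
                    RegexOptions.IgnoreCase | RegexOptions.Singleline);
                title = match.Success ? match.Groups[1].Value.Trim() : null;
                titleCache[topicFilePath] = title;
            }
            return title;
        }
    }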

Don't get me wrong, I'm not in any way suggesting you should ignore the principles of coding best practice, agile development and pre-written tests, and proper pre-release validation. But, sometimes, just getting stuff done by relying on the power of modern machines and software environments does seem appropriate. Though I guess I'm not likely to get any jobs building "proper" enterprise applications now that I've 'fessed up...