Learning to Love C#

At heart, I'm a programmer. I've been a programmer for 30 years; my first programs were persisted (a word I don't believe we had then) to punched paper tape, and I remember how a magnetic tape system called DECtape changed our lives. Although it isn't what I do full-time for a living now (that would be mostly going to meetings :)), I still love programming. I had the opportunity the other day to write a program in C# to do some data conversion. In Office 12, we're moving from the old HTML Help .CHM files for offline storage to a new MS Help 2.0 format called HXS. One challenge here is capturing all the context-sensitive help information embedded in the CHM files that, unfortunately, for arcane historical reasons never actually made it into our help authoring store. Instead, this information is stored in files that associate symbolic tags with numbers, like this:

#define olconUnderstandingFormsCache 1045139
#define olconUseFieldChooser 1045140

These allow the program code (Outlook in this case) to use the symbol "olconUnderstandingFormsCache" when it wants to make a context-sensitive help call about, well, understanding the forms cache, I guess. The help system gets the far more prosaic context-sensitive ID 1045139, which it uses to find the correct topic to display. By the way, I should mention one thing that's changed in our context-sensitive help display: multiple topics can be found, not just one (technically: multiple topics can have the same context-sensitive ID number). The full help experience, including browse and search, is always available, so if the context-sensitive topics we thought to link to an entry point don't do it for you, you can always poke around for something else.
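To make the two-step lookup concrete, here is a minimal sketch of it in C#. It is only an illustration: the class name, the method name, and the topic titles are all invented for the example, and the real help system surely stores this differently.

```csharp
using System.Collections.Generic;

public static class ContextHelpLookup
{
    // Symbol -> numeric ID, taken from the two #define lines shown above.
    public static readonly Dictionary<string, int> SymbolToId = new Dictionary<string, int>
    {
        { "olconUnderstandingFormsCache", 1045139 },
        { "olconUseFieldChooser", 1045140 },
    };

    // Numeric ID -> topics. One ID may map to several topics.
    // The topic titles are hypothetical placeholders, not real help topics.
    public static readonly Dictionary<int, List<string>> IdToTopics = new Dictionary<int, List<string>>
    {
        { 1045139, new List<string> { "Hypothetical topic A", "Hypothetical topic B" } },
        { 1045140, new List<string> { "Hypothetical topic C" } },
    };

    // Resolve a symbol to every topic that shares its context-sensitive ID.
    // Returns an empty list when the symbol or the ID is unknown.
    public static IReadOnlyList<string> TopicsFor(string symbol)
    {
        if (SymbolToId.TryGetValue(symbol, out int id) &&
            IdToTopics.TryGetValue(id, out List<string> topics))
        {
            return topics;
        }
        return new List<string>();
    }
}
```

Calling ContextHelpLookup.TopicsFor("olconUnderstandingFormsCache") returns both placeholder topics registered under 1045139, which is the "multiple topics, one ID" case described above.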

We needed a tool to parse all these legacy files (Office has about 200 of these) and put the data into a SQL database we use to generate the context-sensitive mapping files used by our new help system. Our full-time developers were busy doing more important things like getting the colors of the icons in the help window right and fixing bugs with titles like "Convert a freetext query for a short assetid into an ID query for the full AssetID", so I volunteered to write this code. 

I was only vaguely familiar with C# when I started and I have to say: I love it! One reason is of course the depth and breadth of the .NET library - there pretty much is a class for everything and most of them work the way you'd like them to work. Not having to worry about memory management is of course wonderful. But I just now discovered another thing that makes programming in C# with the CLR a joy.

Part of the job of my little tool is to parse the lines above into their component parts. Of course, .NET has a great class for doing this called Regex. Regex encapsulates a regular expression; when you throw text at a regular expression, it gobbles it up according to the rules of the regular expression and returns a nay or yea - did it match? As a side effect, it can also pull out and squirrel away parts of the input text. This is exactly what I need for the lines above - pull out the symbol and the number. Of course, the whitespace between them is variable - it may be spaces, tabs, or any combination. No problem for regular expressions - so here's my regular expression to match lines like that:

Regex regexDefineMatch =
    new Regex( @"#define\s*(?<token>\S*)\s*(?<value>[0-9]+" );

This basically says: match text that is

  1. the literal "#define"
  2. followed by any amount of whitespace (the \s*)
  3. followed by any amount of non-whitespace as the token (squirreling it away as "token" for later reference),
  4. followed by any amount of whitespace
  5. followed by the numeric ID - a string of digits (which is squirreled away as "value").

So far so good. The code to apply this to a string read from the file of mappings like that above looks like this:

Match defineParse = regexDefineMatch.Match( strLine );

if (defineParse.Success)
{
   string strToken = defineParse.Groups["token"].Value;
   string helpID = defineParse.Groups["value"].Value;
...

strLine is a string representing a line read from the file, and the Match method of the Regex class applies the regular expression to the string, returning a Match object whose Success property is true if it matched. Then I pull out the component pieces from the named groups and start updating the database.

I ran the code and it clearly was doing something - I had put in some UI update to show which file it was reading and it plowed through all the files. But nothing got updated. Hmmm... Now in the bad old world of C and C++, I'd pull out my debugger and start stepping through code, to see where it went wrong. But in the good new world of C#, I saw these lines on the output console:

Unable to update checked-in page from InterPress. Details: parsing "#define\s*(?<token>\S*)\s*(?<value>[0-9]+" - Not enough )'s.

Now the "Unable to update checked-in page from InterPress." text in the above came from my code; it was in a top-level exception handler that looked like this:

catch(Exception ex)
{
   ip.LogInfo("Unable to update checked-in page from InterPress\nDetails:\n" + ex.Message, true);
}

So deep in the bowels of my code, I called the regular expression parser and it found a typo in my regular expression. Normally, it would have taken me maybe a half-hour of stepping through code, finally isolating the line that wasn't working (most likely the line that constructs the Regex, since that is where the pattern is parsed) and then puzzling over why it wasn't working. But here the Regex class conveniently told me exactly what the problem was: I was missing a closing parenthesis in the regular expression. This didn't happen because I put error handling in everywhere, but because I had a top-level exception handler that said, hmm, something went wrong below me and here is what it is (ex.Message).
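For anyone who wants to see both halves of the story in one place, here is a minimal sketch. It assumes the only thing wrong with the pattern was the missing closing parenthesis. The names DefineParser, TryParse, and DescribeBrokenPattern are made up for this example and are not from my tool. The exact wording of the exception message differs between .NET versions, so the sketch returns it as-is.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DefineParser
{
    // The pattern from the post with the closing parenthesis restored.
    public const string FixedPattern = @"#define\s*(?<token>\S*)\s*(?<value>[0-9]+)";

    // The pattern as originally typed, missing the final ')'.
    public const string BrokenPattern = @"#define\s*(?<token>\S*)\s*(?<value>[0-9]+";

    // Splits a "#define SYMBOL NUMBER" line into its two named groups.
    // Returns false (and empty strings) when the line does not match.
    public static bool TryParse(string line, out string token, out string value)
    {
        Match m = Regex.Match(line, FixedPattern);
        token = m.Success ? m.Groups["token"].Value : string.Empty;
        value = m.Success ? m.Groups["value"].Value : string.Empty;
        return m.Success;
    }

    // Constructing a Regex from a malformed pattern throws ArgumentException;
    // this returns the message a top-level handler would end up logging,
    // or null if the pattern unexpectedly parsed.
    public static string DescribeBrokenPattern()
    {
        try
        {
            new Regex(BrokenPattern);
            return null;
        }
        catch (ArgumentException ex)
        {
            return ex.Message;
        }
    }
}
```

With the fixed pattern, DefineParser.TryParse("#define olconUseFieldChooser 1045140", out token, out value) yields the token "olconUseFieldChooser" and the value "1045140". DescribeBrokenPattern() catches ArgumentException directly only to keep the example small; in the tool itself the message surfaced through the general catch(Exception ex) handler shown earlier.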