Debugging data-driven development – Doable?

In a discussion with some architects yesterday, I came upon the point that a new form of debugging is needed. So much development these days is data driven, and when something goes wrong you’re back to trial-and-error poking at the problem. I’m sure this isn’t a completely original thought, but I haven’t seen much written about it.

Consider the following three scenarios:

1) You’re writing an XSLT transform for some XML data. If you don’t get the syntax just right, the cryptic error message often doesn’t give much help.

2) You’re using a highly templated class library such as STL. If you don’t get the syntax just right, the cryptic error message often doesn’t give much help.

3) You’re writing a hairy SQL “select” statement. If you don’t get the syntax just right, the cryptic error message often doesn’t give much help.

It’s easy to kill a lot of time “debugging” these problems.
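To make the pain concrete, here's a minimal sketch of scenario 3 using Python's built-in sqlite3 module (the table and query are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")

# One stray comma before FROM is enough to derail the whole statement.
try:
    conn.execute("SELECT id, total, FROM orders")
except sqlite3.OperationalError as err:
    print(err)  # the engine says roughly where it gave up, not what you meant
```

Running this prints something like `near "FROM": syntax error` — technically accurate, but it reveals nothing about how the engine saw the statement.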

The common denominator in these scenarios is that your “code” is really just a data declaration. Some “engine” takes that data and performs “magic” on it, and you see the results, be it new XML, machine code, or a dataset. The problem is that there’s little visibility into these engines. Wouldn’t it be cool if you could “step” through the engine as all the conceptual processing takes place?

As I see it, there are two ways this problem could be addressed. Feel free to pop in with others.

1) The engine (be it XSLT, the compiler or the database) could provide a debugger-like view of its processing, letting you inspect the intermediate data and control each step of the processing. Obviously, writing something like this would be a lot of work. I have seen tools such as XSLT debuggers, but they don’t appear to be fully mainstream. More importantly, it would be great if other engines could do something similar.

2) The engine could be “instrumented” so that it emits intermediate results in some usable format. For this to really work well, the engines should standardize on some sort of output format. For instance, database engines (be they SQL Server, Oracle, or MySQL) should generate the same basic format, so that you wouldn’t need to wrap your head around a completely different format when going from one DB to another. Ditto for compilers.
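A few engines already do a version of this, each in its own format. SQLite, for example, will describe its query plan instead of executing the statement. A small sketch (the table is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")

# Prefixing a statement with EXPLAIN QUERY PLAN asks the engine to emit
# its intermediate plan rather than run the query.
for row in conn.execute("EXPLAIN QUERY PLAN "
                        "SELECT total FROM orders WHERE id = 7"):
    print(row)
```

Oracle’s `EXPLAIN PLAN` and SQL Server’s `SET SHOWPLAN_TEXT` do the same job, but none of the outputs are interchangeable — which is exactly the standardization gap described above.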

I know I haven’t proposed any actual workable solutions here, but it’s a problem space where I often find my mind wandering.
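One modest step toward “workable” is to automate the trial-and-error itself: start from a canonical form that works, add the complexity back one piece at a time, and let a script report the first addition that breaks. A toy sketch (the table and clause list are invented; the same loop would work for XSLT or templates given a suitable checker):

```python
import sqlite3

def first_breaking_clause(base, clauses):
    """Grow a query clause by clause; return the first addition that fails."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
    query = base
    for clause in clauses:
        candidate = query + " " + clause
        try:
            conn.execute(candidate)
        except sqlite3.OperationalError:
            return clause  # this is the step that broke the statement
        query = candidate
    return None

# 'ORDER BY' is missing its column -- the script pinpoints it for us.
print(first_breaking_clause("SELECT a FROM t",
                            ["WHERE a > b", "ORDER BY", "LIMIT 5"]))
```

This doesn’t open up the engine, but it turns the manual poking into something repeatable.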

Comments (5)

  1. It is very much needed… writing SQL statements is a real PITA, and I can see how messing up XSLT could be bad too (it can vary, though; some rendering engines are better at pointing out errors)

  2. KiwiBlue says:

    An "instrumented" compiler would be extremely useful for C++ template metaprogramming.

  3. Tonetheman says:

    I have wanted one of these for a long time with C++ templates. Sadly, I end up just breaking down the template by hand and compiling until I have built the original template back up.

    For Oracle and SQL in general, it would be nice to have callbacks into the parse phase so you could get a better idea of what was happening. I think you would run into problems with the common format, though, since most vendors parse differently and probably could not agree on a common framework or set of callbacks.

  4. Adrian says:

    When debugging these kinds of problems manually, one technique is to temporarily simplify the code/query/declaration. Go back to a canonical form that does work, then, in small incremental steps, add the complexity back in until something goes wrong. (This technique works for regular code debugging as well as data debugging.)

    Perhaps there’s some way to automate this process rather than trying to make the interpreter’s processing more transparent.

    Steve Maguire has some good examples of data debugging in _Writing Solid Code_. He has a disassembler that’s almost completely table driven, so he added sanity checks that search the table for ambiguities, duplication, holes, etc. This gives more specific diagnostics than you could get from a boolean validator.

    Parsers that are auto-generated from grammar rules are usually the worst at giving useful error messages. Hand-crafted parsers can be much better here.

    It’s probably impractical to build a C++ parser by hand, but perhaps we could make a hand-crafted error-message parser that trims away the noise in an STL syntax diagnostic. Scott Meyers outlines a process for this in _Effective STL_.

  5. IanC says:

    Part of the problem here must be that it’s difficult for the engine to tell you something about the thing it doesn’t understand. Maybe the key thing here is the difference between syntax and semantics. If the syntactic/lexical analysis fails, the engine probably doesn’t have a sensible token tree to emit.

    A pragmatic approach is the ‘interruptible process’ approach, which I think is a combination of your 1 and 2.

    Consider a classic C compiler. Using the (DOS convention) /P, /C, /E and /I options gives lots of insight into what the compiler is making of the stuff you’ve given it. Use /P /C to get a commented but pre-processed output. Now use this as your source code, and the debugger shows you the actual tokens that were compiled, not the higher level macros.

    I’ve tracked down several bugs like this, typically where people have left parentheses off macro parameters, so that when expanded the operator precedence isn’t what they meant it to be.

    I saw one of my team using a nice little tool the other day to see how their regular expressions were being parsed and implemented. This seems like a very similar thing.

    I never got my head around lex. Maybe, if the right grammars were built for it, it would be able to cover some instances of this class of problem?
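The regex tool IanC mentions has a free, built-in analogue in Python: compiling a pattern with the `re.DEBUG` flag makes the engine print its parsed form of the expression at compile time — a small example of an engine exposing its intermediate representation (the pattern here is invented for illustration):

```python
import re

# re.DEBUG dumps the engine's parse of the pattern to stdout when it
# compiles: literals, repeats, character classes, and capture groups.
pattern = re.compile(r"(\d+)-(\d+)", re.DEBUG)

print(bool(pattern.match("10-20")))  # prints: True
```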