Code behaving badly

Wow, two programming/C# entries in a single day…

Before I joined the C# team, I was the test lead for the C++ compiler for a number
of years. We would periodically get customer comments that “the compiler was
broken”, and upon further investigation, we would usually find that it was a bug in
the program. How quickly people blamed the compiler usually correlated with their
experience – those with more experience normally suspected their code first,
and only after careful research would consider the compiler (and they were usually
right at that point).

One of the nice things about the C# compiler not having pointers is that it’s much
harder to accomplish bad things (“Try to imagine all life as you know it stopping
instantaneously, and every molecule in your body exploding at the speed of light”.
Shame on you if you don’t recognize the quote). If you’re playing the interop game,
you’re back in the pointer-world of sharp sticks, and you can easily create the otherwise
elusive “Execution Engine Error”.

Last week, I upgraded to build 30730 of VS and the runtime. (This means “third year,
seventh month, and 30th day”, and is also known in Microsoft parlance as the
“Julian Date”, even though it isn’t a julian date. This replaced our previous
scheme (also not a julian date) that we used
on VS 2002 and 2003, which replaced the scheme we used in VS6 (also not
a julian date). As far back as I remember, our numbers had always been called julian
dates but never were. An ideal dating system is monotonically increasing by 1 (so
you can tell how far apart builds are) and easy to convert to human-readable dates
(so you know when the build was created), but that’s not really possible, so at least
we’ve finally settled on something where you know when the build was, and it works
for more than a couple of years (previous versions broke badly when confronted with
the long dev cycle of VS 2002). It’s a testament to the understandability of the previous
schemes that I don’t remember what they are, but I do know that many people ran
little JDate applications on their desktops so they knew what jdate to use for today.
But I digress.)
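
As an aside to the aside, the "year, month, day" reading of a build number like 30730 is plain integer arithmetic. The sketch below is only an illustration: the `BuildNumber` class and `Decode` method are hypothetical names, and it assumes a five-digit number laid out as one year digit, two month digits, and two day digits. It deliberately does not guess which calendar year the year digit maps to.

```csharp
using System;

public static class BuildNumber
{
    // Hypothetical helper: splits a build number laid out as Y MM DD
    // (assumed layout) into its three parts.
    public static void Decode(int build, out int yearDigit, out int month, out int day)
    {
        yearDigit = build / 10000;     // leading digit
        month = (build / 100) % 100;   // middle two digits
        day = build % 100;             // last two digits
    }

    public static void Main()
    {
        int y, m, d;
        Decode(30730, out y, out m, out d);
        Console.WriteLine("year {0}, month {1}, day {2}", y, m, d);
        // prints "year 3, month 7, day 30"
    }
}
```

This also shows why the scheme is not "monotonically increasing by 1": the build after 30731 would be 30801, not 30732.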

I got the new build on, and nothing broke (a nice occurrence), rebuilt, and ran
my app. It worked fine in most areas, but when I tried to use one function, I got
a null reference exception. Of course, I initially thought my code was bad,
but a little debugging narrowed the problem down to an innocuous-looking function:

    private void CheckType<T>(DBObject node, List<int> list) {
        if (node is T) {
            if (node.Checked) {
                list.Add(node.ID);
            }
        }
    }

In my app, I have a treeview with different node types in it, and I need to collect
the IDs of all checked nodes of a given type into a list so I can persist it. This
function is called for each node and each type of node, and it fills in the items.

All the parameters were correct when passed in, but once inside the function,
list was nowhere to be found, and calling list.Add() caused problems. Since this code
worked before and the debugger couldn’t find list, I started to suspect a code generation
problem. Further investigation showed that even if list.Add() was never called, the
program would blow up at some future point.

I just finished a session with one of the CLR guys to try to find the root cause and
get a small repro case (small repro cases are the holy grail of tracking code generation
issues). He knew that there had been some changes in JITting generic methods when
one of the parameters was a MarshalByRef type, and we were able to create a small
project that throws an Execution Engine Error at will. That will allow us to find the problem
and get it fixed.

The moral of the story – and I’m sure if you’ve read this far you’re expecting a moral
– is that while it’s usually your code that has the problem, sometimes it’s the underlying
system that has issues, so don’t be too trusting…


Comments (4)

  1. Phil says:

    Test harnesses where you are able to reproduce the problem are a godsend. Now I create the test harness before I start coding on the actual product. I am debugging something right now in C++ that works fine on my machine in release mode, but blows up on the production server. The test harness at least helps me to get a view of what’s going on on the production box.

  2. "It would be…bad."

    "Important safety tip, thanks Egon."

    Gotta rent that again…what a great movie!

    And an interesting post…and people think Microsoft never admits to making mistakes. 😉
