Random Repro vs. Consistent Repro, and a surprise under the cover…

(Continuing from my previous post on supportability)

   Supportability Principle # 3 – When writing code, avoid random, hard-to-diagnose, unpredictable failures at all costs.

I previously mentioned that good predictability means lower costs in supporting your product, and that predictability should permeate your whole design, development and testing effort. But I don’t feel that I did a thorough job of describing the true advantages of a predictable design.

The best example of the power of predictability is the subtle distinction between “random repro” and “consistent repro”. The difference between randomly occurring bugs and bugs that happen consistently, every time, is not something that people usually think about when writing code. But think about it for a moment: how much time does it take to hunt down a random bug during testing, compared with a consistently reproducing bug? During testing, you probably have to spend 10x-100x more time hunting and understanding random bugs. So this distinction is a rather important one, at least from a “testing cost” point of view.

Not only that, but you also have less confidence that your test suite found all the random bugs in your code – as opposed to consistent bugs, where a single test run surfaces every one that lies on a code path your tests exercise.

And, in fact, this cost ratio increases dramatically, to 1000x-100,000x, when your code is in production! Think of it: your support engineer might spend weeks, if not entire months, trying to understand the conditions that surface a certain random bug! In contrast, had the bug been a “consistent repro”, the customer would have been able to repro it in 5 minutes, and even send you exact repro steps by email. Instead, your support engineer is forced to spend months figuring out what’s wrong. And three months of calendar time (roughly 130,000 minutes) divided by five minutes is around 25,000. That’s usually the cost ratio between supporting code that contains a random bug vs. a bug that reproduces consistently.

So – here is an interesting challenge – take any piece of code that you wrote in the past. Try to remember any random bugs you had. If you got a second chance to rewrite your code, how would you redesign it from scratch in order to avoid every type of random bug?

Maybe there is no universal advice here – and it might be a good topic for a brainstorm – what coding practices would reduce this ratio? Here are some examples that I can think of:
– Avoid an “unsafe” coding style that could cause buffer overruns. Prefer automatic memory-management techniques to manually calling “malloc/free”. If you can program in VB/C#, even better. C++ is not for everyone, C even less. But if you program in C++, use STL or STL-style classes as opposed to C-style programming.
– Isolate software modules as much as possible, even if you take, say, a 10-20% perf decrease from the additional communication overhead. Yes, I know it’s less performant, but a good software engineer knows how to optimize for performance without sacrificing principles like inter-module isolation and reliability.
– Avoid spaghetti-code like hell – each module should maintain its own state. Avoid sharing state across many modules. Use OOP or even procedural programming to drive your flow instead.
– Suspicious or low-quality code should live in its own process. If the memory is corrupted, only that process suffers – memory corruption will not have ripple effects on other modules.
– Also, it is best not to have long-running processes. Yes, it sounds weird, but it’s logical – if you have a bug causing memory corruption, the probability of that corruption affecting your process approaches 1 over time. So short-lived processes have fewer chances to be affected by corruption than long-lived processes.
– As often as possible, you should check the consistency of your data structures. Don’t assume that if you put the value “1” in variable “X”, you will still have “1” ten minutes from now. There might be a memory corruption that affects your variable, and the sooner you detect the problem, the better.
– Avoid putting stuff in kernel mode. Move as much logic as possible into user mode, again, with the necessary tradeoff against performance, etc.
– Do not eat exceptions – use reporting mechanisms like Dr. Watson, or the ReportFault( ) API. Dr. Watson is your friend.
– And, finally, be extremely careful with multithreaded programming. Whenever two threads share state, you have a much higher potential for subtle memory corruption than in normal code.
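
To make a couple of these points concrete, here is a minimal C++ sketch of the “use STL-style classes” and “check the consistency of your data structures” advice. The Ledger class and all of its member names are hypothetical, invented for this illustration only. It assumes a module that keeps a list of amounts plus a cached total, and it shows one possible way to turn a would-be random failure into a consistent one; it is not the only way to do it.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <vector>

// Hypothetical example type, invented for illustration.
class Ledger {
public:
    // std::vector owns the memory, so there is no malloc/free pair to get wrong.
    void add(long amount) {
        entries_.push_back(amount);
        total_ += amount;
        assert(consistent());  // debug-build check right after every update
    }

    // at() is bounds-checked: a bad index throws std::out_of_range on every run,
    // whereas entries_[index] would be undefined behaviour that may or may not crash.
    long entry(std::size_t index) const { return entries_.at(index); }

    // Verify before handing out the cached value, so bad data is caught early.
    long total() const {
        verify();
        return total_;
    }

    // Consistency check: the cached total must equal the sum of the entries.
    bool consistent() const {
        return std::accumulate(entries_.begin(), entries_.end(), 0L) == total_;
    }

    // Fail immediately and visibly instead of carrying corrupted state forward.
    void verify() const {
        if (!consistent()) {
            throw std::logic_error("Ledger: cached total does not match entries");
        }
    }

private:
    std::vector<long> entries_;
    long total_ = 0;
};
```

With this shape, calling entry() with an index past the end fails the same way every time, and a total that no longer matches its entries is reported at the next call to total() rather than ten minutes later in some unrelated module. The recomputed sum costs time on every check, which is the same kind of performance-for-predictability trade mentioned in the list above.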

Do you have any other advice? Let me know in the comments area…

[update – minor fixes]

Comments (3)

  1. Oh Well says:

    Do not reference count manually. Please use smart pointers or something. But never ref count manually. Always run the tests in checked (chk) and verifier environments. For every potentially memory-corrupting statement you write, think about how it could corrupt memory and why it wouldn’t, and add asserts for that case; if asserts are not possible, add comments. I think using VirtualPC might be useful to catch some data breakpoint type or disk I/O type peeking debugging but I haven’t tried it or seen anyone use it.

  2. AdiOltean says:

    Yes – I completely agree about the manual ref count…

  3. haseeb qadir says:

    Do not eat exceptions – use reporting mechanisms like Dr. Watson, or the ReportFault( ) API. Dr. Watson is your friend.

    On Windows XP, Server 2003 and Vista it’s best not to do anything at all. Let the OS’s error reporting mechanism do its magic and report the failure to Microsoft. You don’t even need to call the ReportFault API.

    Once an unexpected exception occurs your program is in an unknown state (heap might be corrupted, variables might be in an unexpected state). Simple things like a malloc call might fail or, worse, cause another exception in your exception filter. Trying to execute code after the fact is generally not a good idea.