Random Repro vs. Consistent Repro, and a surprise under the cover...

(Continuing on my previous post on supportability)

   Supportability Principle # 3 - When writing code, avoid random, hard-to-diagnose, unpredictable failures at all costs.

I previously mentioned that good predictability means lower costs in supporting your product, and that predictability should permeate your whole design, development and testing efforts. But I don't feel that I did a thorough job of describing the true advantages of a predictable design.

The best example that shows the power of predictability is the subtle distinction between "random repro" and "consistent repro". The difference between randomly occurring bugs and bugs that happen consistently, every time, is not something that people usually think about when writing code. But just think a little bit: how much time does it take to hunt a random bug during testing, compared with a consistently reproducing bug? During testing, you probably have to spend 10x-100x more time hunting and understanding random bugs. So this distinction seems to be a rather important point, at least from a "testing cost" point of view.

Not only that, but you get less confidence that your test suite found all the random bugs in your code - as opposed to consistent bugs, where even a single test run gives you near-100% confidence that you caught everything on the code paths it exercises.
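
One way to put numbers on this loss of confidence: suppose a random bug shows up in any given test run with some fixed probability p, independently of the other runs. That is a simplifying assumption, not something measured, and the helper name below is made up for illustration. The sketch computes the chance that at least one of n runs hits the bug:

```python
def chance_of_catching(p, runs):
    """Probability that a bug which surfaces with probability p per run
    shows up at least once in `runs` independent test runs."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    # The bug is missed only if every single run misses it.
    return 1.0 - (1.0 - p) ** runs


print(chance_of_catching(1.0, 1))               # consistent repro: 1.0
print(round(chance_of_catching(0.01, 1), 4))    # 0.01
print(round(chance_of_catching(0.01, 100), 4))  # 0.634
```

Under these assumptions, a bug that fires 1% of the time still slips past 100 test runs roughly a third of the time, while a consistent bug (p = 1) is caught by the first run that exercises it.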

And, in fact, this cost ratio increases dramatically to 1000x-100,000x when your code is in production! Think of it: your support engineer might spend weeks, if not entire months, trying to understand the conditions that surface a certain random bug! In contrast, had the bug been a "consistent repro", the customer would have been able to repro it in 5 minutes, and even send you exact repro steps in email. Instead, your support engineer is forced to spend months figuring out what's wrong. And three months divided by five minutes is around 25,000. That's the kind of cost ratio you usually see between supporting code that contains a random bug vs. a bug that reproduces consistently.
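
As a rough check of the "three months divided by five minutes" arithmetic, here is a small sketch. It assumes calendar time and 30-day months, neither of which the estimate spells out, and the function name is invented for this example:

```python
MINUTES_PER_DAY = 24 * 60
DAYS_PER_MONTH = 30  # assumption: a 30-day month of calendar time


def cost_ratio(months_for_random_bug, minutes_for_consistent_bug):
    """Ratio of time spent on a random bug to time spent on a consistent one."""
    random_minutes = months_for_random_bug * DAYS_PER_MONTH * MINUTES_PER_DAY
    return random_minutes / minutes_for_consistent_bug


print(cost_ratio(3, 5))  # 25920.0
```

Counting only working hours instead of calendar time would shrink the figure, but it would stay in the thousands.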

So - here is an interesting challenge - take any piece of code that you wrote in the past. Try to remember any random bugs you had. If you got a second chance to rewrite your code, how would you redesign it from scratch in order to avoid any type of random bug?

Maybe there is no universal advice here - and it might be a good topic for a brainstorm - what coding practices would reduce this ratio? Here are some examples that I can think of:
- Avoid an "unsafe" coding style that could cause buffer overruns. Prefer automatic memory-management techniques over manually calling "malloc/free". If you can program in VB/C#, even better. C++ is not for everyone, C even less. But if you program in C++, use STL or STL-style classes as opposed to C-style programming.
- Isolate software modules as much as possible, even if you take, say, a 10-20% perf decrease from the additional communication overhead. Yes, I know it's less performant, but a good software engineer knows how to optimize for performance without sacrificing principles like inter-module isolation and reliability.
- Avoid spaghetti code like hell - each module should maintain its own state. Avoid sharing state across many modules. Use OOP or even procedural programming to drive your flow instead.
- Suspicious or low-quality code should live in its own process. If the memory is corrupted, only that process suffers - memory corruptions will not have ripple effects on other modules.
- Also, it is best not to have long-running processes. Yes, it sounds weird, but it's logical - if you have a bug causing memory corruption, the probability of that corruption affecting your process approaches 1 over time. So short-lived processes have less chance of being affected by corruption than long-lived processes.
- As often as possible, you should check the consistency of your data structures. Don't assume that if you put the value "1" in variable "X", you will still have "1" ten minutes from now. There might be a memory corruption that affects your variable, and the sooner you detect the problem, the better.
- Avoid putting stuff in kernel mode. Move as much logic as possible into user mode, again, with the necessary tradeoff against performance, etc.
- Do not eat exceptions - use reporting mechanisms like Dr. Watson, or the ReportFault() API. Dr. Watson is your friend.
- And, finally, be extremely careful with multithreaded programming. Whenever there are two threads sharing state, you have a much higher potential for subtle memory corruptions than in normal code.
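
To make the "check the consistency of your data structures" advice a bit more concrete, here is a minimal sketch in Python. The `Inventory` class, its redundant `_total` field and the `InvariantError` type are all hypothetical names invented for this example. The idea is only that a structure carrying a little redundant information can verify itself often and fail loudly at the first inconsistency, instead of misbehaving at some random later point:

```python
class InvariantError(RuntimeError):
    """Raised when a data structure is found in an inconsistent state."""


class Inventory:
    """Hypothetical structure that keeps a redundant total for cross-checking."""

    def __init__(self):
        self._counts = {}
        self._total = 0

    def add(self, item, quantity):
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        self._counts[item] = self._counts.get(item, 0) + quantity
        self._total += quantity
        # Verify right away, so a problem is reported close to its cause.
        self.check_invariants()

    def check_invariants(self):
        # Fail immediately and loudly instead of letting bad state spread.
        actual = sum(self._counts.values())
        if actual != self._total:
            raise InvariantError(
                f"total is {self._total}, but counts sum to {actual}"
            )
        if any(q <= 0 for q in self._counts.values()):
            raise InvariantError("non-positive quantity stored")


inventory = Inventory()
inventory.add("bolt", 2)
inventory.add("nut", 3)
inventory.check_invariants()  # passes silently while the state is consistent
```

Note that the check raises instead of quietly repairing the data, in the spirit of "do not eat exceptions": the goal is to turn a would-be random failure into one that shows up at the same place every time.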

Do you have any other advice? Let me know in the comments...

[update - minor fixes]