What to do with a feature that only works 90% of the time?

Imagine you’re designing a feature, and there’s an operation that is very useful 90% of the time, but the other 10% of the time it is provably and innately unsafe (it crashes, deadlocks, or gives back garbage). By “innately unsafe”, I mean there’s some intrinsic quality of the feature’s requirements that makes it dangerous, and it can’t be addressed with just a bug fix. So do you kill the feature and have people riot about losing that 90% usefulness, or hand people the loaded gun with a kind readme.htm asking them not to pull the trigger?

There are lots of examples of this around allowing func-eval (FE). That’s a dangerously useful operation, but not always safe. For example:
– FE while a thread is suspended in a GC-unsafe region, and you will deadlock if the FE does a GC.
– FE a method from AppDomain 1 with arguments from AppDomain 2, and you could break AppDomain isolation (and then the whole world could explode, likely manifested as a crash in the GC).

How could something be innately unsafe?
Ideally, the error cases are all well defined, so you can draw the line perfectly and have the error checks prohibit only the error cases. But that’s not always possible. Perhaps the “safe” cases are not well defined (e.g., what exactly makes cross-AppDomain contamination bad?). Or you can argue that the 10% failure zone is overly conservative and only includes some case XYZ because the design is lame. For example, you could claim that there’s no innate reason that the GC needs threads stopped at GC-safe places, and thus it should still be able to proceed (perhaps doing a “partial” GC). But now you’re talking about a radical change, and that new design will be much more complex and may very well just have different problems.

The bottom line is that you’re often left deciding which side of the fence to err on: do you allow a useful feature that will occasionally annoy your users (crash or deadlock), or do you play it safe and eliminate the feature?

What do you do about it?
I think that depends on what your goals are. If you’re building a “bulletproof” component, then the answer is obviously to protect the subsystem and err on the side of safety. For example, user-mode apps are absolutely not supposed to be able to crash the kernel and blue screen, especially not by just giving invalid input to APIs, and so Windows fortifies the APIs and prunes unsafe operations. That’s part of the motivation for deprecating IsBadWritePtr even though realistically it would probably be “correct” most of the time.

What about ICorDebug?
In V1.0, we classified ICD as a “rocket science” API, and had very few error checks. In V2, we’ve moved it to a much more fortified position (though still nowhere close to bullet proof). This has come up in a lot of different ways with func-eval. To address the issues above:
– FE at unsafe points: We originally wanted to fail a func-eval if any thread (not just the FE thread) was at an unsafe point, but it turned out that was too restrictive. So instead we gave debuggers the ability to determine whether a thread is at a safe point, and thus let them make the policy decisions. Furthermore, it turns out a debugger can break the deadlock (by resuming the threads).
– For AppDomains, we added a lot more safety checks. We figure FE is dangerous enough that we need to err on the side of safety.
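To make the two decisions above concrete, here is a minimal sketch of the kind of checks involved. It is a toy model, not the ICorDebug API: `ThreadInfo`, `EvalRequest`, `Policy`, and `check_func_eval` are all hypothetical names invented for illustration, and the real checks are certainly more involved. It contrasts the originally planned "fail if any thread is unsafe" rule with a debugger-chosen rule that looks only at the FE thread, and it always rejects arguments that come from a different AppDomain than the method.

```python
from dataclasses import dataclass
from enum import Enum


class Policy(Enum):
    """Hypothetical policies a debugger might choose between."""
    STRICT = "strict"                  # refuse if ANY thread is at a GC-unsafe point
    EVAL_THREAD_ONLY = "eval-thread"   # refuse only if the FE thread itself is unsafe


@dataclass(frozen=True)
class ThreadInfo:
    thread_id: int
    at_gc_safe_point: bool


@dataclass(frozen=True)
class EvalRequest:
    thread_id: int            # thread that will run the func-eval
    method_app_domain: int    # AppDomain that owns the method
    arg_app_domains: tuple    # AppDomain of each argument


def check_func_eval(request, threads, policy):
    """Return (allowed, reason) for a proposed func-eval."""
    # Err on the side of safety: never mix AppDomains in one call.
    if any(d != request.method_app_domain for d in request.arg_app_domains):
        return False, "argument from a different AppDomain"

    by_id = {t.thread_id: t for t in threads}
    eval_thread = by_id.get(request.thread_id)
    if eval_thread is None:
        return False, "unknown thread"
    if not eval_thread.at_gc_safe_point:
        return False, "eval thread at GC-unsafe point"

    # The strict rule also rejects when some OTHER thread is unsafe,
    # because a GC triggered by the eval could then deadlock.
    if policy is Policy.STRICT and any(not t.at_gc_safe_point for t in threads):
        return False, "another thread at GC-unsafe point"

    return True, "ok"
```

With one safe thread and one unsafe thread, the strict policy refuses an eval on the safe thread while the eval-thread-only policy allows it. In the second case the debugger is accepting the risk, on the understanding that it can resume the other threads if the eval blocks on a GC.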
