A critical junction in support issues: Root Cause vs. Relief

In the lifetime of a case, particularly a high-impact one, there comes a point where a decision is needed about the direction of the case. That decision shapes how the situation is approached going forward. The decision that must be made is:

 

“Do I want Root Cause or Relief?”

 

This decision is important because each choice involves trade-offs.

 

You might ask “What is the difference?” In the case of relief, the goal is to restore service as quickly as possible: determine what is failing and prevent the failure from occurring. In the case of root cause, effort is put into understanding the full sequence of events and conditions leading to the failure in order to pinpoint the specific action that caused it. Then action is taken to address the real reason for the failure, not just its symptoms.

 

So here is a clichéd analogy (BTW, this is a fictitious story):

 

Every 2-3 months my car’s voltage regulator goes bad. An average mechanic will simply locate the failed part and replace it. Every 2-3 months the process is repeated. A better mechanic will realize after the second time that something must be causing the voltage regulator to go bad. After diagnosing the system, she discovers the alternator is not producing enough voltage to satisfy the needs of the vehicle. The alternator is not performing within spec because the coils are starting to corrode, so she replaces the alternator.

 

When it comes to computers, there is an additional challenge to Root Cause Analysis (RCA) vs. relief. The process of providing relief (replacing the voltage regulator) in most cases destroys the information necessary (the alternator) to perform root cause analysis. Relief may be to reboot the computer, but once the system is available again there may not be enough information in the various logs to determine what happened. Why not? Well, that is the never-ending dilemma of performance versus supportability. There are strong arguments on both sides of that debate that I don’t want to cover here. RCA is a labor- and time-intensive process. Collecting information and examining the system in a failed state requires additional time, which isn’t acceptable to companies operating under Service Level Agreements (SLAs). In the majority of cases it takes multiple occurrences of the problem in the customer’s environment to allow for RCA to be performed.
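
To make the "relief destroys the evidence" problem concrete, here is a minimal Python sketch of one way to soften it: grab whatever state you can, write it to disk, and only then apply relief. This is an illustration under assumptions, not a description of any particular support tool. The names capture_state and relieve_with_evidence, the collectors dictionary, and the rca-snapshots directory are all hypothetical, and which collectors to run (process lists, log tails, memory dumps) depends entirely on the system in question.

```python
import json
import platform
import time
from pathlib import Path


def capture_state(collectors, out_dir):
    """Run each collector and save the results before any relief action.

    `collectors` maps a label to a zero-argument function that returns
    something JSON-serializable (hypothetical examples: a process list,
    the tail of a log file, current connection counts).
    """
    snapshot = {
        "captured_at": time.time(),
        "host": platform.node(),
        "data": {},
    }
    for name, collect in collectors.items():
        try:
            snapshot["data"][name] = collect()
        except Exception as exc:
            # A failing collector must never block relief; record and move on.
            snapshot["data"][name] = f"collection failed: {exc}"

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"snapshot-{int(snapshot['captured_at'])}.json"
    path.write_text(json.dumps(snapshot, indent=2, default=str))
    return path


def relieve_with_evidence(collectors, relief_action, out_dir="rca-snapshots"):
    """Capture state first, then apply relief (reboot, kill, restore, ...)."""
    path = capture_state(collectors, out_dir)
    relief_action()
    return path
```

Every collector added here lengthens the outage, so the list has to be short enough to fit inside whatever the SLA allows. That is the same trade-off in miniature: the more you capture, the later relief arrives.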

 

It is a common practice of mine to be very clear with customers when we arrive at that critical junction. This typically comes when state information could be lost as a result of the actions taken to provide relief (rebooting, killing a process, restoring, etc.). So can you have both Root Cause and Relief? I have to acknowledge the possibility; however, in my experience, getting both from the same steps happens less than about 15% of the time.

 

I am curious to hear how Root Cause vs. Relief is perceived and valued by your company when dealing with issues.