A Critical Junction in Support Issues: Root Cause vs. Relief

In the lifetime of a case, particularly one of high impact, a point is reached where a decision is necessary regarding the direction of the case. This decision impacts how the situation is approached going forward. The decision that must be made is:


“Do I want Root Cause or Relief?”


This decision is important because there are trade-offs that have to be made.


You might ask “What is the difference?” In the case of relief, the goal is to restore service as quickly as possible: determine what is failing and stop the failure from occurring. In the case of root cause, effort is put forth to understand the full sequence of events and conditions leading to the failure in order to pinpoint the specific action that caused it. Then action is taken to address the real reason for the failure, not just its symptoms.


So here is a clichéd analogy (BTW, this is a fictitious story):


Every 2-3 months my car’s voltage regulator goes bad. An average mechanic will simply locate the failed part and replace it. Every 2-3 months the process is repeated. A better mechanic will realize after the second time that something must be causing the voltage regulator to go bad. After diagnosing the system, she discovers the alternator is not producing enough voltage to satisfy the needs of the vehicle. The alternator is not performing within spec because the coils are starting to corrode, so she replaces the alternator.


When it comes to computers, there is an additional challenge to Root Cause Analysis (RCA) vs. relief. The process of providing relief (replacing the voltage regulator) in most cases destroys the information necessary (the alternator) to perform root cause analysis. Relief may be to reboot the computer, but once the system is available again there may not be enough information in the various logs to determine what happened. Why not? Well, that is the never-ending dilemma of performance versus supportability. There are strong arguments on both sides of that debate that I don’t want to cover here. RCA is a labor- and time-intensive process: collecting information and examining the system in a failed state requires additional time, which isn’t acceptable to companies operating under Service Level Agreements (SLAs). In the majority of cases it takes multiple occurrences of the problem in the customer’s environment to allow for RCA to be performed.


It is a common practice of mine to be very clear with customers when we arrive at that critical junction. This typically comes when state information could be lost as a result of the actions taken to provide relief (rebooting, killing a process, restoring, etc.). So can you have both Root Cause and Relief? I have to acknowledge the possibility; however, in my experience, the same steps provide both less than roughly 15% of the time.
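
One way to soften the trade-off is to grab whatever state you can in the moments before the relief step. The following is only a minimal sketch of that idea in Python, not a description of any real support tool: the function name `capture_snapshot`, the 200-line tail, and the JSON layout are all assumptions made for illustration. A real RCA usually needs far more than this (a memory dump, for example), so treat it as a stand-in for "save something before you reboot."

```python
import json
import os
import platform
import time


def capture_snapshot(out_dir, log_paths=()):
    """Save a small bundle of state before a relief step such as a reboot.

    Hypothetical helper for illustration only; names and layout are assumptions.
    """
    os.makedirs(out_dir, exist_ok=True)
    snapshot = {
        "captured_at": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
        "host": platform.node(),
        "os": platform.platform(),
        "logs": {},
    }
    # Keep the tail of each log the caller names, since logs may rotate
    # or be overwritten once the system restarts.
    for path in log_paths:
        try:
            with open(path, "r", errors="replace") as fh:
                snapshot["logs"][path] = fh.readlines()[-200:]
        except OSError as exc:
            # Record the failure instead of aborting; partial evidence
            # is better than none when the clock is running.
            snapshot["logs"][path] = f"unreadable: {exc}"
    out_path = os.path.join(out_dir, f"snapshot-{int(time.time())}.json")
    with open(out_path, "w") as fh:
        json.dump(snapshot, fh, indent=2)
    return out_path
```

Calling `capture_snapshot("evidence", ["app.log"])` just before the reboot writes one JSON file that can be examined after service is restored. It does not remove the junction, but it can leave a little more to work with afterward.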


I am curious to hear back how Root Cause vs. Relief is perceived and valued by your company when dealing with issues.

Comments (10)

  1. The real question is how do you satisfy a customer that wants both RCA and relief? More often than not we don’t have the luxury of performing an RCA because of SLAs, but we are still required to provide one even though we don’t know what the cause really was.

  2. Steve says:

    But at what point is it a loss for Microsoft? If you do not fully investigate the root cause and just provide relief, Microsoft has lost the chance to further its ability to root out the causes of software issues. For example, right now I have been working with MS on a case for over 43 hours (phone hours). MS has a relief answer, but I am determined to get the RCA because I want to further our database of troubleshooting and save us the call in the future. After all, if I experience the issue once, chances are good that it will come back, and I need to save my company the $245 support call. On the other hand, I have had issues where I just want relief and no RCA. So it is a 50/50 issue, I guess.

  3. Jeremy says:

    Steve, you are absolutely right. Microsoft does consistently lose out on performing RCA on issues. It has to do with the decisions the customer makes during the incident about pursuing RCA or relief. We have to honor what the customer wants to do, and we can only be diligent about communicating when we believe the junction has been reached, where further steps would jeopardize the ability to provide RCA.

    Unfortunately, with computers, you typically lose evidence of the problem in the steps of providing relief… catch-22…

  4. Bill Wilson says:

    This problem is not confined to software. Sometimes, to create a safe state, you have to take actions that you know will destroy evidence. This is frequently the case in high-hazard industrial settings. Other times, the pressure to simply get things cleaned up and back into production can be very intense.

    In my case (power production and heavy construction), there’s generally somebody on-site at all times that can get in, collect a bunch of evidence, and get out quickly. We have procedures for incident response that cover this. Your case is different — the problem is not at your site, you can’t get in to collect evidence quickly, and the customer wants both RCA and relief, right now.

    That’s a tough situation. I think you’re probably doing the right thing… make it very clear to the customer that immediate relief will seriously hinder any RCA effort, and without RCA, the problem could come back. Then you have to leave the choice to them.

  5. For me it depends in large part on two factors.

    1. The level of business impact the issue is causing. If it is a critical system, root cause (while important) generally has to take a backseat to relief. If the problem has happened before, the balance shifts somewhat I suppose.

    2. The confidence level of the engineer (and of me in the engineer’s ability) that the steps that lose the RCA data will actually provide relief.

  6. Chris says:

    It seems to be a catch-22 for either side. Businesses cannot afford the downtime and don’t care to investigate the root cause. On the other hand, IT typically has to be able to identify the root cause to properly fix the problem.

    Event logs and performance logs can typically lead you to the precise cause of failure. But in some cases, it may take an additional call to MS. It’s important to relay everything that has happened to the support engineer in order to get a solid RCA.

    All you’re really doing when contacting MS is bringing in an additional resource for assistance. All they know is what you tell them. The same goes for any outsourced consultant.

    Although uptime is the most important thing, identifying the root cause has to be addressed at all times to fully support and update any SLA.

  7. David P says:

    This is a question that rarely comes up for your typical firefighter: relief (put out the fire!) goes before root cause analysis (it was a pan of grease on the stove). Much easier prioritization decision!

    It’s a balance, just like the balance you strike when deciding whether to take all your servers down right now to apply the latest series of security patches. Is the risk of a self-inflicted denial of service higher than the risk of lost sales while I reboot? If so, you might want to consider patching over the weekend. Is the risk of an exploit and associated losses higher than the risk of possible lost sales? Then bring ’em down, cowboy.

    The balance in this case is based on, amongst other things:

    A- how much the analysis time is costing in lost productivity or business opportunity while the problem is ongoing, vs. how quickly I could apply relief and reduce that cost

    B- how much longer it will take to perform further analysis, before I can determine root cause

    C- how soon the problem will reappear if I just apply relief. If I reboot tonight and the problem goes away for two years, I’ll be much more willing to select "relief" than a scenario where I’ll be called back in an hour.

    D- the level of confidence I have in my answers to the above questions. Also, the level of confidence I have that the Relief I will be applying will actually work.

    E- the value of the data lost when I apply relief. If logs are critical to resolving the root cause, and I will lose them completely by applying relief, I’m less willing to do so. If I can apply relief and still work on RCA, that’s more palatable.

    F- SLAs. If my company gets financial penalties on downtime per incident, my incentive to search for root cause may be diminished. If I get penalties on cumulative downtime, I may want to resolve the problem for good, in which case I want root cause.

    G- closely related to the above: how loudly is the client shouting in my ear?

    H- how long I’ve been working on the problem with no forward movement. I’m more willing to provide relief and throw in the towel on RCA if I’ve been trying to fix the problem for 48 hours and I don’t feel I’ve made progress. If I feel the solution is "just around the corner" I’m more willing to continue analysis.

    I- how much sleep I’ve had in the past 48 hours, and whether the coffee in the breakroom is any good.

    The last one is not completely in jest. Root Cause analysis often requires a sharp, focused, alert! mind that is in tune with the environment and can detect minute anomalies or variations from the norm. Rebooting takes one binary brain cell and an index finger.

    All of the factors above are balanced in the equation:

    X = lim_{H -> F} ((A / I^2) * (B - G/D) + C)

    Plus a constant, of course. As X trends to 1, I’ll be more willing to just go to Starbucks rather than drink any more of that overwarmed pot sludge.