Effective troubleshooting is an important skill not only in the information technology field but also in many other professions. Auto mechanics, electricians, counselors and doctors are essentially professional troubleshooters. They identify problems, theorize solutions from their knowledge and experience, and systematically test those solutions, whether the fix is a new alternator or a prescription for high blood pressure medication. I’m not a doctor, but being a good troubleshooter makes me feel like one. In this article, I’ll share an age-old process for troubleshooting, “RV-AIR” (recognize, verify, analyze, isolate and repair), that was taught to me as an electronics engineer in the armed forces.
The first step of effective troubleshooting is to recognize that a problem exists. Obvious, yet often ignored, problem recognition is the single most important troubleshooting step. That’s because the way in which we discover a problem affects how we fix it, and can be the difference between treating the symptoms and curing the “disease.” Problem recognition can be categorized as reactive or proactive, two ends of a continuum that runs from immature to mature capability. Reactive recognition is discovering a problem only when someone reports it. Some might call this “fire fighting.” When we are in a reactive state, we don’t know when or from where the next problem will arise, and the time we spend “putting out fires” takes us away from what we want, or need, to work on. On the other end of the spectrum is proactive recognition, in which problems are identified as soon as, or before, they occur. Using the doctor-patient analogy, proactive recognition is a preventive checkup that identifies high blood pressure and prevents a heart attack, and reactive recognition is discovering high blood pressure after having the heart attack. For software systems, proactive recognition means a health monitoring solution such as Microsoft System Center Operations Manager, but the key concepts of proactive maintenance and health monitoring transcend systems and disciplines.
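To make proactive recognition concrete, here is a minimal Python sketch of a threshold-based health check. It is an illustration only, not how any particular monitoring product works: the metric names, readings, and threshold values are all made up, and a real monitor would collect the readings from the system on a schedule.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    warn: float      # value at which we want an early warning
    critical: float  # value at which the problem is already serious


def classify(value: float, threshold: Threshold) -> str:
    """Map a single reading to 'ok', 'warning', or 'critical'."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warn:
        return "warning"
    return "ok"


def check_health(readings: dict, thresholds: dict) -> dict:
    """Return only the metrics that need attention, with their status."""
    alerts = {}
    for name, value in readings.items():
        status = classify(value, thresholds[name])
        if status != "ok":
            alerts[name] = status
    return alerts


# Hypothetical readings and thresholds, for illustration only.
readings = {"disk_used_pct": 91.0, "memory_used_pct": 62.0, "cpu_used_pct": 97.5}
thresholds = {
    "disk_used_pct": Threshold(warn=85.0, critical=95.0),
    "memory_used_pct": Threshold(warn=80.0, critical=90.0),
    "cpu_used_pct": Threshold(warn=80.0, critical=95.0),
}

alerts = check_health(readings, thresholds)
print(alerts)
```

The warning level is the “checkup” in the analogy: it flags the disk at 91 percent before it fills up, rather than waiting for someone to report an outage.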
Once a problem has been reported, either by a person or a system, the next step is to verify it. It is very difficult to troubleshoot a problem that cannot be reproduced, and it is equally frustrating to waste time troubleshooting something that was never broken in the first place. As much as we like our customers, the PEBKAC (Problem Exists Between Keyboard and Chair) factor applies. I’m not saying that users are stupid, but experience dictates that you should trust your customers and verify their problem reports. Let me illustrate this point with an example. Your newly licensed teenage daughter calls you in a blind panic because the expensive car you lent her won’t go into gear. What could be wrong? You begin theorizing, “maybe she ran over something,” “maybe when I got my fluid changed, they didn’t put the plug back in.” Then you jump to a diagnosis and a decision, “The transmission is broken, so I better call a tow truck and make an appointment at the dealership.” You have the car towed to the shop and spend hundreds of dollars to discover that nothing is wrong; your daughter simply forgot to press the brake pedal before putting it into gear! The moral of the story is that problem verification saves time, money and relationships.
After a problem has been recognized and verified, the next step is to analyze it. During this step, the symptoms of the problem are analyzed and a set of possible causes, or theories, is identified. The result is a ranked list of potential solutions that will be systematically proved or disproved in the next step. Knowledge and experience play a large role in how you arrive at your theories, and the more troubleshooting you do, the better you will get at it. Strive to identify root causes and permanent solutions. Leverage the experience of others, but thoroughly research the problem on your own before asking them. Effective troubleshooters are effective researchers: they know where to look, how to filter out noise, and how to identify useful nuggets of information. From your research and analysis, you will have a list, at least in your head, of the potential causes. Rank the causes from most to least likely and identify a solution for each.
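If it helps to get the list out of your head, the ranking described above can be sketched in a few lines of Python. This is only an illustration using the car example from earlier: the `Theory` structure is hypothetical, and the likelihood values are invented, subjective estimates rather than measured probabilities.

```python
from dataclasses import dataclass


@dataclass
class Theory:
    cause: str         # what we think is wrong
    solution: str      # what we would do about it
    likelihood: float  # subjective estimate between 0 and 1 (assumed)


def rank_theories(theories: list) -> list:
    """Order theories from most to least likely."""
    return sorted(theories, key=lambda t: t.likelihood, reverse=True)


# Illustrative values for the "car won't go into gear" report.
theories = [
    Theory("Transmission failure", "Tow the car to the dealership", 0.05),
    Theory("Brake pedal not pressed", "Press the brake, then shift", 0.70),
    Theory("Low transmission fluid", "Check and refill the fluid", 0.25),
]

ranked = rank_theories(theories)
for position, theory in enumerate(ranked, start=1):
    print(f"{position}. {theory.cause}: {theory.solution}")
```

Writing the list down this way makes it obvious that the cheapest and most likely theory should be tested before anyone calls a tow truck.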
After analyzing the problem and identifying the most likely causes and solutions, the next step is to isolate the cause through a systematic process of elimination. For each of your theories, from most to least probable, test the solution in a non-production environment (if possible) so that you don’t introduce new problems. Go slowly and apply only one change before each test. Beware of false positives, and validate that the root problem is solved and not just the symptoms. If you are unable to find a solution, then you may need to collect additional data and return to analysis. Once you find a solution, the next step is to create a plan for applying the fix.
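The process of elimination above can be sketched as a simple loop in Python. Treat this as a rough outline under stated assumptions: the `isolate` function, the `config` dictionary standing in for a test environment, and the `is_fixed` check are all hypothetical, and a real check would reproduce the original problem rather than inspect a setting.

```python
from typing import Callable, List, Optional, Tuple

# Each candidate is (name, apply_change, revert_change), already ranked
# from most to least likely.
Candidate = Tuple[str, Callable[[], None], Callable[[], None]]


def isolate(candidates: List[Candidate], is_fixed: Callable[[], bool]) -> Optional[str]:
    """Apply one change at a time and return the first one that fixes the problem."""
    for name, apply_change, revert_change in candidates:
        apply_change()
        if is_fixed():
            return name
        # Undo the change so the next test starts from the known baseline.
        revert_change()
    return None  # nothing worked: collect more data and return to analysis


# Hypothetical test environment, represented here by a plain dictionary.
config = {"timeout_s": 5, "pool_size": 2}


def set_value(key: str, value: int) -> Callable[[], None]:
    """Build a function that sets one configuration value."""
    def change() -> None:
        config[key] = value
    return change


def is_fixed() -> bool:
    # Stand-in for a real test that reproduces the reported problem.
    return config["pool_size"] >= 8


candidates = [
    ("raise timeout", set_value("timeout_s", 30), set_value("timeout_s", 5)),
    ("grow connection pool", set_value("pool_size", 8), set_value("pool_size", 2)),
]

fix = isolate(candidates, is_fixed)
print(fix, config)
```

Reverting each failed change is what keeps the tests independent: when the second change works, you know it was that change alone, not a combination of the two.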
The final troubleshooting step is to permanently repair the problem by applying the solution identified in the isolate step.
RV-AIR is a proven process for finding and fixing problems, and it is as effective for tracking down a memory leak as it is for finding a shorted electrical outlet in your house. The key to effective troubleshooting is to recognize, verify, analyze, isolate and repair. Happy troubleshooting!