What is a support incident?

I believe there is no better way to start talking about support than to start at the beginning: the support incident. There are many definitions in the vast literature on the subject, and I believe there is broad consensus about it. Generally speaking, we can define an incident as an indivisible event that causes a malfunction - or is the effect of a malfunction - in a product; in our case, a Microsoft product. For the support engineer who will help a customer address the problem behind the malfunction, the incident starts when he or she receives a notification from the customer; for the customer, the incident starts earlier - sometimes a lot earlier. Sometimes the customer calls the Microsoft support center only after his or her own customer - or the end user - has called in and reported that something was not working properly. There is therefore a gap between the time the event happened and the time our support team is called to help the customer; I will return to this in future posts.

After the support incident is reported to an engineer, this person will help to resolve it: the resolution is always a joint effort in which the customer's team and our support team in CSS work together to resolve the problem. In some situations, the engineer will provide a comprehensive action plan that will solve the problem; in others, the troubleshooting steps and documents sent to the customer will help him or her build the action plan that will resolve the problem.

In the lifecycle of a support incident, from a support engineer's standpoint, we can enumerate three phases: scoping, troubleshooting, and implementation. In the scoping phase, the engineer and the customer must start by agreeing on a definition of what needs to be fixed: what objective is to be pursued and what resolution will look like. In the troubleshooting phase, the engineer defines what data needs to be collected to serve as evidence of the root cause of the problem; then a hypothesis is formulated to address that cause. The data collection relies on many tools, such as event logs, network traces, performance logs, memory dumps, and sometimes private instrumented binaries that allow specific data collection. The implementation phase should happen only after troubleshooting has successfully identified the solution or workaround that will allow the objective to be fulfilled. Unfortunately, due to possible limitations, a solution may not always be reached.

In future posts I will try to elaborate on this subject by detailing each of these phases.

Saludos!

Deo Melgaco