Helping customers troubleshoot production issues


In my previous role I was an Escalation Engineer in the Windows Platforms team (a big shout out to the fine folks at the NTDebug blog). I don’t get to debug that deep any more, but I do work with my customers' operations teams to troubleshoot production issues. We often end up opening support cases so that Microsoft Premier Support Engineers can work with the customer and figure out the root cause.

Once a case is opened and Support Engineers are engaged, we are usually able to determine the root cause quickly and provide a resolution. But I find that customers usually spend a lot of time before reaching out to Microsoft or opening a support case. Think about the last time you had an unexpected issue or failure. How did you troubleshoot the issue, or narrow down the area of investigation? Most customers end up researching online for possible solutions, upgrading random or suspected drivers, praying to the IT gods, etc. They cross their fingers and hope that one of the random changes solves their problem. We used to refer to this technique as Shotgun troubleshooting! (A shotgun cartridge contains many pellets, and when fired they spread out. Hopefully one of the pellets hits the target.) Sometimes this technique does resolve the customer's issue, but you have no idea which change resolved it. Sometimes the problem just gets masked by the changes, and the underlying issue reappears at an even more critical moment.

I recently had a customer whose Windows service would hang at random times during the day. This was impacting their test teams, who were never able to complete their testing because the service kept becoming unresponsive.
They contacted me 2 weeks ago for advice. It turns out they had been running into this problem once or twice for the last 3 months. During that time, they had moved the guest VM to a different host, changed the SAN that the VM resided on, upgraded drivers, installed the latest service packs, changed the antivirus solution, etc. However, nothing worked.

Rather than continuing to change more variables in an already complicated environment, I spent some time with their engineering team to break down the issue a bit so that we could focus the investigation. We started with a quick lesson in Computer Engineering 101.

  1. Each program or Windows service runs inside a Process. A Process is a unit of isolation.
  2. A Process is a collection of Threads, Code, OS data structures, etc.
  3. All code runs on a particular Thread. Each Thread runs on a CPU. Only one Thread can run on a CPU at a given time. If multiple Threads are in a Ready state, they are scheduled on the CPU one after another. Each Thread runs for a finite amount of time before it yields to the next Thread waiting for time on the CPU.
  4. Every Process has a starting point defined in Code. When the OS starts a Process, the starting point is loaded into Thread 0. At that point, Thread 0 is scheduled to run on a CPU and execution begins shortly after. Usually Thread 0 spawns other worker Threads, and then itself begins to listen for messages from the OS, user, other processes, etc.
  5. Threads usually allocate memory in RAM to store data. Threads can also read from and write to Disk for persistent storage. Lastly, Threads also talk to other computers over the network.
  6. Each Thread has 2 portions of Code. The User mode portion is code written by developers (including my customer). The other portion is called Kernel mode. This is Code inside the OS doing work on behalf of the User mode portion.
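
To make the list above a bit more concrete, here is a minimal Python sketch of a process whose main thread spawns a few worker threads and then hands them messages. This is only an illustration, not the customer's service: the names `worker` and `main`, the three messages, and the use of a queue are all assumptions made for the example.

```python
import queue
import threading


def worker(name, inbox, results):
    """Worker thread: handle messages until a None sentinel arrives."""
    while True:
        message = inbox.get()   # blocks without using CPU until a message arrives
        if message is None:     # sentinel: the main thread asked us to stop
            break
        results.append((name, message.upper()))


def main():
    inbox = queue.Queue()
    results = []                # list.append is safe to call from several threads in CPython

    # The main thread (the "Thread 0" of the list above) spawns the workers.
    workers = [
        threading.Thread(target=worker, args=(f"worker-{i}", inbox, results))
        for i in range(3)
    ]
    for t in workers:
        t.start()

    # Hand out some work, then one stop sentinel per worker.
    for message in ["start", "status", "stop"]:
        inbox.put(message)
    for _ in workers:
        inbox.put(None)

    # Wait for every worker to finish before the process exits.
    for t in workers:
        t.join()
    return sorted(handled for _, handled in results)


if __name__ == "__main__":
    print(main())
```

Which worker picks up which message depends on how the OS schedules the threads, which is why the sketch sorts the results before returning them.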

Based on the above, we talked about the 4 big bottlenecks that a process or thread can run into: CPU, RAM, Disk, and Network. Armed with this knowledge, we started the process of elimination to understand what was going on. Users perceive a process or computer as hung when it does not respond to their requests. I usually classify a Hang as either a Low or High CPU hang:

  1. High CPU Hang - The CPU is consumed at 100%. Threads can't get time on the CPU to respond to a user request in time, and therefore the process seems unresponsive. High CPU hangs are easier to solve since you can see what is using up all the CPU time, and therefore get to the root cause easily.
  2. Low CPU Hang - The process is not CPU bound and is probably waiting on some other resource. It could be waiting on RAM, disk IO requests, or maybe even a request to another computer over the network. Sometimes threads can also block each other when they try to acquire the same resource exclusively. Since only one thread can hold the resource exclusively, the other threads end up waiting until the first one releases it.
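
The difference between the two kinds of hang can be sketched in a few lines of Python. The example below is hypothetical: `classify_hang`, `measure_blocked_thread`, and the 0.8 threshold are names and values invented for illustration, not part of any Windows tool. It makes one thread wait for a lock that another thread holds, then compares the CPU time used with the wall-clock time that passed. A waiting thread burns almost no CPU, which is the signature of a low CPU hang.

```python
import threading
import time


def classify_hang(cpu_seconds, wall_seconds, busy_ratio=0.8):
    """Label an unresponsive interval as a high or low CPU hang.

    busy_ratio is an illustrative threshold, not a standard value.
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    ratio = cpu_seconds / wall_seconds
    return "high CPU hang" if ratio >= busy_ratio else "low CPU hang"


def measure_blocked_thread(hold_seconds=0.2):
    """Make one thread wait on a lock held by another and time the wait."""
    resource = threading.Lock()
    resource.acquire()                  # the first owner takes the resource exclusively
    timings = {}

    def waiter():
        wall_start = time.perf_counter()
        cpu_start = time.process_time()  # CPU time used by the whole process
        with resource:                   # blocks here until the owner releases
            pass
        timings["cpu"] = time.process_time() - cpu_start
        timings["wall"] = time.perf_counter() - wall_start

    t = threading.Thread(target=waiter)
    t.start()
    time.sleep(hold_seconds)            # the owner holds the resource for a while
    resource.release()
    t.join()
    return timings


if __name__ == "__main__":
    timings = measure_blocked_thread()
    print(classify_hang(timings["cpu"], timings["wall"]))
```

From the user's point of view the waiting thread is hung for the whole interval, yet the CPU is nearly idle, so looking at CPU usage alone would not reveal the problem.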

You can use a tool like Sysinternals Process Monitor (Procmon) or the built-in Resource Monitor to see what is going on.

 

Based on the information from Resource Monitor, you can see how much of CPU, RAM, Disk, and Network resources are being utilized on the system. I usually check resources in the following order:

  1. CPU
  2. RAM
  3. Disk
  4. Network

If any particular resource is consumed at 100%, then more than likely other processes and applications on the server are going to be affected by resource contention.
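
As a rough sketch of this checklist, the Python snippet below walks the four resources in the same order and returns the first one that looks saturated. Everything in it is an assumption made for illustration: the function name `first_saturated`, the 90% threshold, and the sample numbers, which are made up and would in practice be read off Resource Monitor or performance counters.

```python
# Order in which the resources are checked, as in the list above.
CHECK_ORDER = ("CPU", "RAM", "Disk", "Network")


def first_saturated(utilization, threshold=90.0):
    """Return the first resource in CHECK_ORDER at or above threshold, else None.

    utilization maps a resource name to a percentage (0-100).
    threshold is an illustrative cut-off, not an official value.
    """
    for resource in CHECK_ORDER:
        if utilization.get(resource, 0.0) >= threshold:
            return resource
    return None


if __name__ == "__main__":
    # Hypothetical readings, in percent.
    sample = {"CPU": 35.0, "RAM": 40.0, "Disk": 98.0, "Network": 5.0}
    print(first_saturated(sample))  # Disk
```

A saturated resource is only a starting point: the next step is to find out which process is consuming it.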

On the customer's server, I didn’t see the CPU pegged at 100%, and they had plenty of free RAM. I started looking at their disk, and I immediately noticed that it was pegged at almost 100%. More than likely this was a contributing factor to the customer's application hanging. Using Resource Monitor, we were able to see a particular process scanning the hard disk. It turned out to be an antivirus process that was running a scheduled scan and consuming all the Disk IO. Once we stopped the process, the customer's application started responding immediately.

Digging a bit deeper using the Windows Debugger (WinDbg), I was able to determine that the antivirus process was opening large batches of files and locking them so that no one else could access them while they were being scanned. We reached out to the AV vendor, and it turned out to be a known issue for which they had a fix. Once the fix was implemented, the problem went away.

Hopefully this post gives you some ideas on how to go about troubleshooting issues you might be experiencing in your environment. Spend a few minutes thinking about the problem and poke around the system to get a better understanding of its current state. The tips above may not help you solve every problem, but I hope that they provide a framework for you to start investigating the problem and at least narrow the focus of the investigation.
Beyond this, you can start using tools like the Sysinternals Suite (especially ProcMon and ProcExp), or dig even deeper using the Windows Debugger. The NTDebug blog is a great place to learn more about these kinds of troubleshooting techniques.
Happy hunting, and if you do need help, feel free to reach out to Microsoft Premier Support through your Technical Account Manager.

