Hi again! Today I want to introduce an upcoming series of posts on troubleshooting hangs. This post is a primer on understanding hangs and on how we scope these scenarios.
Scoping is a practice we use in troubleshooting that helps us to quickly narrow down the domain or scope of a problem from the entire operating system or enterprise to a specific computer and component. This allows the elimination of millions of other possible problems or interactions.
Hangs are a common and sometimes lengthy type of support request because of the very nature of the problem; just describing it can be difficult. “Okay, what do you mean, it’s hung?” By nature I mean that some knowledge of the internal architecture is necessary to discover which component of the application or OS is not doing what it should, leaving us with an unresponsive user interface, an unresponsive service, or both. So how do we isolate what is going on here?
We will cover the main buckets or symptoms, listed below in order of increasing depth or dependency into the OS; in other words, moving from the Application Layer down into the OS. But let’s scope first…the most important step!
Scoping the Hang
We can determine which bucket or symptom we are running into by testing successively higher layers of the operating system, much as one would walk up the OSI stack. In other words, which layers of the system are working and which are not? The heart of this is to determine “What IS working properly and what IS NOT?”
The following table outlines the layers and tools we usually use to determine their responsiveness.
Functional Layer to Test | Tools to Test
Basic hardware + network driver + bottom of the network stack | Does ping work? Does the Num Lock light on the keyboard toggle?
SMB over TCP/IP + kernel (as the Server service runs in the System process) | Does net view work?
RPC over TCP/IP |
For example, if a machine is reported “hung” and we can ping it, but net view does not work (when it normally would), we should conclude that the server side of that request failure is most likely in the Server service or one of its subcomponents. In that case it would not make sense to troubleshoot why myapplication.exe is hung on the same server when lower-level components like the Server service itself, which may be a direct dependency, do not work!
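The layered check described above can be sketched as a short script. This is only an illustration under stated assumptions, not the tooling used in the support process: the function names (`ping`, `tcp_port_open`, `scope_layers`) are hypothetical, and a plain TCP connect to port 445 (SMB) or port 135 (the RPC endpoint mapper) is a weaker signal than a working net view, because the connection can be accepted even when the service behind it is unresponsive.

```python
import platform
import socket
import subprocess


def ping(host, timeout_s=5):
    """Lowest layer: basic hardware, network driver, bottom of the network stack."""
    # Windows ping takes -n for the echo count; most other systems take -c.
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # The exit code is treated as a rough signal only.
    return result.returncode == 0


def tcp_port_open(host, port, timeout_s=2):
    """Return True if a TCP connection to host:port is accepted."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


# Probes ordered from the lowest layer to the highest (assumed ordering).
DEFAULT_PROBES = [
    ("hardware + network stack (ping)", ping),
    ("SMB over TCP/IP (port 445)", lambda host: tcp_port_open(host, 445)),
    ("RPC over TCP/IP (port 135)", lambda host: tcp_port_open(host, 135)),
]


def scope_layers(host, probes=DEFAULT_PROBES):
    """Run every probe, lowest layer first.

    Returns (results, first_failure): a dict of layer name -> bool, and the
    name of the lowest layer that failed (None if all layers responded).
    """
    results = {}
    first_failure = None
    for name, probe in probes:
        ok = probe(host)
        results[name] = ok
        if not ok and first_failure is None:
            first_failure = name
    return results, first_failure
```

Calling `scope_layers("server01")` (a placeholder host name) runs the real probes; the lowest failing layer is where troubleshooting should start, since everything above it may simply be failing as a consequence.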
Tip: This is a scoping method we use in isolating all problems. Look at the interaction of applications, services, the OS, drivers, etc. in light of their dependencies. “Okay, A is failing not because of A but because B failed, because C failed, and aha here is root cause in D’s failure”. Testing dependencies can yield considerable time savings vs. debugging “through” the application. As another example, if an RPC-dependent application stops working, testing RPC with another RPC application might be the first thing to do vs. debugging the first application, which could be very time consuming and require specific knowledge about that application.
Here are some common scoping questions that help us think about the context of the issue and can also isolate the problem quickly.
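To make the “A failed because B failed, because C failed, root cause in D” reasoning concrete, here is a minimal sketch of walking a dependency map down to the deepest failing components. The dependency map and the health check are assumptions supplied by the caller; the function name `find_root_causes` is hypothetical and not part of any Windows tool.

```python
def find_root_causes(component, depends_on, is_healthy, _memo=None):
    """Return the deepest failing components reachable from `component`.

    depends_on: dict mapping a component name to the names it depends on.
    is_healthy: callable returning True if a component works on its own test.
    A failing component counts as a root cause only when none of its own
    dependencies are failing.
    """
    if _memo is None:
        _memo = {}
    if component in _memo:
        return _memo[component]
    # Placeholder entry guards against cycles in the dependency map.
    _memo[component] = []
    if is_healthy(component):
        return []
    roots = []
    for dep in depends_on.get(component, []):
        for root in find_root_causes(dep, depends_on, is_healthy, _memo):
            if root not in roots:
                roots.append(root)
    # No failing dependency found: this component is itself the root cause.
    _memo[component] = roots or [component]
    return _memo[component]
```

For the chain in the tip, with `depends_on = {"A": ["B"], "B": ["C"], "C": ["D"]}` and all four components failing their tests, the function returns `["D"]`, which is where debugging should start rather than in A.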
- What is the smallest action we must take to recover from the hang?
- How often does it occur?
- Does it occur on a cycle? If so, what cycle?
- What time did the last occurrence take place?
- Was it under load at that time?
- What was that load?
- Does it occur at a particular time of day?
- How long does the hang last?
- Can we make it hang?
- What else happened just before the hang?
- What changed?
- When did it start?
- Are there relevant or timely errors in the Application or System logs?
- Is the observation from the console or a remote (RDP or ICA) session?
- Does the machine still hang if we disconnect it from the network?
- Does Task Manager show a particular process taking up CPU?
- Does the hang occur in Safe Mode or Safe Mode with Networking?
Answering these simple questions may have obvious yet extremely helpful results.
For example, if the machine is reported as hung and the observation was made only through a Remote Desktop (RDP) session, is it responsive at the console? If it is responsive at the console, we must conclude that only the Terminal Server service layer or one of its unique dependencies (lower in the stack) is the problem vs. the entire server. We can then jump to Terminal Services specific troubleshooting, etc.
Common Hang Buckets or Symptoms
Using the above scoping usually leads to these main classes of hangs which we will cover in future posts:
The application looks “OK” in that it will repaint if we drag another window over it; however, if we click on a menu item or send a keystroke, the functionality associated with it does not…function.
2.) Application Window Hang “I’m not dead yet…just Not Responding”
The application stops responding entirely at the UI layer, meaning it no longer refreshes; dragging another window over the top does not cause a repaint, so artifacts of the other windows are left behind.
3.) The Start Menu, Desktop, or the “Shell” is hung
So here we know that the Microsoft process responsible for these windows, explorer.exe, is hung.
4.) All Windows are Hung…but Task Manager comes up eventually
Here the mouse still moves, and we can invoke Task Manager by pressing Ctrl+Shift+Esc or via Ctrl+Alt+Del. This may not be a true hang, but the machine is slow or unresponsive enough to qualify as, or be reported as, a hang!
5.) There are No Windows!
I can move the mouse and press keys, but they don’t “do” anything; there’s just a blank desktop with no windows. It’s hung.
In this case it may be that the server appears hung interactively while specialized services like file sharing, mail server, etc. still actually function…impending doom?
6.) No Windows + No Mouse/Keyboard + but the machine is still running, well, sort of…
This is obviously the most drastic of the symptoms, leaving little recourse but a debug of the machine…which might be easier than it sounds!
The server may or may not be responsive remotely via services, etc.
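As a rough illustration, the buckets above can be read as a decision sequence that checks the most severe symptom first. The sketch below is one interpretation of the descriptions in this post, not an official classification; the `Observation` fields and the `classify_hang` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """What the person at the machine reports (all fields are assumptions)."""
    input_devices_respond: bool = True  # mouse pointer moves, keys register
    windows_visible: bool = True        # at least one window is on screen
    all_windows_hung: bool = False      # every window is unresponsive
    task_manager_opens: bool = True     # Task Manager comes up, even if slowly
    shell_hung: bool = False            # Start Menu / desktop unresponsive
    app_repaints: bool = True           # the application window still repaints
    app_handles_input: bool = True      # menus and keystrokes still work


def classify_hang(obs):
    """Return the bucket number (1-6) from the list above, or None."""
    if not obs.input_devices_respond:
        return 6  # no windows, no mouse/keyboard, machine barely running
    if not obs.windows_visible:
        return 5  # blank desktop, input devices move but do nothing
    if obs.all_windows_hung and obs.task_manager_opens:
        return 4  # everything is hung, but Task Manager eventually appears
    if obs.shell_hung:
        return 3  # only the shell (explorer.exe) is hung
    if not obs.app_repaints:
        return 2  # application window no longer repaints
    if not obs.app_handles_input:
        return 1  # application repaints but its functionality does not respond
    return None   # nothing matching a hang bucket was observed
```

For example, `classify_hang(Observation(app_repaints=False))` returns 2, while an observation with every default left in place returns `None`.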
In each of the upcoming posts, expect to see the following for each symptom:
Scoping Steps (what works vs. what doesn’t in each scenario)
Specific Debug Steps
Please look forward to these installments in the New Year!