Written by Jeff Dailey:
As a debugger, have you ever reflected on the interesting parallels between your job and the work being done in other industries? When I think about solving complex computer problems, I think of it as forensics. The core diagnostic or troubleshooting skill can be applied to anything. I bet if you were to walk around and talk to the people in our group, you would find a lot of guys who watch programs like CSI, House, Law and Order, etc. Heck, there is even a guy just a few cubicles away who has a House poster on his wall. Years ago I used to watch Quincy and other detective shows with the same level of fascination. I started thinking: cops, coroners, doctors and engineers, we all do the same type of work. In fact, at a recent conference we had a forensics expert come in and speak to us. It was really fascinating. Even though the presenter worked in a completely different industry, most of us found ourselves thinking "Hey, I could do that job!", because in essence, that's what we do already.
I started having some fun thinking about the parallels between detective shows and our work.
Cut to scene:
It's 2:00AM. The camera zooms in on a pager going off on a night stand… It can only mean one thing. Something bad has happened and people are reaching out for help. The detective wakes up and tells the wife, "Sorry, they need me… I've got to go."
Funny, I've done the same thing, only because someone found a dead server.
The detective shows up at the scene of the crime. All the officers on-site are baffled, so they just keep things roped off until the expert gets there. His years of experience and unique insight will allow him to see things others don't.
Hmm… This seems familiar, only I typically use Live Meeting or Easy Assist...
Using a combination of specialized tools and methods, learned both in school and from others and handed down over time, evidence is gathered at the scene so that additional research can be done back at the office. Witnesses are questioned: "So about what time did this happen?", "Did you hear any unusual noises?", "Did you see anyone or anything unusual?" Pictures are taken, objects are collected, and fibers and DNA samples are gathered.
Ok so the scope of the problem is determined and all available data is collected. Hmm, I do this every day.
The Mayor calls the office to tell the chief of detectives that we must have this case solved. It can't happen again. We must catch the Villain!
Feel free to substitute Mayor with any high level management figure. Wow, this is either a nasty bad guy or someone's driver is causing pool corruption that is crashing a critical server!
We now cut to a montage where the detective is in the lab, using luminol, searching for DNA evidence, reflecting on the core facts of the case, researching past crimes.
I don't know about you, but I simply refer to this as the debugging process.
Finally a breakthrough: DNA collected at the scene of the crime identifies a suspect that should not have been at the scene. Additional research shows that the suspect has a history of this type of activity. The bad guy is hauled in, charges are filed, and the case is solved!
This would equate to finding root cause, filing a bug, and getting a fix out the door!
Ultimately that's what we do. We are all detectives looking for the digital DNA of bugs in the wild affecting our customers. We hunt them down using tools, expertise, and experience.
When it comes to collecting critical forensic information and looking for that digital DNA of a bug, it often comes down to getting a dump of the process or system.
GES (Global Escalation Services, formerly known as CPR) Escalation Engineers have probably looked at more dumps than the average person passes telephone poles in a lifetime. Don't get me wrong, we do a lot of live debugs also; however, dumps are the staple item in our debugging diet.
To begin, let's go over the kinds of dumps we typically ask for, and why. Customers often think it's drastic to bring down an entire server via a "CRASH DUMP". Is it worth it? The answer is ABSOLUTELY!!!
Full User-Mode dump
This dump file includes the entire memory space of a process, the program's executable image itself, the handle table, and other information that will be useful to the debugger.
Figure 1: Scope of a Full User-Mode dump
A Kernel Summary Dump
This dump contains all the memory in use by the kernel at the time of the crash.
This kind of dump file is significantly smaller than the Complete Memory Dump.
Figure 2: Scope of a Kernel Summary dump
A Full/Complete Memory Dump
This is the largest dump file. This file contains all the physical memory for the machine at the time of the fault.
Figure 3: Scope of a Full/Complete Memory dump
When you open a dump with WinDbg, you can use the || (pipe pipe) command to determine the type of dump that you are analyzing.
If the problem does not directly involve any user mode processes, or we suspect there is a driver at fault, we may simply ask for a Kernel Summary dump.
If the problem involves user mode (application) code along with actions taking place in the kernel, or there are multiple processes on the same machine making cross process calls, we will need the full user and kernel mode address ranges from the machine (Full Memory dump) so we can debug into the various user mode parts of the code that were being executed at the time the crash dump was captured.
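The rules of thumb above can be written down as a tiny decision helper. This is only a sketch of the reasoning in this post, not a tool we actually ship: the Symptoms fields and the recommend_dump name are made up for illustration, and real cases always involve more judgment than three booleans.

```python
from dataclasses import dataclass


@dataclass
class Symptoms:
    """Hypothetical summary of what we know about the problem so far."""
    single_process_only: bool   # trouble is confined to one process
    user_mode_involved: bool    # application code is part of the picture
    cross_process_calls: bool   # processes on the box are calling each other


def recommend_dump(s: Symptoms) -> str:
    """Map the symptoms to the kind of dump described in this post."""
    if s.single_process_only:
        # One misbehaving process: a user dump of that process is enough.
        return "full user-mode dump"
    if s.user_mode_involved or s.cross_process_calls:
        # We need user and kernel address ranges for the whole machine.
        return "complete memory dump"
    # Suspected driver or pure kernel issue.
    return "kernel summary dump"
```

For example, a suspected bad driver with no application involvement gives recommend_dump(Symptoms(False, False, False)), which comes back as a kernel summary dump.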
Now that you have crashed your server and collected a dump for us to look at, you may ask what we can tell from the full / kernel dump. Well the following is just a small fraction:
- The state of every thread call stack in every process on the machine.
- The state of every lock on the machine.
- The destination for RPC, DCOM, and LPC calls.
- The CPU Usage of every thread in every process, how much kernel time vs. user time, how long since it last executed.
- How long has a thread or process been alive?
- The version, date, size, and checksum for binaries loaded in memory (not paged out) at the time of the dump.
- What process spawned what process or thread?
- How much kernel and user mode memory is being used by any given process?
- What programs are being run in what session?
- What connections are currently made to what remote resources, things like IP Address and ports?
- What files are open on the machine, locally and remotely?
- How long the server has been up.
- We can reconstruct bitmaps from GDI surfaces.
- We can dump running scripts.
- How long any I/O has been outstanding to any device.
- If any I/O or device has had an I/O error.
- What the last error was on any thread on the machine.
- We can figure out who is corrupting memory, if memory corruption detection technologies are employed before the dump is taken.
- We can at times figure out what driver or process is leaking memory.
- What handles are being leaked.
- In most cases we can match the line of source executing during every stack frame of every call to the Windows source code and determine why each discrete call in the call stack was made.
- We can dump parts of the Windows Event log.
- We can dump out all of the SMB connection info and we can tell what types of activity are happening in the server services.
- We can interrogate the state of the system cache.
- We can find out what filter drivers are in the I/O chain.
What about LiveKd?
Sometimes we need to collect a kernel dump from a server and we are just not able to actually crash the server. Either the customer does not want us to, or their business will not allow it. In this case we can use the Sysinternals tool LiveKd. LiveKd works with our kernel debugger by installing a device driver and copying the memory from kernel mode out to user mode so that the debugger can open the rolling snapshot of the kernel. You can then write this dump file to disk by doing a .dump /f C:\livekd.dmp. While this will not crash or halt the server, it does give us some kernel information.
Words of Caution!
LiveKd does not provide an atomic snapshot of the server, because the operating system is still running while the driver is collecting the memory from the kernel. It can take several minutes to get a LiveKd dump, and during this time LiveKd may read memory for structures such as lists, arrays, threads and other items that are changing while the snapshot is being collected. This being the case, the timing data within the dump is no longer valid. For example, you are not able to see if one thread has been waiting longer than another based on its idle ticks, because the items were not collected simultaneously. Also, linked lists and pool may appear corrupt or inconsistent in the dump. This may make some debugger extensions loop endlessly or even crash. However, you can use this output to get an idea of what is going on in more general terms, for slowly changing values such as handle counts in handle tables. You can often dump out handles, look at how many threads are in various processes, and see the types of things they were doing during the window in which the dump was collected. It's just very difficult to draw definitive conclusions from this type of dump. Still, in those cases where we have no other option, using LiveKd to get a dump can provide valuable information.
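To see why an inconsistent list can send a debugger extension into an endless loop, here is a minimal sketch of a defensive list walk. It is an illustration only, not how any real extension is written: the list is modeled as a plain dictionary mapping a node address to its forward link, and walk_list and its status strings are names invented for this example.

```python
def walk_list(head, flink, max_nodes=10_000):
    """Walk a circular list until the forward link returns to the head.

    head:  address of the list head
    flink: dict mapping each node address to its forward link
    Returns (nodes_in_order, status); status is one of
    "ok", "loop", "dangling" or "truncated".
    """
    order = []
    seen = {head}
    node = flink.get(head)
    while node != head:
        if node is None:
            # The link points at memory we never captured.
            return order, "dangling"
        if node in seen:
            # A naive walker would spin here forever.
            return order, "loop"
        if len(order) >= max_nodes:
            return order, "truncated"
        seen.add(node)
        order.append(node)
        node = flink.get(node)
    return order, "ok"
```

A consistent list such as {0x10: 0x20, 0x20: 0x30, 0x30: 0x10} walks cleanly from head 0x10. If the last link was captured after the list changed and now points back at 0x20, the walk reports "loop" instead of hanging.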
About User Dumps
So what about "user dumps"?
Why, when and how? We ask for user dumps if we know the problem is limited to just one process having trouble. User dumps are usually associated with high CPU, memory corruption, memory consumption, or a hung process. A user dump provides memory from a single process rather than from all processes (see Figure 1 above), so we can only debug the process you are targeting.
High CPU: "Three in a Row and we're Good to Go"
When a process is consuming a lot of CPU we will typically ask for 3-4 dumps, usually taken 10-15 seconds apart. We recommend using userdump.exe, adplus -hang (process id), or attaching windbg.exe to the process and doing a .dump /ma C:\dump1.dmp, then dump2, dump3, and so on. The /ma switch collects some extra information from the kernel and stores it in the user dump. Without this extra data in the user dump we would not be able to get things like thread execution time and handle information. The thread execution time allows the !runaway and .ttime commands to work on the dump. Without this data we could not tell what thread is consuming CPU.
We can then open the successive dumps and check the various states of the threads in the process over time. If one thread is constantly changing in each dump and consuming more and more CPU, this is typically the thread at fault. We then examine the reason each call is being made in that thread's context. We also check for things like the last error that occurred on that thread using !gle (Get Last Error).
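The comparison across dumps is simple arithmetic, and a sketch may make it concrete. The code below assumes you have pasted the !runaway output from each dump into a text string, and that each thread line looks roughly like "0:55c   0 days 0:00:01.500" (index, thread id, then elapsed time). That layout is an assumption here and may differ between debugger versions, and parse_runaway and cpu_growth are hypothetical helper names.

```python
import re

# Assumed line layout: "<index>:<thread id>   <d> days <h>:<mm>:<ss.fff>"
LINE = re.compile(
    r"^\s*(\d+):([0-9a-fA-F]+)\s+(\d+) days (\d+):(\d+):(\d+(?:\.\d+)?)\s*$"
)


def parse_runaway(text):
    """Return {thread_id: cpu_seconds} for lines matching the assumed layout."""
    times = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # skip headers and anything we do not recognize
        _, tid, days, hours, minutes, seconds = m.groups()
        times[tid.lower()] = (
            int(days) * 86400 + int(hours) * 3600 + int(minutes) * 60 + float(seconds)
        )
    return times


def cpu_growth(snapshots):
    """Rank threads by CPU time gained between the first and last snapshot."""
    first, last = snapshots[0], snapshots[-1]
    growth = {tid: last[tid] - first[tid] for tid in first if tid in last}
    return sorted(growth.items(), key=lambda kv: kv[1], reverse=True)
```

The thread at the top of the cpu_growth list is the one to look at first, since it burned the most CPU between the first and last dump.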
When it's Hung we Just Need One
When a process is hung it's typically due to a deadlock, or to an application making a blocking call in its window proc that prevents a repaint event from happening. In this type of case we need to get a dump once the application has become unresponsive. You can use WinDbg and dump the process via .dump /ma, use userdump.exe, or use adplus -hang. Once we get this user dump we will typically look for a thread waiting on a critical section, event, semaphore or system call. You can pretty quickly get an idea of what is going on by doing a ~*kv and looking at the various thread states. If you see critical sections being waited on, you can run !locks and it will tell you what the lock dependency is.
Adplus -crash and You'll Save Some Cash
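Once you know which thread is waiting on which lock and who owns each lock, spotting a deadlock is a matter of following the chain until it comes back around. Here is a minimal sketch of that idea. The two dictionaries are something you would fill in by hand from the debugger output, and find_deadlock is a hypothetical helper, not a debugger command.

```python
def find_deadlock(waiting_on, owner_of):
    """Look for a cycle in the wait-for chain.

    waiting_on: thread -> lock the thread is blocked on
    owner_of:   lock -> thread that currently owns the lock
    Returns the threads that form a cycle, or [] if none is found.
    """
    for start in waiting_on:
        path = []
        seen = set()
        thread = start
        # Follow thread -> lock -> owner until the chain ends or repeats.
        while thread in waiting_on and thread not in seen:
            seen.add(thread)
            path.append(thread)
            thread = owner_of.get(waiting_on[thread])
        if thread == start:
            # The chain came back to where it began: a deadlock.
            return path
    return []
```

With thread T1 waiting on lock A owned by T2, and T2 waiting on lock B owned by T1, the helper returns ["T1", "T2"]. A thread that is merely waiting on a lock held by a running thread produces an empty list.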
With a process that is crashing, whether due to memory corruption, divide by zero, access violations, or any number of potential unhandled exceptions, we need to have either a debugger attached ahead of time or adplus -crash monitoring the process in question. A dump of the process taken before it has crashed, or after it has crashed and restarted, will not show us what we need to know. We need to be watching the process ahead of time so we can capture the dump in the state it was in at the time of the event. Adplus -crash will typically catch a second chance exception, that is, an exception in the application code that is not handled by the application's exception handler. In this case it falls through to the operating system to handle, and we then typically tear down the process in question and/or invoke JIT (Just-in-Time) debugging (search for the AEDebug registry key for details). If you are using windbg.exe and have attached to the process, a "crash" or unhandled exception should halt or break into the debugger. At this point you can do a .dump /ma C:\my dumpfile.dmp
What about intermittent problems?
An intermittent problem can be one of the most difficult to isolate. Oftentimes a condition may only last a few seconds, yet it could critically affect server or application operation. In this type of scenario we have to get inventive, literally inventing new tools and methods of catching the bug or problem in the act. A good example is the sample I posted earlier of catching a hung window. This sample shows how to monitor something, in this case Windows message pump responsiveness, and take action if the response time falls outside your specified parameters. See "Detecting and automatically dumping hung GUI based windows applications."
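The general shape of such a monitor is a loop that times a probe and fires an action when the probe is too slow. The sketch below shows only that shape and is not the sample referenced above. The names watch, probe and on_slow are invented for illustration, and it assumes the probe eventually returns on its own (for example because it has a built-in timeout); a probe that never returns would hang this loop too.

```python
import time


def watch(probe, on_slow, threshold_s, checks, interval_s=0.0, clock=time.monotonic):
    """Time probe() up to `checks` times and react to the first slow response.

    probe:       callable that exercises the thing being monitored
    on_slow:     callable invoked with the elapsed seconds (e.g. take a dump)
    threshold_s: response time considered out of bounds
    Returns True if on_slow was fired, False otherwise.
    """
    for _ in range(checks):
        start = clock()
        probe()
        elapsed = clock() - start
        if elapsed > threshold_s:
            on_slow(elapsed)
            return True
        time.sleep(interval_s)
    return False
```

In a real monitor, on_slow is where you would kick off whatever collects the dump, so the process is captured while it is still in the bad state.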
Good luck and happy debugging!
PS: Doesn't CSI stand for Crashed Server Investigation?