Why are there all these processes lingering near death, and what is keeping them alive?


A customer (via their customer liaison) asked for assistance debugging a failure that occurs when their system remains up for several weeks.

The customer seems to have a complicated system where they create and kill processes, and I am seeing hundreds of processes in the following state.

PROCESS fffffa80082a7960
    SessionId: 0  Cid: 1490    Peb: 7efdf000  ParentCid: 2614
    DirBase: 1b3fd0000  ObjectTable: 00000000  HandleCount:   0.
    Image: contoso.exe
    VadRoot 0000000000000000
    DeviceMap fffff8a000008ca0
    Token                             fffff8a00dbf9060
    ElapsedTime                       1 Day 01:25:38.983
    UserTime                          00:00:01.903
    KernelTime                        00:00:00.265
    QuotaPoolUsage[PagedPool]         0
    QuotaPoolUsage[NonPagedPool]      0
    Working Set Sizes (now,min,max)  (5, 50, 345) (20KB, 200KB, 1380KB)
    PeakWorkingSetSize                17981
    VirtualSize                       222 Mb
    PeakVirtualSize                   246 Mb
    PageFaultCount                    29532
    MemoryPriority                    BACKGROUND
    BasePriority                      8
    CommitCharge                      0
 
No active threads
        THREAD fffffa800a358b50  Cid 1490.2704  TERMINATED

This got me curious: Why are there so many of these near-death processes, and what could be keeping them alive?

I'll let you puzzle on this for a little bit. But you already know the answer.

(Waiting.)

(Waiting.)

Okay.

The first thing you should observe is that this process is not actually alive. It has already exited. "No active threads." The one thread that is still associated with the process has terminated.

Why would you have a terminated thread hanging around inside a terminated process?

Because there is still an outstanding handle to the thread.

Even though the thread has exited, the thread object can't go away until all handles to it have been closed.

Now, when you create a process with the CreateProcess function, you get a PROCESS_INFORMATION structure back which contains four pieces of information:

  1. A handle to the created process.
  2. A handle to the initial thread in the process.
  3. The ID of the created process.
  4. The ID of the initial thread in the process.

Of those things, you are probably interested in the process handle, because that's the thing you can wait on to learn when the process has exited. And you probably ignore the thread handle.

Oops.

You need to close the thread handle, or the thread cannot go away. It may have stopped executing, but the fact that you have a handle to it means that you can still do things like check if the thread has exited (yes, already!), ask for the exit code, ask for the thread ID, ask how much CPU the thread consumed during its lifetime, and so on, and all those statistics are kept in the thread object. And since thread and process IDs need to remain unique as long as there is still a handle to the object, the object needs to hang around and "occupy space" so that no other thread can grab its ID.

The customer called this process near death, but the more conventional term for it is zombie. In fact, zombie isn't a good term either, because this process and this thread are well and truly dead, never to walk again.

A better name would be corpse. The process and thread are dead. They're just lying there, rotting away in memory, waiting for all references to be released so they can disappear entirely.

Since the customer liaison said that the customer has "a complicated system where they create and kill processes", it's entirely possible that somewhere in the complicated system, somebody loses track of the thread handle, causing it to leak. It also calls into question whether they need this complicated system at all. Maybe their complicated system exists to work around some other problem, and we should be trying to solve that other problem.

Just for completeness, another possibility for the thread lying dead in the process is that some kernel driver has taken a reference to the thread and has gotten stuck.

We left the customer liaison with that information. We didn't hear back, so either our guess about thread handles was correct, or the customer decided we weren't being helpful enough and decided to stop talking to us.

Comments (30)
  1. Jimmy Queue says:

    Pining for the fjords...

  2. Nathan_works says:

    Why do I get the feeling a whole lot of customer issues end up closed because "[they] decided we weren't being helpful enough and decided to stop talking to us," mainly because the customer had a whole lot invested already in their VERY wrong approach?

  3. David Crowell says:

    Could CreateRemoteThread bring the process back to life?  Not that you should....

  4. Joshua says:

    @David Crowell: I've been planning to mock up a test case just to find out, but I agree you *should* not.

  5. Darran Rowe says:

    Looking at the information provided, what I can see happening is something like a service is running an executable to do some work, and then it exits. The thing that also strikes me as important is that the executable only seems to run for around 2 seconds.

    So what I wonder is, could this be a service that runs a program to do stuff on behalf of a user, rather than using impersonation?

  6. Gabe says:

    It's interesting to see that the corpse isn't completely rotted away. Does anybody know what those 5 pages in the process working set are for?

    Darran Rowe: The process uses 2 seconds of CPU time, but that doesn't mean it only ran for 2 seconds. It could have run for a whole day, mostly waiting and doing I/O, and still only have 2 seconds of CPU time accumulated.

  7. alegr1 says:

    That's what one version of Visual Studio did (2008 or later) when it started builds.

  8. Evan says:

    > Pining for the fjords...

    Pining for the fjords?! What kind of talk is that? Look, I took the liberty of examining that thread, and I discovered that the only reason it appeared in the runlist in the first place was because it had been hard-coded there!

  9. SamYeager says:

    "Pining for the fjords?! What kind of talk is that?"

    I suggest a search for 'Monty Python' & 'dead parrot'. :)

  10. Myria says:

    Windows NT's process system is *so* much better than UNIX's in this regard.  In Linux, for example, it is not possible for you to wait for a process that you did not fork()/posix_spawn() yourself to finish.  It makes process management really painful in Linux.

  11. Timothy Byrd (ETAP) says:

    @SamYeager:

    I suggest replacing "because it had been hard-coded there" with "because it had been nailed there", and etc.

    He doesn't need to search :)

  12. dmex says:

    "No active threads"

    That's not limited to leaked thread handles or driver references...

    When a process crashes on Windows 8.1 and Windows 10, Windows Error Reporting will create a 'reflected' process with absolutely zero threads. If you open a process handle to that 'zero thread process' and call the TerminateProcess function, you'll get a Blue Screen of Death.

    Hopefully that customer's "complicated system where they create and kill processes" doesn't attempt to terminate that type of process.

  13. Darran Rowe says:

    @Gabe: This was one of the cases where I was trying to use context to my advantage. Since UserTime and KernelTime are the times when any threads of that process were being executed, I used that to be lazy and write less. I apologise for any confusion that I may have caused you.

  14. alegr1 says:

    dmex - a bug needs to be filed with MS, if it really happens.

  15. Dango says:

    So how can one test if a given PID corresponds to a zombie process or a still running process? A second OpenProcess() on such a zombie process will succeed! One could try to WaitForSingleObject() on the given HANDLE, but it will return with WAIT_FAILED and GetLastError() == ERROR_ACCESS_DENIED => this doesn't seem to be conclusive and could happen for a regular process, too, maybe if we're lacking access rights.

  16. Joshua says:

    @Dango: It's specific to Zombie Process if you opened the handle with the right access bits. Access is only checked in the OpenProcess() call.

  17. I am now adopting the terms "corpse thread" and "corpse process".

  18. Gabe says:

    Darran Rowe: If you really meant CPU time, why is it important that the process only used 2 seconds of CPU? The computer I'm sitting at claims an up time of over 10 days and about half of the 100-or-so processes show a CPU time of 2 seconds or less.

    Those processes include the likes of csrss, smss, wininit, winlogon, and the print spooler; so I don't see what is implied by having used only 2 seconds of CPU.

  19. Darran Rowe says:

    @Gabe:

    OK, I will explain everything in detail. But the entire thing was just hypothesising over the scraps of information, it isn't any kind of remarkable life changing idea. One possible thought on what things with processes they could be doing to get to this state.

    First there is the fact that it is not just the process with 2 seconds of CPU time, but there are hundreds of them in this near death state with two seconds of CPU time.

    Second, the process is in session 0, which is the session that services live in. If a service executes a process, then I'm certain that it also runs in session 0 unless the token is modified to run in another session. (Assuming post XP and 2k3, but since it is a 64 bit system, and the popularity of 64 bit server versions of Windows has increased over time, I don't think that is too bad of an assumption).

    Third, with the fact that there are so many processes with leaked handles, this hints more at the processes being short lived worker processes. If they were long lived, then you wouldn't end up with hundreds of them.

    With those bits of information, the thought was instead of using impersonation in the service, maybe because of threading reasons, they spin up a new process with CreateProcessAsUser. If you already have the work you want done in an existing application, it would be easier to just execute this instead of creating a new thread to do the work on. Depending on how good the developers are, it may have just been easier to write this work into a new application instead of dealing with the threading issues anyway.

    So that was more of the thought process which led to that huge waste of time.

  20. Ken Hagan says:

    "So how can one test if a given PID corresponds to a zombie process or a still running process?"

    Off the top of my head, I think you can ask for the process's exit code. If it is still running, you'll get an error to say there's no exit code yet.

  21. Alois Kraus says:

    As far as I recall, a zombie process has a thread count of zero. If you find a process with zero threads, you can be sure that it is a zombie process. If you want to detect a crashed or zombified process, you also need to check whether it has all threads suspended except for one thread which is right now dumping the process. Things become more complicated if the process being checked is a managed process, where you need to differentiate between a garbage collection and a memory dump operation.

  22. Killer{R} says:

    /*Myria: Windows NT's process system is *so* much better than UNIX's in this regard.*/

    The Windows NT kernel is much better than most UNIXes overall - it's much more logical and even aesthetically beautiful. However, the biggest problem with everything beautiful is that when some part of it becomes corrupted, the whole thing becomes ugly. But when entropy's corruption touches a system that was not initially designed by strict rules from bottom to top, almost nothing visually changes...

  23. cheong00 says:

    I think the conventional term for this kind of process is "defunct" process. It's been more explicit in the *nix world because the "ps" command marks with "<defunct>" those processes that have exited but still exist solely because the parent process is monitoring them.

  24. GWO says:

    @cheong00: Note that the 'ps' manual page (at least on this Debian based system) says:

    "Processes marked <defunct> are dead processes (so-called "zombies") that remain because their parent has not destroyed them properly."

  25. Gabe says:

    I would think that the term "zombie" would apply to David Crowell's hypothetical process that has exited, but where somebody called CreateRemoteThread on it to bring it back to life.

  26. Killer{R} says:

    IMHO Windows developers took care to avoid a zombie apocalypse. Such an attempt will likely fail with STATUS_PROCESS_IS_TERMINATING.

  27. Andrei says:

    Stupid question: Is this statement "via their customer liaison" of any importance? After so long following your blog, it looks like you always offer all the required information for the post, but rarely something useless. So, what's the story behind it? I've noticed it for some time now.

  28. — "It also calls into question whether they need this complicated system at all. Maybe their complicated system exists to work around some other problem, and we should be trying to solve that other problem."

    Tell that to the Internet Explorer team. They do it. Oh, wait. Internet Explorer is dead. Okay, you can tell that to the Google Chrome team. Or the Firefox team.

    [I don't think the system used by browsers is all that complicated, and my guess is that they don't TerminateProcess as the primary lifetime management mechanism. Also, I did say "maybe". -Raymond]

  29. GregM says:

    'Is this statement "via their customer liaison" of any importance?'

    Yes, it means that Raymond didn't have direct contact with the customer, all information from and back to the customer was being filtered by someone.

  30. John Doe says:

    @GregM: and just as important, vice-versa.

Comments are closed.
