How do I wait until all processes in a job have exited?


A customer was having trouble with job objects. Specifically, the customer found that a WaitForSingleObject on a job object was not completing even though all the processes in the job had exited.

This is probably the most frustrating part of job objects: A job object does not become signaled when all processes have exited.

The documentation explains: "The state of a job object is set to signaled when all of its processes are terminated because the specified end-of-job time limit has been exceeded. Use WaitForSingleObject or WaitForSingleObjectEx to monitor the job object for this event."

The job object becomes signaled only if the end-of-job time limit has been reached. If the processes exit without exceeding the time limit, then the job object remains unsignaled. This is a historical artifact of the original motivation for creating job objects, which was to manage batch-style server applications that were short-lived and usually ran to completion. The original purpose of job objects was to keep those processes from getting into a runaway state and consuming excessive resources. Therefore, the interesting thing from a job object's point of view was whether the process being managed in the job had to be killed for exceeding its resource allocation.
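To make the distinction concrete, here is a minimal sketch of the one situation in which waiting on the job handle does work: the job has been given a time limit and exceeds it. The helper names SecondsToJobTimeTicks, SetJobUserTimeLimit, and WaitForJobTimeLimit are made up for illustration and are not part of the program in this article. Note that PerJobUserTimeLimit counts user-mode CPU time in 100-nanosecond units rather than wall-clock time, and the sketch assumes the job keeps its default end-of-job action of terminating its processes. The Windows-specific part is wrapped in #ifdef _WIN32 so that the unit conversion can be compiled on its own.

```cpp
#include <cstdint>

// Job time limits are expressed in 100-nanosecond ticks.
constexpr std::int64_t SecondsToJobTimeTicks(std::int64_t seconds)
{
    return seconds * 10000000LL;
}

#ifdef _WIN32
#include <windows.h>

// Hypothetical helper: gives the whole job a budget of user-mode CPU time.
// Returns true if the limit was applied.
inline bool SetJobUserTimeLimit(HANDLE job, std::int64_t seconds)
{
    JOBOBJECT_BASIC_LIMIT_INFORMATION limits = {};
    limits.LimitFlags = JOB_OBJECT_LIMIT_JOB_TIME;
    limits.PerJobUserTimeLimit.QuadPart = SecondsToJobTimeTicks(seconds);
    return SetInformationJobObject(job, JobObjectBasicLimitInformation,
                                   &limits, sizeof(limits)) != FALSE;
}

// Hypothetical helper: returns true if the job handle became signaled,
// which happens when the processes were terminated for exceeding the
// time limit. It does not become signaled merely because every process
// exited on its own.
inline bool WaitForJobTimeLimit(HANDLE job, DWORD timeoutMs)
{
    return WaitForSingleObject(job, timeoutMs) == WAIT_OBJECT_0;
}
#endif
```

If the processes in the job finish before using up their budget, WaitForJobTimeLimit simply times out, which is exactly the behavior the customer ran into.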

Of course, nowadays, most people use job objects just to wait for a process tree to exit, not for keeping a server batch process from going runaway. The original motivation for job objects has vanished into the mists of time.

In order to wait for all processes in a job object to exit, you need to listen for job completion port notifications. Let's try it:

#define UNICODE
#define _UNICODE
#define STRICT
#include <windows.h>
#include <stdio.h>
#include <atlbase.h>
#include <atlalloc.h>
#include <shlwapi.h>

int __cdecl wmain(int argc, PWSTR argv[])
{
 CHandle Job(CreateJobObject(nullptr, nullptr));
 if (!Job) {
  wprintf(L"CreateJobObject, error %d\n", GetLastError());
  return 0;
 }

 CHandle IOPort(CreateIoCompletionPort(INVALID_HANDLE_VALUE,
                                       nullptr, 0, 1));
 if (!IOPort) {
  wprintf(L"CreateIoCompletionPort, error %d\n",
          GetLastError());
  return 0;
 }

 JOBOBJECT_ASSOCIATE_COMPLETION_PORT Port;
 Port.CompletionKey = Job;
 Port.CompletionPort = IOPort;
 if (!SetInformationJobObject(Job,
       JobObjectAssociateCompletionPortInformation,
       &Port, sizeof(Port))) {
  wprintf(L"SetInformation, error %d\n", GetLastError());
  return 0;
 }

 PROCESS_INFORMATION ProcessInformation;
 STARTUPINFO StartupInfo = { sizeof(StartupInfo) };
 PWSTR CommandLine = PathGetArgs(GetCommandLine());

 if (!CreateProcess(nullptr, CommandLine, nullptr, nullptr,
                    FALSE, CREATE_SUSPENDED, nullptr, nullptr,
                    &StartupInfo, &ProcessInformation)) {
  wprintf(L"CreateProcess, error %d\n", GetLastError());
  return 0;
 }

 if (!AssignProcessToJobObject(Job,
         ProcessInformation.hProcess)) {
  wprintf(L"Assign, error %d\n", GetLastError());
  return 0;
 }

 ResumeThread(ProcessInformation.hThread);
 CloseHandle(ProcessInformation.hThread);
 CloseHandle(ProcessInformation.hProcess);

 DWORD CompletionCode;
 ULONG_PTR CompletionKey;
 LPOVERLAPPED Overlapped;

 while (GetQueuedCompletionStatus(IOPort, &CompletionCode,
          &CompletionKey, &Overlapped, INFINITE) &&
          !((HANDLE)CompletionKey == Job &&
           CompletionCode == JOB_OBJECT_MSG_ACTIVE_PROCESS_ZERO)) {
  wprintf(L"Still waiting...\n");
 }

 wprintf(L"All done\n");

 return 0;
}

The first few steps are to create a job object, then associate it with a completion port. We set the completion key to be the job itself, just in case some other I/O gets queued to our port that we aren't expecting. (Not sure how that could happen, but we'll watch out for it.)

Next, we launch the desired process into the job. It's important that we create it suspended so that we can put it into the job before it exits or does something else that would mess up our bookkeeping. After it is safely assigned to the job, we can resume the process's main thread, at which point we have no use for the thread and process handles.

Finally, we go into a loop pulling events from the I/O completion port. If the event is not "this job has no more active processes", then we just keep waiting.
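One thing the loop above glosses over: if GetQueuedCompletionStatus itself fails, the loop also exits, and the program reports "All done" anyway. A minimal sketch of a stricter version follows. IsJobEmptyMessage and WaitForJobToEmpty are hypothetical names, not part of the original program, and the constant 4 is assumed to match JOB_OBJECT_MSG_ACTIVE_PROCESS_ZERO from winnt.h (the static_assert double-checks that when building on Windows).

```cpp
#include <cstdint>

// Assumed to equal JOB_OBJECT_MSG_ACTIVE_PROCESS_ZERO; repeated here so
// that the predicate below does not depend on windows.h.
constexpr std::uint32_t kMsgActiveProcessZero = 4;

// True only for "no more active processes" posted with our own key.
constexpr bool IsJobEmptyMessage(std::uintptr_t key, std::uint32_t code,
                                 std::uintptr_t expectedKey)
{
    return key == expectedKey && code == kMsgActiveProcessZero;
}

#ifdef _WIN32
#include <windows.h>

static_assert(kMsgActiveProcessZero == JOB_OBJECT_MSG_ACTIVE_PROCESS_ZERO,
              "constant must match the SDK header");

// Hypothetical helper: returns true once the job reports that it is empty,
// or false if the wait itself failed (call GetLastError for the reason).
// Assumes the job handle was used as the completion key.
inline bool WaitForJobToEmpty(HANDLE port, HANDLE job)
{
    for (;;) {
        DWORD code;
        ULONG_PTR key;
        LPOVERLAPPED overlapped;
        if (!GetQueuedCompletionStatus(port, &code, &key,
                                       &overlapped, INFINITE)) {
            return false;
        }
        if (IsJobEmptyMessage(key, code,
                              reinterpret_cast<ULONG_PTR>(job))) {
            return true;
        }
        // Some other notification; keep waiting.
    }
}
#endif
```

Separating the predicate from the wait keeps the "is this the message we care about" decision in one place, which is handy if you later decide to tighten it.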

Officially, the second parameter to GetQueuedCompletionStatus is lpNumberOfBytes, but the job notifications are posted via PostQueuedCompletionStatus, and the parameters to PostQueuedCompletionStatus can mean anything you want. In particular, when the job object posts notifications, it puts the notification code in the "number of bytes" field.
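As an illustration of reading the notification out of the "number of bytes" slot, the following sketch gives some of the codes names. The JobMessage enumeration and both helper functions are hypothetical. The numeric values are intended to mirror a subset of the JOB_OBJECT_MSG_* constants in winnt.h and should be checked against your SDK headers. The sketch also assumes, going by the documentation for JOBOBJECT_ASSOCIATE_COMPLETION_PORT, that per-process messages pass the process ID in the lpOverlapped slot, while job-wide messages pass null there.

```cpp
#include <cstdint>

// Hypothetical names for a subset of the job notification codes.
// The values are meant to mirror JOB_OBJECT_MSG_* in winnt.h.
enum class JobMessage : std::uint32_t {
    EndOfJobTime            = 1,
    EndOfProcessTime        = 2,
    ActiveProcessLimit      = 3,
    ActiveProcessZero       = 4,
    NewProcess              = 6,
    ProcessExited           = 7,
    ProcessExitedAbnormally = 8,
    ProcessMemoryLimit      = 9,
    JobMemoryLimit          = 10,
};

// Human-readable description of a notification code.
constexpr const char* Describe(JobMessage message)
{
    switch (message) {
    case JobMessage::EndOfJobTime:            return "job time limit reached";
    case JobMessage::EndOfProcessTime:        return "process time limit reached";
    case JobMessage::ActiveProcessLimit:      return "active process limit reached";
    case JobMessage::ActiveProcessZero:       return "no more active processes";
    case JobMessage::NewProcess:              return "process added to job";
    case JobMessage::ProcessExited:           return "process exited";
    case JobMessage::ProcessExitedAbnormally: return "process exited abnormally";
    case JobMessage::ProcessMemoryLimit:      return "process memory limit reached";
    case JobMessage::JobMemoryLimit:          return "job memory limit reached";
    }
    return "unrecognized notification";
}

// True if the lpOverlapped slot is assumed to hold a process ID
// for this message, rather than null.
constexpr bool CarriesProcessId(JobMessage message)
{
    switch (message) {
    case JobMessage::EndOfProcessTime:
    case JobMessage::NewProcess:
    case JobMessage::ProcessExited:
    case JobMessage::ProcessExitedAbnormally:
    case JobMessage::ProcessMemoryLimit:
    case JobMessage::JobMemoryLimit:
        return true;
    default:
        return false;
    }
}
```

In the sample program, you would convert the value with static_cast<JobMessage>(CompletionCode) and, when CarriesProcessId returns true, recover the process ID by casting the Overlapped pointer to ULONG_PTR and then to DWORD.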

Run this program with, say, cmd on the command line. From the nested cmd prompt, type start notepad. Then type exit to exit the nested command prompt. Observe that our program is still waiting, because it's waiting for Notepad to exit. When you exit Notepad, our program finally prints "All done".

Exercise: The statement "Not sure how that could happen" is a lie. Name a case where a spurious notification could arrive, and how the code can protect against it.

Comments (10)
  1. Joshua says:

    Please note that the ability to use a job to wait for a process tree is actually an abuse of the system. Your code cannot use it if some other piece of code is using it to monitor you, because jobs do not nest. (This was fixed starting in Windows 8; however, given its market uptake, if this is not back-ported to Windows 7, it is useless for a long time yet.)

  2. Exercise: JOBOBJECT_ASSOCIATE_COMPLETION_PORT structure: "The system sends messages to the I/O completion port associated with a job when certain events occur. If the job is nested, the message is sent to every I/O completion port associated with any job in the parent job chain of the job that triggered the message."

    So if a process in your job happens to create a nested job, that nested job could generate notifications with a different completion key from your own job, yet these notifications would be sent to your job anyway?

    Seems like this is a potentially breaking change with Windows 8, since older Windows versions didn't have nested jobs, so a program never needed to worry about a job spawning child jobs that generate unwanted notifications.  I don't see why a parent job would be interested in the notifications of its children, where the completion key is not the completion key of interest.

    Or did I read the documentation incorrectly here?

  3. Smouch says:

    Seems the lack of the word "only" is a source of confusion.

    "The state of a job object is set to signaled ONLY when all of its processes are terminated ONLY because the specified end-of-job time limit has been exceeded."

    Or something…

    And so it goes.

  4. Killer{R} says:

    Jobs would be a very good tool... if they weren't half-cooked. Every time I tried to use them, I ran into some very stupid problems, like the inability to nest jobs ('fixed' in Win8) or the unfixable impossibility of changing a thread's active input layout if it belongs to a process attached to a job with UI object restrictions. If not for this, jobs could be a very powerful toolset for creating sandboxes, but instead, due to those annoying 'little' problems, they are usable only as batch file execution time limiters.

  5. Kalle Olavi Niemitalo says:

    I wonder whether using the job handle as the completion key is entirely reliable.  I mean, the child process does not inherit the job handle, so if the child process also calls CreateJobObject, I'm afraid CreateJobObject could return the same handle value as in the parent process, even though the handle refers to a different job object.  I considered other ways to get unique completion keys (a LUID from AllocateLocallyUniqueId, or the process ID of the process that waits for the completion port), but these too require that all processes in the job allocate completion keys in the same way.

    For JOB_OBJECT_MSG_ACTIVE_PROCESS_ZERO specifically, I guess I'd make the process first wait for the completion notification and check the completion key but then verify the situation with QueryInformationJobObject.

  6. Myria says:

    I use job objects to ensure that my child processes terminate in the worst case if for some reason the parent process terminates.

  7. Joker_vD says:

    I am genuinely surprised there is no easy way to make sure that your child processes won't outlive you, neither on Windows nor on Linux. When I ran into this problem the first time, I thought there had to be a CREATE_DONT_OUTLIVE_PARENT flag for CreateProcess, but there wasn't.

  8. Myria says:

    @Joker_vD: I agree – there ought to be one.  In Windows, using a job with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE mostly works, particularly if you trust the child process.  In Linux, the child process can use prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0).

    The problems with these approaches are that you have to trust the child process, and that there is a slight window of not working – in Windows, between CreateProcess and AssignProcessToJobObject, and in Linux, between fork and prctl.

    I don't know a solution for Mac OS.

    [This article shows how you close the window: You start the child process suspended. -Raymond]
  9. Myria says:

    [This article shows how you close the window: You start the child process suspended. -Raymond]

    Sorry, I forgot to mention that.  My implementation does use CREATE_SUSPENDED.  The problem, though, is that if you somehow crash or get terminated between CreateProcess and AssignProcessToJobObject, you leave the suspended child process as a zombie.

  10. Harry Johnston says:

    @Myria: one alternative would be to put the parent process into the job object, and let the child processes inherit membership.  If the real parent process can't be put into the job object, create a trusted child process as an intermediary and use it as the job parent.
