Does Windows have a limit of 2000 threads per process?


Often I see people asking why they can’t create more than around 2000 threads in a process. The reason is not that there is any particular limit inherent in Windows. Rather, the programmer failed to take into account the amount of address space each thread uses.

A thread consists of some memory in kernel mode (kernel stacks and object management), some memory in user mode (the thread environment block, thread-local storage, that sort of thing), plus its stack. (Or stacks if you’re on an Itanium system.)

Usually, the limiting factor is the stack size.

#include <stdio.h>
#include <windows.h>

DWORD CALLBACK ThreadProc(void*)
{
 Sleep(INFINITE);
 return 0;
}

int __cdecl main(int argc, const char* argv[])
{
 int i;
 for (i = 0; i < 100000; i++) {
  DWORD id;
  HANDLE h = CreateThread(NULL, 0, ThreadProc, NULL, 0, &id);
  if (!h) break;
  CloseHandle(h);
 }
 printf("Created %d threads\n", i);
 return 0;
}

This program will typically print a value around 2000 for the number of threads.

Why does it give up at around 2000?

Because the default stack size assigned by the linker is 1MB, and 2000 stacks times 1MB per stack equals around 2GB, which is how much address space is available to user-mode programs.

You can try to squeeze more threads into your process by reducing your stack size, which can be done either by tweaking linker options or manually overriding the stack size passed to the CreateThread function as described in MSDN.

  HANDLE h = CreateThread(NULL, 4096, ThreadProc, NULL,
               STACK_SIZE_PARAM_IS_A_RESERVATION, &id);

With this change, I was able to squeak in around 13000 threads. While that’s certainly better than 2000, it’s far short of the naive expectation of 500,000 threads. (At 4KB of stack per thread in a 2GB address space, you might expect 2GB ÷ 4KB = 500,000.) But you’re forgetting the other overhead. Address space allocation granularity is 64KB, so each thread’s stack occupies 64KB of address space even if only 4KB of it is used. Plus of course you don’t have free rein over all 2GB of the address space; there are system DLLs and other things occupying it.

But the real question that is raised whenever somebody asks, “What’s the maximum number of threads that a process can create?” is “Why are you creating so many threads that this even becomes an issue?”

The “one thread per client” model is well-known not to scale beyond a dozen clients or so. If you’re going to be handling more than that many clients simultaneously, you should move to a model where instead of dedicating a thread to a client, you instead allocate an object. (Someday I’ll muse on the duality between threads and objects.) Windows provides I/O completion ports and a thread pool to help you convert from a thread-based model to a work-item-based model.

Note that fibers do not help much here, because a fiber has a stack, and it is the address space required by the stack that is the limiting factor nearly all of the time.

Comments (30)
  1. Travis Owens says:

    Of course we could raise the question, should any application be generating THAT many threads that this should be an issue in the first place.

    Or maybe I’m re-preaching "640k is enough for anybody" concepts.

  2. Joe Chung says:

    While it doesn’t make sense to have 2000+ threads on a single CPU yet, it might make sense in the (not so far) future when a machine may have several processors, each of which might be multi-core (2 or more processors per chip).

  3. Tim Ritchey says:

    Ahh, I/O Completion ports. Everyone seems to be on an IOCP kick these days. I was all excited about them myself until I learned that you cannot support older versions of Windows. At that point, the consensus seemed to be that OVERLAPPED I/O was the way to go for scalable networking sans threads. God help you if you are looking for a good example of how to create a robust overlapped networking application. At least more than just a simple echo client and echo server. I spent more time than I would like to admit getting everything to work. I don’t suppose anyone knows of a good real-world example of OVERLAPPED networking?

  4. kbiel says:

    I think Raymond has it right, the people asking this question are probably creating a thread per object. They probably don’t have an expectation to run all 2000+ threads simultaneously, they just don’t understand thread pooling.

  5. mschaef says:

    ‘The "one thread per client" model is well-known not to scale beyond a dozen clients or so.’

    Perhaps that should read ‘one win32 operating system thread per client’. From what I understand, never having used it myself, Erlang does a pretty good job of making application level threads cheap to create and dispatch:

    http://wagerlabs.com/tech/2005/05/27000-games.html

    "I have a PowerBook G4 1.25Ghz with 512Mb of memory. Running 27K games consumes about 600Mb of memory and takes around 15 minutes per 10K games due to heavy swapping. … Assuming an average of 5 players per game that’s 135,000 players and 405,000 simultaneous processes. "

    As you’ve pointed out before (http://blogs.msdn.com/oldnewthing/archive/2005/02/11/371042.aspx) sometimes it makes sense to think past OS-native services.

  6. Ray Trent says:

    32-bit Windows… how… 20th century.

    So I guess this isn’t likely to be a limit for very much longer, eh?

    Of course, people will complain about having a mere 8 trillion threads, no doubt…

  7. oldnewthing says:

Of course if you disallow native code you can get away with more tricks. A continuation uses a lot less memory than a thread but its use is limited to managed environments. Great work if you can get it.

  8. mgrier says:

    The core problem here is the sequential programming model combined with the native compilers’ reliance on the machine stack.

    As Raymond points out, continuations can effectively give you a ton of "threads". You don’t need managed code; there have been plenty of native-code-generating continuation-passing languages over the years. Continuation-passing doesn’t even solve everything because you need synchronization primitives which are associated with these continuations/contexts instead of operating system units of execution. (For example, I have no idea how well critical sections and fibers interact on NT…)

You do have to wonder then what a "thread" means. If it means whatever you want it to mean in a context, we’re comparing apples and oranges. In Raymond’s context, he’s talking about threads as an OS scheduling concept, not a programming language level concept.

  9. nikita says:

    It’s true that thread-per-connection architecture has scalability problems, but its alternative has its drawbacks too. So many of them actually, that after my first serious encounter with non-blocking-state-machine type of server (SquidNG, circa 2000), I ended up compiling a list of problems I had with that type of architecture: http://nikita.w3.to/thread-per-connection.html

    Since then I had to write a lot of similar code both in user and kernel space, but my opinion is still the same: "Async servers? Not in C.".

  10. Jonathan says:

    Presumably this limit is higher in 64-bit processes?

  11. Tim Smith says:

    Maybe I am just getting grumpy in my old age but…

I’ve done plenty of non-scalable threading systems. I would even swear Larry O. used some of my code in an article about poor threading models using events. But I understood the ramifications of what I was doing. I knew it wouldn’t work well in a high stress environment.

    Are people programming these days without a basic understanding of how computers work? A simple philosophy of "Nothing is free" would raise a red flag on the 2000 thread issue. Even before I got to sentence two in this article I was asking "But why?"

    *sigh* *rant off*

The really sad thing is that all the programmers who know they still have a lot to learn are the guys who are reading these blogs. It is the people who don’t know that they don’t have a clue who need to be here.

    Ugh, I shouldn’t post before my morning coke.

  12. Ulric says:

STACK_SIZE_PARAM_IS_A_RESERVATION is new in XP; for previous versions of Windows we’re entirely stuck, because you can only increase the reserved stack size for each thread from the default, not decrease it.

This means that since the main UI thread generally has a pretty large stack (several megs, for example, in a significant C++ application), any worker or helper thread ends up having as large a reserved stack size. This results in applications taking up much more memory than they anticipate for every additional thread, without knowing it. Something I learned the hard way, let’s just say!

  13. Shailesh says:

    From MSDN topic "Thread Stack Size" : "A stack is freed when its thread exits. It is not freed if the thread is terminated by another thread."

    Why exactly is the stack not freed if terminated by another thread?

  14. an0nym0us says:

I/O Completion ports work until you get blocked by a library or system API that is inherently sequential and does not support overlapped operations (Winsock name resolution, and authentication APIs, for example). These can take a few seconds and then you end up with a thread-per-client or a denial of service.

  15. LarryOsterman says:

    Shailesh, because terminating a thread means simply unhooking it from the scheduler (which causes it to stop executing since it will never be scheduled again).

The NT scheduler doesn’t know about things like "stacks"; a "stack" is a memory management thingy.

  16. dinov says:

    mgrier:

    On fibers & critical sections: they work like water & oil. Critical sections are per-thread and don’t see fibers at all. If Fiber a is running on thread 1 and takes a crst and then is switched out & back in on thread 2 and attempts to re-acquire the critical section it blocks. If fiber b was previously running on thread 3 and is then switched to thread 1 and attempts to acquire the same critical section it succeeds.

This and all other thread based synchronization (in other words, all synchronization) was one of the big gotchas with the CLR & fiber mode support. It’s easy to eliminate this in your own code, but harder to eliminate from other code in your process. One fun issue is the loader lock – which is just another critical section to watch out for (and on win64 it gets more fun w/ exception handling!). Not to mention all the other thread-local state that you need to watch out for that may bite you when using fibers.

    Ahh fibers… this is not the answer you’re looking for.

  17. Nathan Moore says:

    Re: mschaef

    That poker tech thing is talking about processes not threads. You are comparing apples and oranges.

  18. mschaef says:

    Re: Nathan Moore

    I believe the term process, in this context, comes from Tony Hoare’s "Communicating Sequential Processes", a book that contains a lot of the theoretical basis of Erlang.

    From http://www.erlang.org/faq/t1.html,

"Concurrency and message passing are fundamental to the language. Applications written in Erlang are often composed of hundreds or thousands of lightweight processes. Context switching between Erlang processes is typically one or two orders of magnitude cheaper than switching between threads in a C program."

    In any event, I would think having a bunch of OS processes would be worse than having a bunch of OS threads.

  19. Skywing says:

    Actually, there’s a new feature in XP/2003 that lets you (well, you as in the code that creates threads via NtCreateThread — kernel32, in the case of win32 programs) specify a pointer to be auto freed by the kernel when a thread is terminated (released via NtFreeVirtualMemory).

The reason why the kernel doesn’t automatically free user stacks in general is that the kernel didn’t allocate them in the first place, so it doesn’t necessarily know how to free them. The new auto stack free stuff works for all Win32 threads, which have their stacks allocated via a new block returned by NtAllocateVirtualMemory. You could conceivably have a thread that uses stack allocated by HeapAlloc(), or that’s a global variable in a loaded image, or any number of other odd things, though, and in such cases you wouldn’t want the kernel trying to free the stack with NtFreeVirtualMemory.

    "an0nymo0us": For those things, you’re pretty much stuck with using a dedicated thread pool or implementing them yourself (i.e. DNS queries).

  20. TC says:

    Hmm, am I the only person here with 30+ years experience who has never heard of continuation style programming? :-(

    TC

  21. Merle says:

    TC: nope, you’re not the only programmer who has not heard of continuation-style programming (although I’m only halfway to 30 in terms of professional experience).

    However, thanks to the link from Andreas, I now know that I’ve used it. Ah, the joys of setjmp(). I learned so many brutal hacks from reading the source to BSD’s telnet.

    Like genget(), their happy command parser. I abused that in so many of my early C programs.

    http://ftp.gcu.info/tuhs/PDP-11/Trees/2.11BSD/usr/src/ucb/telnet.c

  22. You may check Kegel’s excellent resource http://www.kegel.com/c10k.html

  23. Craig Ringer says:

    nikita, given your "not in C" comment with regards to threading alternatives, you might be interested in examining Twisted – it’s a Python framework for deferred "reactor"-style execution. Some of the things people have done with it are pretty astonishing.

    Speaking of which… I don’t suppose anybody here knows if IronPython is going anywhere? Python on .net might be very interesting indeed.

  24. Craig Ringer says:

    Actually, it appears that many of the criticisms leveled against async servers in C apply to Twisted too. I doubt it’d interest you much, though it’s evidently very useful for many problems.

  25. It’s not like there’s one number that controls everything.

Comments are closed.