Windows 95 and Windows NT manage the TEB on x86 differently


Today, a historical note of no consequence. Just a story.

The Win32 x86 ABI specifies that the FS register holds a selector which is based at the current thread's TEB. In other words, fs:[n] is the nth byte of the TEB.

It so happens that the two operating systems chose to manage the FS register differently.

Windows 95 gave each TEB in the system its own selector.

Windows NT allocated a single selector to represent the TEB, and each time the processor changed threads, the selector base was updated to match the TEB for the new thread. With this model, every thread has the same value for FS, but the selector's descriptor kept changing.
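To make the difference concrete, here is a toy model in Python. It is only a sketch: the descriptor table is a plain dictionary, the class names, selector values, and TEB addresses are all invented for illustration, and nothing here touches real hardware.

```python
# Toy model of the two schemes. Selector values and TEB addresses are
# made up; a real descriptor table is managed by the kernel and the CPU.

class Win95Style:
    """One selector per TEB; a context switch loads a different FS value."""

    def __init__(self, teb_addresses):
        # selector -> base address, one table entry per thread
        self.table = {0x100 + 8 * i: teb for i, teb in enumerate(teb_addresses)}
        self.selectors = list(self.table)
        self.fs = self.selectors[0]

    def switch_to(self, thread_index):
        # FS changes; the table stays as it was.
        self.fs = self.selectors[thread_index]

    def teb_base(self):
        return self.table[self.fs]


class WinNTStyle:
    """One shared selector; a context switch rewrites its descriptor base."""

    TEB_SELECTOR = 0x3B  # arbitrary value chosen for this sketch

    def __init__(self, teb_addresses):
        self.tebs = list(teb_addresses)
        self.table = {self.TEB_SELECTOR: self.tebs[0]}
        self.fs = self.TEB_SELECTOR

    def switch_to(self, thread_index):
        # The descriptor base changes; FS stays as it was.
        self.table[self.TEB_SELECTOR] = self.tebs[thread_index]

    def teb_base(self):
        return self.table[self.fs]
```

In both models, teb_base() returns the new thread's TEB after switch_to(); what differs is whether the FS value or the table entry had to change to get there.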

It's as if you had a car-rental service, and one of the features of the service is that the radio remembers your presets. The instructions for setting the radio are as follows:

  • Enter the four-digit customer preferences ID printed on your receipt.
  • Your radio is now set to your preferences.

There are two ways you could set up this system.

Windows 95 assigns each customer a unique customer preferences ID and prints it on the receipt. When the customer enters the ID, the radio looks up the ID in a database and applies the user's radio preferences.

Windows NT prints the same preferences ID on every receipt: 1234. Before the customer picks up the car from the rental service, the service agent sets the radio to the customer's preferences. When the customer enters the ID, the radio does nothing (aside from printing an error message if you enter anything other than 1234).

Even though the Windows NT way creates more work for the service agent, it does solve an important problem: It lets your service scale to more than 10,000 customers, for once you have 10,001 customers, you cannot assign each of them a unique four-digit ID any more.

Car                  Windows
car                  processor
customer             thread
radio preferences    TEB
customer ID          selector

By assigning a unique selector to each TEB, Windows 95 limited itself to at most 8192 threads. (In practice, the limit was much lower because selectors were used for other things, too.) This was not an issue in practice because Windows 95 would run into other problems long before you ran into the 8192-thread limit.
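The 8192 figure falls out of the x86 selector format. A selector is 16 bits: the low two bits are the requested privilege level, the next bit chooses between the GDT and the LDT, and the remaining 13 bits index into the chosen descriptor table. A quick check of the arithmetic (the helper name is made up for this sketch):

```python
SELECTOR_BITS = 16
RPL_BITS = 2   # requested privilege level
TI_BITS = 1    # table indicator: 0 = GDT, 1 = LDT
INDEX_BITS = SELECTOR_BITS - RPL_BITS - TI_BITS

# Number of descriptors a single table can hold.
MAX_DESCRIPTORS_PER_TABLE = 1 << INDEX_BITS


def split_selector(selector):
    """Break a 16-bit selector into (index, table_indicator, rpl)."""
    return selector >> 3, (selector >> 2) & 1, selector & 3
```

With 13 index bits, a table tops out at 8192 descriptors, so a scheme that spends one descriptor per TEB cannot go beyond that many threads per table.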

But Windows NT was designed to scale to very large workloads, and it couldn't artificially limit itself to a maximum of 8192 threads.

One consequence of changing the meaning of the FS register at every thread switch is that some tricks that happened to work in Windows 95 didn't work in Windows NT. For example, in Windows 95, if you captured the value of the FS register in one thread, you could use it (in violation of the ABI) on another thread in the same process and still access the originating thread's TEB. If you tried this trick on Windows NT, you would just see your own TEB.

In the analogy, it's as if you copied the customer preferences ID from another customer's receipt and tried to use it in your car. In a Windows NT car, you would just get your own preferences again, because every receipt has the same customer preferences ID printed on it.
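The capture-and-reuse trick can be sketched the same way. This is a hypothetical illustration, not real system behavior captured from a machine: the function names, selector values, and TEB addresses are all invented, and each function models what thread B sees when it loads a selector value that thread A captured.

```python
# Sketch: what a selector captured on thread A resolves to when thread B
# uses it. All selector values and TEB addresses are hypothetical.

TEB_A, TEB_B = 0x7FFDE000, 0x7FFDD000


def resolve_on_thread_b_win95(captured_fs):
    # Every thread's descriptor stays in the table, so A's selector still
    # maps to A's TEB even while B is running.
    table = {0x0137: TEB_A, 0x013F: TEB_B}
    return table[captured_fs]


def resolve_on_thread_b_winnt(captured_fs):
    # There is only one TEB selector, and while B is running its
    # descriptor has been rewritten to point at B's TEB.
    shared_selector = 0x003B
    table = {shared_selector: TEB_B}
    return table[captured_fs]
```

Under the Windows 95 model, the value captured on thread A (0x0137 in this sketch) still leads to A's TEB. Under the Windows NT model, A and B hold the same value, so thread B gets its own TEB back.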

According to the Win32 ABI for x86, the only thing you can do with the FS register is dereference it to access your TEB. If you try to fiddle with its value or try to copy its value somewhere, you are off in unsupported territory, and the resulting behavior is undefined.

Comments (22)
  1. anonymouscommenter says:

    Raymond, I never tire of your wonderful analogies used when explaining sometimes slightly obscure topics.

  2. anonymouscommenter says:

    Is 8K threads really so bad?  Assuming 64KB of stack, you'll have used up 512MB of address space by the time you get that far, which is a quarter of the 2GB limit.  And that may not even be the bottleneck.  In "Pushing the limits", Mark Russinovich says that each thread needs 12KB of non-pageable memory in the kernel.  Assuming this was true back when NT came out, you would need 96 megs of RAM to handle more threads than that anyway, which was an "insane" amount of memory in those days (source: blogs.msdn.com/…/54640.aspx).  I'll admit RAM is a short-term concern due to Moore's law, but still: is it worth it?  (I would normally say it's harmless so why not do it anyway, but as you pointed out, there are programs that might break thanks to this optimization.  Not many, but then again, not many would benefit either.)

  3. anonymouscommenter says:

    Were there any compatibility shims added in Windows NT for those misbehaving programs that, when activated, gave each thread in the process its own unique selector?  Or were those misbehaving programs left to break in Windows NT (and their users left to blame Microsoft)?

  4. Myria says:

    Some related stuff about this that has changed since Windows 7…

    Windows 7 added User-Mode Scheduling, allowing switching threads in user mode.  This is only supported in the x86-64 build, as far as I know.  In 7 and 8.0, this was implemented by creating an LDT (local descriptor table) for the process, and creating a descriptor+selector for each user-mode schedulable thread's TEB, much like Windows 9x did.  The user-mode scheduler code then switched threads by changing the value of the GS register.  (*)  Like Windows 9x, this results in a limit of 8191 threads, but this limit is per-process, not for the whole system, which is much more reasonable.

    Windows 8.1 also supports this LDT mechanism, but adds a new one: wrgsbase.  If the host CPU supports the rdgsbase/wrgsbase instruction, Windows 8.1 will enable this instruction and also permit its use from user mode.  Instead of creating an LDT and descriptors for user-mode threads, the kernel just lets user mode arbitrarily set the GS base to whatever address.  Upon entry to kernel mode, the kernel reads the GS base and looks up what address user mode had assigned to GS's base; this base becomes the current TEB.  (This is checked against the list of valid TEBs for security reasons.  It's only a security risk while in kernel mode, though, so checking upon entry is sufficient.)

    (*) x86-64 Windows uses GS for the TEB, whereas x86-32 Windows uses FS.  The reason for this is that the fast x86-64 instruction to simultaneously read out user mode's GS base and replace it with kernel mode's GS base, swapgs, does not have an FS equivalent.  It is doubly convenient to be a different segment register from x86-32 Windows because both the x86-32 and x86-64 TEBs can be loaded into segment registers while in WOW64.

  5. anonymouscommenter says:

    Minor correction to the above: The wrgsbase stuff for Windows 8.1 and 10 is only used for User-Mode Scheduling threads.  I forgot the word "scheduled" in one sentence.

    wrgsbase can be used by non-UMS programs on Windows 8.1 and 10, but the next time the kernel is entered on that CPU, which is the next clock tick or system call, the GS base will be set back to a valid TEB, so it's not a good idea.  Use the real UMS API, which handles user-mode scheduling for you, whether by LDT selectors or wrgsbase.

    wrgsbase support being enabled and allowed is indicated by IsProcessorFeaturePresent(PF_RDWRFSGSBASE_AVAILABLE).  Don't use cpuid to check for the feature.

  6. anonymouscommenter says:

    Why does Windows x64 continue to use segments? Could all of that be handled another way?

  7. anonymouscommenter says:

    @Myria: I think you accidentally answered my question.  I didn't realize the 8k limit was not per process.

  8. anonymouscommenter says:

    Correct, the limit of 8K was system wide.

  9. anonymouscommenter says:

    "Is 8K threads really so bad?"  Yes.  According to Resource Monitor I have 2377 threads right now.  This is on my Win 7 64-bit laptop with a modest 16 GB RAM (currently using half of that) currently running: a few dozen Chrome tabs, an Internet Explorer tab, 3 Notepads, CMake GUI, Git Extensions, Process Explorer, 3 command prompts, an explorer window, Task Manager, Skype, Outlook, KeePass, ScanSnap, and 20 system tray icons I can't be bothered to list right now.

    OK so it's not 8K threads yet, but I bet I've hit that limit in the past and also on my work computer.  It's certainly within striking range.

  10. anonymouscommenter says:

    @James Johnston: You would have hit the memory limit first. Anyway, Win 95 cannot possibly hit the limit as this requires > 128MB RAM, which causes Win95 to crash on boot.

  11. anonymouscommenter says:

    Presumably NT must have allocated one TEB selector per CPU?

  12. anonymouscommenter says:

    "Is 8K threads really so bad?"

    It ought to be enough for anybody.

  13. anonymouscommenter says:

    @kme – you don't actually need to have a different value of the TEB selector per CPU. You can instead have a copy of the GDT per CPU, because there is no need for all CPUs to point to the same GDT. This uses more memory since you have to have many copies of the GDT, but stops you from needing to bake the maximum number of CPUs into the GDT, although it amounts to the same thing. It is also useful for the TSS selector. This is only an issue for x86-32 anyway, because on x86-64 fs and gs relative accesses use the regular data segment, but apply the offsets in the fsbase and gsbase registers so you don't need to allocate selectors for them.

    @EduardoS – Windows on x86-64 continues to use segments because the CPU continues to use segments and Windows has to work on the CPU. Segments on x86-64 don't work the same though. Data segments always start at virtual address 0 and always have full length – the CPU will fault if you try to do anything different from that in long mode. Think of them as being like the appendix of the CPU – they're still there, but they don't do anything useful any more. FS and GS are special, and effectively just give you a fast way of accessing some structure in memory.

  14. Brian_EE says:

    Where your analogy falls apart is that having a person program the radio on each and every car really doesn't scale the same way as the TEB example for NT.

  15. anonymouscommenter says:

    Having a huge number of threads is a "code smell" like having a huge number of handles, IMO.

    Not to pick on Microsoft, but on my PC (currently idle, not doing anything special, rather standard setup) Outlook 2010 uses 50 threads and 10000 handles.

    It's a bit sad when one considers all the good and performant ways to do async work in Windows (like IOCP and thread pools)..

  16. anonymouscommenter says:

    @St,

    I know AMD chose to keep the FS and GS selectors; my question was why MS decided to still use them. Why not just make the only valid selector "0" and solve the TEB problem some other way? Other architectures don't have selectors, so Windows must do this another way there. Why keep this perk on x64?

  17. anonymouscommenter says:

    Other architectures have lots more registers, and those that don't (ARM), use other means (system call?) to retrieve thread local data, at least with other OSs (Linux, *BSD).  Windows is likely the same.

  18. anonymouscommenter says:

    @Joshua:

    "Win 95 cannot possibly hit the limit as this requires > 128MB RAM, which causes Win95 to crash on boot."  Really?  128 megabytes of RAM is too much for Windows 95?

  19. anonymouscommenter says:

    DWalker: I believe the actual limit was 512MB, but I think there was a workaround to allow more.

  20. Myria says:

    @EduardoS: On Windows RT and Windows Phone, whose architecture is ARMv7-Thumb2, an example I just tried implemented NtCurrentTeb() as "mrc p15, 0, r1, c13, c0, 2".  In other words, for thread-local storage, they read a value from a control register into a regular register, r1 in this case.  Some control registers are accessible from user mode on ARM.  This one might be read-only from user mode; I'm not sure.

    I imagine that on x86, the FS/GS segment registers are just used because they are convenient and not used for anything else useful.  (If they didn't exist, ES would be the best alternative, but using it would be annoying, because ES is used implicitly by the stos* and movs* opcodes.)

  21. St says:

    @EduardoS – you really need to use GS on x64 if you want to use the fast syscall/sysret instructions to get in and out of the kernel, which you do because they're fast. When the CPU executes a syscall instruction it doesn't update any of its state, and just switches the privilege level and jumps into the kernel, so the first thing you have to do is to get to a kernel stack. None of the CPU state can be trusted at that point – including the current stack pointer – except the kernel mode GS base which you swap in with the swapgs instruction. You do a swapgs, and then use whatever structure you have pointed to by GS to find your stack so that you can continue operating. Also, accessing the TEB through FS and GS is fast – other processors have similar concepts, such as the control registers on ARM. Not using these facilities out of a sense of purism and using something slower instead would be an odd choice. I gather that Linux does much the same thing, for much the same reasons.

  22. anonymouscommenter says:

    If this appears, it is possible to post a comment after comments are closed. Just a curiosity.

Comments are closed.
