Why does Win32 even have Fibers?

Raymond's had an interesting series on fibers (starting here), and I thought I'd expand on them a bit.

Fibers were added to Windows (in NT 3.51 SP3, IIRC) because some customers (not just SQL server) believed that they could improve the performance of their server applications if they had more control over their threading environment.

But why on earth did these customers want fibers?

Well, it's all about scalability, especially on MP system.  On a multitasking system, it's easy to forget that a single processor can only do one thing at a time.

The ramifications of this are actually quite profound.  If you've got two tasks currently running on your system, then your operating system will have to switch between each of them.  That switch is called a context switch, and it can be expensive (just for the sake of argument, let's say a context switch takes 5000 instructions).  In a context switch, the operating system has to (at a minimum):

  1. Enter Kernel Mode
  2. Save all the old threads registers.
  3. Acquire the dispatch spinlock.
  4. Determine the next thread to run (if the next thread is in another process, this can get expensive)
  5. Leave the dispatch spinlock.
  6. Swap the old threads kernel state with the new threads kernel state.
  7. Restore the new threads registers.
  8. Leave Kernel Mode

That's a fair amount of work to perform (not outrageous, but not trivial).

The OS won't do this unless it has to.  In general, there are three things that will cause the OS to cause a context switch are (there are others, like page faults, but these are the big ones):

  1. When your thread goes to sleep (either by calling Sleep() or calling WaitFor[Single|Multiple]Object[s])
  2. When your thread calls SwitchToThread() or Sleep(0) (this is a special case of the Sleep() API that is identical to SwitchToThread())
  3. When your thread's quanta elapses.

A thread's quanta is essentially the amount of time that the OS will dedicate to a thread if there's another thread in the system that can also run.  A quantum is something like 5-10 ticks on a workstation and 10-15 on server, and each tick is typically somewhere between 10 and 15 milliseconds, depending on the platform.  In general, your thread will get its full quanta unless there is a higher priority runnable thread in the system (please note: this is a grotesque simplification, but it's sufficient for the purposes of this discussion).

The thing is, for a highly scalable application, context switches are BAD.  They represent CPU time that the application could be spending on working for the customer, but instead is spent doing what is essentially OS bookkeeping.  So a highly scalable application REALLY wants to reduce the number of context switches.  If you ever have a service that's performing poorly, one of the first things to look for is the number of context switches/second - if it's high (for some value of high), then there's invariably a scalability issue in the application that needs to be addressed.

So why fibers?  Because for highly scalable applications, you want each of your threads to get their full quanta - in other words, you want the only reason for a context switch to be reason #3 above. 

Remember the first cause of context switches: Calling WaitFor*Object.  What that means is that if you call EnterCriticalSection on a critical section with contention, then you're highly likely to cause a context switch. The same thing happens when you wait for an I/O to complete, etc.  You absolutely want to avoid calling any Win32 APIs that might block under the covers.

So fibers were created to resolve this issue.  A fiber is effectively removes steps 1, 3, 5 and 8 from the context switch steps above, switching from one fiber to another just saves the old register state, and restores the new register state.  It's up to the application to determine which fiber runs next, etc.  But the application can make its own choices.  As a result, a server application could have a dozen or more "tasks" running on each thread, and they'd radically reduce their context switch overhead, because saving and restoring registers is significantly faster than a full context switch.  The other thing that fibers allow is the ability to avoid the dispatcher spin lock (see John Vert's comment about context switches being serialized across all processors below).  Any global lock hurts your scalability, and fibers allow an application to avoid one of the global locks in the system.

Ok, so why have fibers remained obscure?

They've remained obscure first because of the reasons Raymond mentioned in his fibers caveat here - using fibers is an all-or-nothing thing, and it's not possible to use fibers from a shared library.  As Rob Earhart pointed out in this comment on Raymond's post, some of the idiosyncrasies of the fiber APIs have been resolved in the current versions of Windows.

They're also HARD to deal with - you essentially have to write your own scheduler.

Raymond also left off a couple of other gotchas: For example, if you're using fibers to improve your apps scalability, you can't call ANY Win32 APIs that might block (including filesystem APIs) because all the Win32 blocking APIs are also have thread affinity (not surprisingly :)) So if you're running 20 fibers on a single thread, when any of the fibers blocks, your thread blocks (however, the fibers can be run from another thread, because fibers don't have thread affinity, so if you have a spare thread around, that thread can run the fibers).

The other reason that fibers have remained obscure is more fundamental.  It has to do with Moore's law (there's a reason for the posts yesterday and the day before).

Back when fibers were first implemented, CPUs were a lot slower.  Those 5000 instructions for the context switch (again, this is just a guess) took .05 millisecond (assuming one cycle/instruction) to execute on a 100MHz machine (which would be a pretty fast machine in 1995).  Well, on a 2GHz machine, that .05 is .0025 millisecond - it's an order of magnitude smaller.  The raw cost of a context switch has gone down dramatically.  In addition, there has been a significant amount of work in the base operating system to increase the scalability of the dispatcher spinlock - nowadays, the overhead of the dispatcher lock is essentially nonexistant on many MP machines (you start to see contention issues on machines with a lot of CPUs, for some value of "large").

But there's another aspect of performance that has gone up dramatically, and that's the cost of blowing the CPU cache.

As processors have gotten smarter, the performance of the CPU cache has become more and more critical to their speed - because main memory is painfully slow compared to the speed of the processor, if you're not getting your data from the CPU's cache, you're paying a huge hidden penalty.  And fibers don't fix this cost - when you switch from one fiber to another, you're going to blow the CPU cache.

Nowadays, the cost of blowing the cache has leveled the playing field between OS context switches and fibers - these days, you don't get nearly the benefit from fibers that you did ten years ago.

This isn't to say that fibers won't become useful in the future, they might.  But they're no longer as useful as they were.

Btw, it's important to note that fibers aren't the ONLY solution to the thread quantization issue mentioned above.  I/O completion ports can also be used to limit context switches - the built-in Win32 thread pool uses them (that's also what I used in my earlier post about thread pools).  In fact, the recomendation is that instead of spending your time rewriting your app to use fibers (and it IS a rewrite), instead it's better to rearchitect your app to use a "minimal context" model - instead of maintaining the state of your server on the stack, maintain it in a small data structure, and have that structure drive a small one-thread-per-cpu state machine.  You'll still have the issue of unexpected blocking points (you call malloc and malloc blocks accessing the heap critical section), but that issue exists regardless of how your app's architected.

If you're designing a scalable application, you need to architect your application to minimize the number of context switches, so it's critical that you not add unnecessary context switches to your app (like queuing a request to a worker thread, then block on the request (which forces the OS to switch to the worker, then back to the original thread)). 

Significant Edit (1/10/2005): Fixed several issues pointed out by the base performance team.