What is IRQL?


Jake Oshins wanted to write about IRQLs and I am gladly letting him use my blog as a platform.  Here it is…


I’ve found myself explaining IRQL a lot lately, sometimes to people who want to know because they’re trying to write Windows drivers, and sometimes to people who are accustomed to Linux or some other variant of Unix and want to know why something like IRQL is required within Windows when those systems so clearly get by without it.


Penny Orwick covered this topic before, in the following two papers, with a lot of help from me and some others:


http://www.microsoft.com/whdc/driver/kernel/irql.mspx


http://www.microsoft.com/whdc/driver/kernel/locks.mspx


I’ll try to do it a little more briefly here.


Computers have many things within them that can interrupt a processor.  These include timers, I/O devices, other processors, internal processor performance counters, etc.  All processors have an instruction for disabling interrupts, somehow, but that instruction (cli in x64 processors) isn’t selective about which interrupts it disables.


The people who built DEC’s VMS operating system also helped design the processors that DEC used, and many of them came to Microsoft and designed Windows NT, which was the basis for modern versions of Windows, including Windows XP and Windows 7.  These guys wanted a way to disable (very quickly) just some of the interrupts in the system.  They considered it useful to hold off interrupts from some sources while servicing interrupts from other sources.


They also realized that, just as you must acquire locks in the same order everywhere in your code to avoid deadlocks, you must also service interrupts with the same relative priority every time.  It doesn’t work if the clock interrupts are sometimes more important than the IDE controller’s interrupts and sometimes they aren’t. 


Interrupts are frequently called “Interrupt ReQuests” and the priority of a specific IRQ is its Level.  These letters, all run together, are IRQL.


So if you lay out all the interrupt sources in the system and create a priority for each one, or sometimes a priority for each group, you can start to do interesting things. 
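

As a rough illustration of that idea, here is a toy model of one processor with prioritized interrupt levels.  It is a sketch in portable C, not Windows code: toy_cpu, toy_raise, toy_lower and toy_request are hypothetical names invented for this example, and the 32-level range is simply borrowed from the x86 column of the table later in this post.  A request at or below the current level stays pending; a request above it is serviced right away.

```c
#include <stdint.h>

/* Toy model only: one simulated processor with levels 0..31.
 * None of these names are Windows APIs. */
typedef struct {
    unsigned current_irql;  /* requests at or below this level are held off */
    uint32_t pending;       /* bit n set => a request at level n is waiting */
    unsigned log[32];       /* levels, in the order they were serviced */
    unsigned log_len;
} toy_cpu;

/* "Run" a handler: it executes at its own level, then the old level returns. */
void toy_service(toy_cpu *cpu, unsigned level)
{
    unsigned saved = cpu->current_irql;
    cpu->current_irql = level;
    if (cpu->log_len < 32) {
        cpu->log[cpu->log_len++] = level;
    }
    cpu->current_irql = saved;
}

/* Service every pending request above the current level, highest first. */
void toy_drain(toy_cpu *cpu)
{
    for (int level = 31; level > (int)cpu->current_irql; level--) {
        uint32_t bit = UINT32_C(1) << level;
        if (cpu->pending & bit) {
            cpu->pending &= ~bit;
            toy_service(cpu, (unsigned)level);
        }
    }
}

/* A device (or another processor) asks for attention at the given level. */
void toy_request(toy_cpu *cpu, unsigned level)
{
    if (level == 0 || level > 31) {
        return;  /* level 0 is ordinary thread execution, not an interrupt */
    }
    cpu->pending |= UINT32_C(1) << level;
    toy_drain(cpu);
}

/* Raise the level and hand back the old one so the caller can restore it. */
unsigned toy_raise(toy_cpu *cpu, unsigned new_irql)
{
    unsigned old = cpu->current_irql;
    cpu->current_irql = new_irql;
    return old;
}

/* Lower the level; anything that was held off now gets its turn. */
void toy_lower(toy_cpu *cpu, unsigned new_irql)
{
    cpu->current_irql = new_irql;
    toy_drain(cpu);
}
```

In this model, raising to level 2 and then requesting levels 1 and 5 services level 5 immediately and leaves level 1 pending until toy_lower is called.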


Consider a spinlock.  Spinlocks (at least in the traditional sense) are implemented by having a processor spin in a tight loop trying to atomically modify a variable.  The cache coherency hardware guarantees that only one processor can do that at a time, so lock acquisition goes only to the processor that succeeds.  Other processors keep spinning until they succeed.
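

The spin loop described above can be sketched with C11 atomics.  This is a minimal user-mode illustration under the assumption of a C11 compiler with <stdatomic.h>; toy_spinlock and its functions are hypothetical names, and this is not how the NT kernel implements its own spinlocks.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spinlock: a single flag that is set while some processor owns the lock. */
typedef struct {
    atomic_flag held;
} toy_spinlock;

#define TOY_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* One attempt: atomically set the flag and report whether we were the one
 * who changed it from clear to set. */
bool toy_spin_try_acquire(toy_spinlock *lock)
{
    return !atomic_flag_test_and_set_explicit(&lock->held,
                                              memory_order_acquire);
}

/* Spin in a tight loop until the atomic modification succeeds. */
void toy_spin_acquire(toy_spinlock *lock)
{
    while (!toy_spin_try_acquire(lock)) {
        /* busy-wait: this processor does no useful work here */
    }
}

/* Clear the flag so that exactly one spinning processor can win next. */
void toy_spin_release(toy_spinlock *lock)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

Nothing in this sketch stops the owner from being interrupted or descheduled while it holds the lock, which is exactly the problem the next paragraphs address.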


The processor that “owns” the lock needs to release the lock as soon as possible, as the other (waiting) processors are burning up processor time waiting to acquire the lock.  So you really don’t want to interrupt that processor and schedule some other thread for execution, causing all the waiters to spin until the owning thread is rescheduled.


In this situation, some operating systems encourage the owner of the spinlock to disable all interrupts so that the code can’t be interrupted.  (Note, too, that interrupts really need to be disabled before trying to acquire the lock, or the thread might be interrupted between acquiring the lock and disabling interrupts.)


The designers of VMS and NT decided that they didn’t want to disable all interrupts just because some code somewhere acquired a spinlock.  Some things shouldn’t wait.  TLB flushes are a good example.  So if only some interrupts are disabled while a spinlock is held, then you can still briefly interrupt the code that owns the lock for much more important tasks.  Perhaps even more importantly, you can interrupt the processors which are spinning while they wait to acquire a spinlock, so that they do something useful for these important tasks instead of just spinning.


Note that this means that every spinlock has an associated IRQL, and you have to use that IRQL consistently, or the machine will deadlock.  In NT, by default, every spinlock has the same IRQL, called DISPATCH_LEVEL.  DISPATCH_LEVEL means, essentially, that the interrupts which can cause a thread to stop running are disabled.  (More about that later.)
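

The pairing of a lock with an IRQL can be sketched as follows.  This is a single-processor toy in portable C, not driver code: toy_current_irql stands in for the processor's real IRQL, and every toy_ name is hypothetical.  The ordering is the point: raise first, then take the lock; drop the lock, then restore the saved level.  The kernel's KeAcquireSpinLock and KeReleaseSpinLock follow the same general shape, handing the previous IRQL back to the caller so it can be restored on release.

```c
#include <stdatomic.h>

#define TOY_PASSIVE_LEVEL  0u
#define TOY_DISPATCH_LEVEL 2u

/* Stand-in for the current processor's IRQL (hypothetical, one processor). */
unsigned toy_current_irql = TOY_PASSIVE_LEVEL;

typedef struct {
    atomic_flag held;
} toy_spinlock;

unsigned toy_get_irql(void)
{
    return toy_current_irql;
}

/* Acquire: raise to the lock's IRQL BEFORE spinning, so the owner can never
 * be pre-empted between taking the lock and masking the dispatcher. */
unsigned toy_acquire_at_dispatch(toy_spinlock *lock)
{
    unsigned old_irql = toy_current_irql;
    toy_current_irql = TOY_DISPATCH_LEVEL;
    while (atomic_flag_test_and_set_explicit(&lock->held,
                                             memory_order_acquire)) {
        /* spin; higher-level interrupts could still be serviced here */
    }
    return old_irql;
}

/* Release: drop the lock first, then return to the caller's saved level. */
void toy_release_from_dispatch(toy_spinlock *lock, unsigned old_irql)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
    toy_current_irql = old_irql;
}
```

Every user of a given lock has to go through the same acquire path, which is the "use that IRQL consistently" rule in code form.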


Here’s a table of all IRQLs, as defined in the Windows NT header files (easily seen in the WDK.)


IRQL                              X86 IRQL Value   AMD64 IRQL Value   IA64 IRQL Value   Description
--------------------------------  ---------------  -----------------  ----------------  -----------
PASSIVE_LEVEL                     0                0                  0                 User threads and most kernel-mode operations
APC_LEVEL                         1                1                  1                 Asynchronous procedure calls and page faults
DISPATCH_LEVEL                    2                2                  2                 Thread scheduler and deferred procedure calls (DPCs)
CMC_LEVEL                         N/A              N/A                3                 Correctable machine-check level (IA64 platforms only)
Device interrupt levels (DIRQL)   3-26             3-11               4-11              Device interrupts
PC_LEVEL                          N/A              N/A                12                Performance counter (IA64 platforms only)
PROFILE_LEVEL                     27               15                 15                Profiling timer for releases earlier than Windows 2000
SYNCH_LEVEL                       27               13                 13                Synchronization of code and instruction streams across processors
CLOCK_LEVEL                       N/A              13                 13                Clock timer
CLOCK2_LEVEL                      28               N/A                N/A               Clock timer for x86 hardware
IPI_LEVEL                         29               14                 14                Interprocessor interrupt for enforcing cache consistency
POWER_LEVEL                       30               14                 15                Power failure
HIGH_LEVEL                        31               15                 15                Machine checks and catastrophic errors; profiling timer for Windows XP and later releases


For driver writers, the only IRQLs that are usually interesting are 0 through 2 and DIRQL.  It’s worth mentioning, though, that the NT kernel itself internally has spinlocks at DISPATCH_LEVEL and all the levels above that.


So, now for a tour of interesting IRQLs:


PASSIVE_LEVEL


This is the level at which threads run.  In fact, if you look at the specific definition of “thread” in NT, it pretty much only covers code that runs in the context of a specific process, at PASSIVE_LEVEL or APC_LEVEL.  Deferred Procedure Calls (DPCs) are not threads, in that sense.


Any interrupt can occur at PASSIVE_LEVEL.  User-mode code executes at PASSIVE_LEVEL.


APC_LEVEL


Windows NT has an interesting mechanism for getting into a certain thread context.  You can queue an interrupt to a thread, so that your function will run on that thread’s stack, with that thread’s address space, with that thread’s local storage.  This is useful for I/O completion.  When I/O completes, you queue an APC back to the requesting thread which does the last part of I/O completion in the initiator’s address space.  It’s a neat way to solve a bunch of problems.


If you want to disable interrupts to your thread, you raise to APC_LEVEL.  At least that was the original design.  APCs and the rules around them have grown much more complicated over the years.  At this point, the best that you can say is that if you care to disable APCs, call KeEnterCriticalRegion (http://msdn.microsoft.com/en-us/library/ms801955.aspx) or KeEnterGuardedRegion (http://msdn.microsoft.com/en-us/library/ms801643.aspx.)


Your code generally won’t need to run at APC_LEVEL at all, unless you use Fast Mutexes (http://msdn.microsoft.com/en-us/library/aa490219.aspx.)  Fast Mutexes are somewhat faster than Mutexes (http://msdn.microsoft.com/en-us/library/aa490228.aspx) or other dispatcher objects because, among other things, they hold off APCs by raising to APC_LEVEL.


APC interrupts, by the way, are sent by a processor, either to itself or to another processor.  No external device is involved.


DISPATCH_LEVEL


Windows NT doesn’t have a “scheduler” in the sense that most Unix variants do.  There is no process that decides which other processes should run.  Each processor “dispatches” itself by looking at runnable threads and deciding which one to run next.  This is a scheduler, of sorts, but not the same thing that many people coming from Linux will imagine.


The dispatcher is interrupt driven, in that it won’t allow a thread to run longer than its quantum before scheduling another thread.  But the scheduling clock doesn’t generate dispatcher interrupts directly.  The clock interrupt fires at CLOCK_LEVEL, somewhat more frequently than the thread scheduling quantum.  Various housekeeping tasks happen as a result of the clock interrupt, and one of them is that a dispatcher interrupt is generated by the processor to itself.  (Actually, this internal self-interrupt is often optimized away, but the architectural result is the same as if an interrupt were generated.)


If your code raises IRQL to DISPATCH_LEVEL, you have disabled the dispatcher on that processor, and only on that processor.  This means that your thread will not be pre-empted by another thread and it will not be moved to another processor until you lower IRQL.
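

The relationship between the clock tick, the quantum and the dispatcher can be sketched with a small toy model.  Everything here is an assumption made for illustration: the toy_ names are hypothetical, a quantum of three ticks is arbitrary, and the real dispatcher is far more involved.  What the sketch does show is that ticks keep arriving at DISPATCH_LEVEL, while the thread switch they request is held until IRQL drops.

```c
#include <stdbool.h>

#define TOY_PASSIVE_LEVEL  0u
#define TOY_DISPATCH_LEVEL 2u
#define TOY_QUANTUM_TICKS  3u  /* arbitrary value chosen for this example */

typedef struct {
    unsigned irql;              /* current level of this one processor */
    unsigned ticks_left;        /* clock ticks until the quantum expires */
    bool     dispatch_pending;  /* a dispatcher "interrupt" was requested */
    unsigned thread_switches;   /* how many times the dispatcher ran */
} toy_cpu;

void toy_cpu_init(toy_cpu *cpu)
{
    cpu->irql = TOY_PASSIVE_LEVEL;
    cpu->ticks_left = TOY_QUANTUM_TICKS;
    cpu->dispatch_pending = false;
    cpu->thread_switches = 0;
}

/* The dispatcher only runs while the processor is below DISPATCH_LEVEL. */
void toy_try_dispatch(toy_cpu *cpu)
{
    if (cpu->dispatch_pending && cpu->irql < TOY_DISPATCH_LEVEL) {
        cpu->dispatch_pending = false;
        cpu->thread_switches++;
        cpu->ticks_left = TOY_QUANTUM_TICKS;  /* next thread, fresh quantum */
    }
}

/* The clock fires above DISPATCH_LEVEL, so it ticks regardless of irql.
 * When the quantum runs out it requests the dispatcher rather than
 * switching threads itself. */
void toy_clock_tick(toy_cpu *cpu)
{
    if (cpu->ticks_left > 0 && --cpu->ticks_left == 0) {
        cpu->dispatch_pending = true;
    }
    toy_try_dispatch(cpu);
}

/* Changing level re-checks for a deferred dispatch request. */
void toy_set_irql(toy_cpu *cpu, unsigned irql)
{
    cpu->irql = irql;
    toy_try_dispatch(cpu);
}
```

With the processor at TOY_DISPATCH_LEVEL, three ticks leave dispatch_pending set and thread_switches unchanged; the switch happens only when toy_set_irql brings the level back down.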


Since, as noted above, I/O completion depends on code running at APC_LEVEL, and since APC_LEVEL code won’t run while the processor is at DISPATCH_LEVEL, page faults can’t be resolved at DISPATCH_LEVEL.  So code that holds a DISPATCH_LEVEL lock (like a spinlock) can’t reference memory which might be paged out.


Furthermore, most of the locking primitives that the NT kernel provides are what are called “dispatcher objects” (http://msdn.microsoft.com/en-us/library/aa490210.aspx.)  You can wait on dispatcher objects until they are signaled and, while your code is waiting, the processor is free to get other work done, on behalf of other threads.  This is nice, because, in contrast with the spinlock, which consumes the processor doing no useful work while it’s waiting, dispatcher objects allow the dispatcher to find other work until the reason for waiting can be satisfied.


What this means to you, though, is that you can’t wait on a dispatcher object at DISPATCH_LEVEL.  You’ve already disabled the dispatcher.  Your only choice at DISPATCH_LEVEL is a spinlock.
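

The two DISPATCH_LEVEL rules above can be written down as simple predicates.  This is only a restatement of the text in portable C, with hypothetical toy_ names; it is not a WDK facility, and it deliberately ignores refinements the post does not cover.  A driver could use checks of this shape in debug assertions against the real current IRQL.

```c
#include <stdbool.h>

#define TOY_PASSIVE_LEVEL  0u
#define TOY_APC_LEVEL      1u
#define TOY_DISPATCH_LEVEL 2u

typedef enum {
    TOY_USE_DISPATCHER_OBJECT,  /* waiting is allowed, so a wait lock works */
    TOY_USE_SPINLOCK            /* waiting is not allowed; spin instead */
} toy_lock_choice;

/* Page faults need the dispatcher and APC-level I/O completion, so pageable
 * memory is only safe to touch below DISPATCH_LEVEL. */
bool toy_can_touch_pageable_memory(unsigned irql)
{
    return irql < TOY_DISPATCH_LEVEL;
}

/* Waiting on a dispatcher object hands the processor to the dispatcher,
 * which is disabled at DISPATCH_LEVEL and above. */
bool toy_can_wait_on_dispatcher_object(unsigned irql)
{
    return irql < TOY_DISPATCH_LEVEL;
}

/* Which kind of lock is usable by code that must stay at this level? */
toy_lock_choice toy_lock_choice_at(unsigned irql)
{
    return toy_can_wait_on_dispatcher_object(irql)
               ? TOY_USE_DISPATCHER_OBJECT
               : TOY_USE_SPINLOCK;
}
```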


DIRQL


“DIRQL” is the shorthand that many people (internal to Microsoft and external) use when they mean “the IRQL that the PnP manager assigned to my device’s interrupt, and the associated interrupt spinlock and interrupt service routine.”  When a bus driver requests an interrupt for a device (as when the PCI driver finds the Interrupt Pin register set to some non-zero value, or when it discovers an MSI-X table) it tells the PnP manager two things.  First, it says that the device needs to register an ISR or a set of ISRs.  Next it says something about how the device is attached to any interrupt controllers present in the machine.  The PnP manager picks a processor to attach the interrupt to and picks the IRQL for that interrupt.  Sometimes that choice is constrained by the way the wires are laid out on the motherboard, sometimes not.  That topic is too big for this post.  (I might go into it later.  I wrote the code.)


As you can see from the table above, there is more than one DIRQL.  Unless your device generates more than one interrupt, you don’t really have to care.  Just pass along the values that you were given.  Your interrupt spinlock’s IRQL is that which was assigned to you.  The only thing you have to know about it is that acquiring that lock means that you’ve pre-empted everything happening at lower IRQL.  You haven’t pre-empted things like TLB updates, though, as those still come in at higher IRQL.


If your device does generate more than one interrupt, and if you need one spinlock that is used for both interrupt sources, you need to register your interrupt service routines with the highest of your DIRQLs as the SynchronizeIrql, which will avoid deadlocks by guaranteeing that all your interrupt-related code runs at the highest necessary IRQL.
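

Picking the shared level is just a maximum over the levels you were assigned.  The helper below is a hypothetical sketch in portable C (toy_synchronize_irql is not a WDK function); the value it returns is what you would pass as the SynchronizeIrql when registering each of the device's interrupt service routines.

```c
#include <stddef.h>

/* Return the highest of the DIRQLs assigned to a device's interrupts.
 * Returns 0 when the list is empty (no interrupts assigned). */
unsigned toy_synchronize_irql(const unsigned *dirqls, size_t count)
{
    unsigned highest = 0;
    for (size_t i = 0; i < count; i++) {
        if (dirqls[i] > highest) {
            highest = dirqls[i];
        }
    }
    return highest;
}
```

For a device that was assigned levels 5, 9 and 7, every ISR would be registered with 9, so that code holding the shared interrupt spinlock cannot be interrupted by any of the device's own interrupts.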


In summary, IRQL is a concept that was intended to allow spinlocks to be sorted into more-important and less-important buckets, so that some interrupts can occur while other interrupts are disabled.


Most people agree that this is fairly complex to work with.  Whether you believe this was a necessary addition to the driver model is the subject of a debate that’s been raging on the ‘net since before Windows NT actually existed.


– Jake Oshins

Comments (15)

  1. Navneet Kaur says:

    It says : "If your code raises IRQL to DISPATCH_LEVEL, you have disabled the dispatcher on that processor, and only on that processor. "

    Now I'm confused. Does that mean the dispatcher is still enabled on other processors and can still dispatch other threads? Then how will it cause a deadlock when code running at dispatch level touches paged memory? Paged memory can still be fetched by other processors, which can still run at APC_LEVEL to complete paging I/O.

    Please clarify. Thanks.

  2. doronh says:

    yes, other processors can be running at passive level where the dispatcher can run and schedule other threads on them.  Your confusion is from 2 misunderstandings

    1) all of the other processors could also be at dispatch level or higher and not able to satisfy the page fault

    and more importantly,

    2) the processor that takes the page fault on a paged memory access waits synchronously for that fault to be resolved and the page to be brought in. you can't wait synchronously at dispatch level regardless of what the other cores are doing.

    d

  3. Navneet Kaur says:

    Thank you. the http://www.microsoft.com/…/irql.mspx says "Driver code that is running above PASSIVE_LEVEL (either at PASSIVE_LEVEL in a critical region or at APC_LEVEL or higher) cannot be suspended" but msdn.microsoft.com/…/ff544337(VS.85).aspx says "ExAcquireFastMutex puts the caller into a wait state if the given fast mutex cannot be acquired immediately" and if the thread enters wait state, then this thread can be suspended and another thread scheduled to run. right?

  4. doronh says:

    you are confusing 2 types of suspension.  the IRQL document on whdc that you are referring to means that you cannot suspend the thread using SuspendThread or equivalent KM APIs at passive in a crit region or at APC or higher, since suspending a thread requires sending an APC to that thread and APCs are only processed at passive outside of a critical region.  This type of suspend can be infinite (and thus a denial of service if a user mode thread can suspend a kernel mode thread which is holding a resource needed by others in the kernel).  The second suspend, where you are put into a wait state if the fast mutex cannot be acquired, is a wait on a synchronization object that will resume when the wait is satisfied. This wait should not be infinite.

    d

  5. Gerry Murphy says:

    Thanks. useful article- and I'm not a driver writer or a coder!

  6. Giri says:

    Very useful article.

    In one of  your comments, you mention " you can't wait synchronously at dispatch level regardless of what the other cores are doing."

    Why can't a thread wait synchronously at dispatch level on a given processor in a multi-processor system? I understand that on a single proc system it will cause a deadlock. But in a multi proc system, only the proc running the thread which is waiting will block and this wait can always be satisfied through another thread running on a different processor. In essence, could we theoretically still work around accessing paged memory at dispatch level on a multi-proc system (of course only in the cases where there is at least one proc which is not at dispatch level)? Also, does Windows attempt to bring in the required page through other procs before halting the system?

    Thanks.

  7. doronh says:

    imagine if every one did this and every core was stuck waiting for paging io. you are still stuck.  just because you can get away with it on an MP system most of the time does not mean it is the right thing to do.  Besides, there is no guarantee the paging io will be processed on a different core (but that doesn't really matter).  

    The rule is there for a reason.  Like many kernel mode rules/contracts, it is not always explicitly enforced.  But the lack of explicit enforcement does not mean you should break the rule, it just means you won't get caught immediately.

  8. Ashwin Patti says:

    Good article.

    Can you please explain if any of this changes for win7?

  9. doronh says:

    none of this changed for win7

  10. Dave says:

    I had one basic question: why do we need IRQL.  Doesn't IRQ provide an interrupt priority scheme and can't we program the PIC for setting priority?

  11. doronh says:

    Windows runs on platforms that don't have a PIC. IRQL is the abstraction over the hardware.  for instance, it abstracts the IO APIC as well

    d

  12. Dave says:

    Thanks Doron, that makes sense.  I was reading more related to interrupts and had couple of questions.  

    1. Where is IDT stored?  I read that IDTR register in CPU stores both the physical base address and the length in bytes of the IDT – but does HAL initialize/allocate the IDT as well as IDTR?

    2. I was also keen to find out more about MSI, I read that in MSI Model – "Device interrupts via memory write transactions on the PCI bus", I couldn't get how CPU gets interrupted by a write transaction.  I was keen to know more, will try to search for docs online, appreciate if you can give any pointers.

  13. Dave says:

    I disassembled KiInterruptDispatch and see that it raises the IRQL to the D-IRQL when an interrupt occurs.  My question is: what if an interrupt happens in the gap between the Dispatch code pushing the trap frame and KiInterruptDispatch raising the IRQL to DIRQL?

    Also, two devices would share an IRQ only when IRQ lines are limited (e.g. 15 lines due to a cascaded interrupt controller).  IRQ sharing should be decided at the hardware level only, right, based on the case where two devices are connected to the same interrupt line of the interrupt controller?  In that case, why do we have arbitration/decision making in the PnP manager, HAL, etc. to decide which devices would share an interrupt (e.g. ShareVector in IoConnectInterrupt doesn't make sense since the sharing should be at the hardware level only; am I missing something)?

    Also, I believe APIC support many more lines and hence IRQ sharing shouldn't be required for APIC systems.

  14. King says:

    Which MS team handles IRQL issues?

  15. doronh says:

    King, there is no team that handles IRQL issues per se.  IRQL is just an abstraction provided by the kernel and HAL.  what particular issue are you seeing?