HOWTO: Diagnose IIS6 failing to accept connections due to Connections_Refused

Recently, I have seen a bunch of questions asking: "IIS6 runs fine for X amount of time [where X varies from days to weeks] and then all of a sudden, it stops accepting all connections. If I restart/reboot the server, then it is fine again... until X amount of time later, when I have to repeat the same procedure."

Sometimes, the user even notices that HTTP Error logs in %SYSTEMROOT%\System32\LogFiles\HTTPERR\*.log mention Connections_Refused for all those requests, but what the heck does that mean?

Ok, ignoring the jokers in the back that are snickering "hmm, this is Microsoft software; what do you expect?" , this issue is actually very diagnosable and should be treatable, as I will shortly show...

Question:

Every four or five days the web site hoste on my machine will stop responding to HTTP requests on port 80.

IIS doesn't crash so I cannot get an IIS State log, just this:

Excerpt from the HTTPErr.Log:

(... working fine up until this ...)
2005-09-02 18:45:12 - - - - - - - - - 1_Connections_Refused -
2005-09-02 18:46:07 - - - - - - - - - 1_Connections_Refused -
2005-09-02 18:46:27 - - - - - - - - - 2_Connections_Refused -
2005-09-02 18:46:42 - - - - - - - - - 1_Connections_Refused -
2005-09-02 18:47:02 - - - - - - - - - 2_Connections_Refused -
2005-09-02 18:48:22 - - - - - - - - - 2_Connections_Refused -
2005-09-02 18:48:57 - - - - - - - - - 2_Connections_Refused -
2005-09-02 18:49:02 - - - - - - - - - 1_Connections_Refused -
2005-09-02 18:49:42 - - - - - - - - - 7_Connections_Refused -
2005-09-02 18:50:02 - - - - - - - - - 3_Connections_Refused -
2005-09-02 18:50:37 - - - - - - - - - 6_Connections_Refused -
2005-09-02 18:50:42 - - - - - - - - - 2_Connections_Refused -
2005-09-02 18:51:02 - - - - - - - - - 2_Connections_Refused -
2005-09-02 18:51:47 - - - - - - - - - 1_Connections_Refused -
2005-09-02 18:52:02 - - - - - - - - - 3_Connections_Refused -
2005-09-02 18:52:27 - - - - - - - - - 1_Connections_Refused -
2005-09-02 18:53:17 - - - - - - - - - 1_Connections_Refused -
2005-09-02 18:53:22 - - - - - - - - - 2_Connections_Refused -
2005-09-02 18:54:02 - - - - - - - - - 2_Connections_Refused -
2005-09-02 18:54:22 - - - - - - - - - 1_Connections_Refused -
2005-09-02 18:54:32 - - - - - - - - - 1_Connections_Refused -
2005-09-02 18:55:22 - - - - - - - - - 1_Connections_Refused -
2005-09-02 18:56:02 - - - - - - - - - 2_Connections_Refused -
2005-09-02 18:56:42 - - - - - - - - - 1_Connections_Refused -
2005-09-02 18:58:17 - - - - - - - - - 2_Connections_Refused -
2005-09-02 18:58:42 - - - - - - - - - 2_Connections_Refused -
2005-09-02 18:58:57 - - - - - - - - - 1_Connections_Refused -

I have to reboot the machine to get it to work again.

Does anyone have any ideas?

Answer:

Connections_Refused is actually a very diagnosable failure pattern, and I am going to show you how to diagnose it and what is going on.

What does Connections_Refused Mean?

According to KB 820729, Connections_Refused means that the "Kernel Non Paged Pool memory has dropped below 20MB and HTTP.SYS has stopped receiving new connections". What all this means in English is this:

  • On Windows NT systems, there are many types of memory with different properties. Kernel Non Paged Pool (NPP) memory is one such type, and the important thing to know is that on x86 machines, the size of this memory is fixed and has a maximum size of 128MB. In other words, you may have an x86 machine with 4GB of RAM, but if you only use 128MB and that 128MB all comes from NPP memory, your system is still "out of memory" regardless of how much is actually in use.
    [1/4/2006 - Correction - maximum size on x86 is "around 250MB" and 128MB with /3GB. For Windows Vista, it will be bounded by physical RAM and not arbitrary limit.] 

  • HTTP.SYS, being a kernel-mode driver, uses NPP memory for every single connection that is active, and we made a conscious decision to have it stop accepting connections when NPP memory falls below a threshold, arbitrarily set at 20MB. This is reasonable because lots of bad things can occur if NPP memory is depleted.

    For example: HTTP.SYS is likely the most active codebase in a web server, and if NPP memory gets depleted and HTTP.SYS needs NPP memory to accept a connection, the machine will blue screen. Now, regardless of the driver leaking NPP memory, HTTP.SYS looks like the culprit because it is most active. In any case, all of this is not good, so HTTP.SYS proactively refuses to get near the edge and plays it safe.

Hopefully, it is now clear what Connections_Refused means. HTTP.SYS is basically telling you "someone is using up a lot of NPP memory, and for protective reasons, I am going to stop accepting requests". We need to figure out what driver is using up all the NPP memory and address it, and the Connections_Refused should naturally go away because NPP memory will not be under pressure.

How to Diagnose this Issue

One of the nice properties of NPP memory is that a unique tag is associated with every piece of this memory, so we can always track down who is using what and how much of NPP memory. The user-mode tool to do this is called POOLMON.EXE (if you have a Kernel Debugger attached, you can use the !poolused  command to get this info within the Kernel Debugger). Here are some additional bits of related information:

I am now going to step through the POOLMON output of a user who had Connections_Refused and comment on what is going on. Here is a sample snippet of running POOLMON.EXE -b :

  Memory: 1048016K Avail:  438396K  PageFlts:   219   InRam Krnl: 3604K P:42372K
 Commit: 538908K Limit:11245092K Peak: 609576K          Pool N:109056K P:43740K
 System pool information
 Tag  Type     Allocs            Frees            Diff   Bytes       Per Alloc

 ISil Nonp    1408366 (  56)   1327427 (  58)    80939 36678632 (  -872)   453<br> I100 Nonp   11048877 ( 217)  10967968 ( 219)    80909 14886928 (  -368)   183
 PcNw Nonp     292427 (   0)     88967 (   0)   203460 12203296 (     0)    59
 SavE Nonp    1596066 (   0)   1595659 (   0)      407 11717016 (     0) 28788
 IAM  Nonp    1946164 (  49)   1904725 (  50)    41439 8519576 (  -208)    205
 tdLL Nonp      74748 (   0)     69830 (   4)     4918 2672832 ( -1776)    543
 LSwi Nonp          1 (   0)         0 (   0)        1 2576384 (     0) 2576384
 R200 Nonp         22 (   0)         1 (   0)       21 2297816 (     0) 109419
 MmCm Nonp        302 (   0)        48 (   0)      254 2236960 (     0)   8806
 TCPt Nonp     190486 (   4)    190462 (   4)       24 1392264 (     0)  58011
 ULHP Nonp       3127 (   0)        41 (   0)     3086 1047216 (     0)    339
 PTrk Nonp     999750 (   0)    996561 (   0)     3189  956824 (     0)    300
 File Nonp    9628160 (  65)   9623252 (  65)     4908  750688 (     0)    152
 rg81 Nonp       6910 (   0)      5294 (   0)     1616  603864 (     0)    373
 Pool Nonp          6 (   0)         3 (   0)        3  602112 (     0) 200704
 Devi Nonp        680 (   0)       262 (   0)      418  558552 (     0)   1336
 Thre Nonp     417221 (   6)    416457 (   6)      764  476736 (     0)    624
 LSwr Nonp        128 (   0)         0 (   0)      128  416768 (     0)   3256
 Mm   Nonp         26 (   0)         2 (   0)       24  379880 (     0)  15828
 AfdC Nonp      21989 (   0)     19872 (   1)     2117  338720 (  -160)    160
....

Some observations:

  • Notice that this system has 1GB RAM but still only 128MB NPP memory (109MB has been used, hence tripping the 20MB marker of HTTP.SYS).
  • The NPP memory tags of HTTP.SYS all start with "UL" (for the trivia-inclined: as with all software projects, HTTP.SYS has gone through some evolution of its own. It used to be called something else - Universal Listener - hence its tag names all started with "UL").
  • Notice that HTTP.SYS is not even in the top-10 NPP memory usage on the system, and its largest set of allocations is barely using 1MB total. As it turns out on this system, TOTAL NPP memory usage by HTTP.SYS was under 2MB, so clearly, it is not the aggressive user of NPP memory that is now causing issues.
  • Meanwhile, focus your attention on the top four users of NPP memory, with tags "ISil", "I100", "PcNw", and "SavE". I looked up drivers for those tags, and I only find that "PcNw" comes from Microsoft and is associated with WDM audio. The others are all 3rd party drivers.
  • Umm, this is a server, so is audio really that important? I checked on my system, and it was at 275KB, so clearly this user is/has done some heavy-duty audio stuff on this server. Thus, the memory may be legitimately used. You know, you might want to reconsider the effects of playing your MP3s on your Web Server's ability to accept connections...
  • Meanwhile, Isil and I100 have suspiciously similar active allocations and ISIL is the stock ticker for a semiconductor company, so my guess is that I100 is related to a Fast Ethernet card... and since web servers totally rely on the network driver, having a memory-hog for a network driver cannot be a good thing and should be fixed. My guess for SavE is probably anti-virus.
  • Now, I may be totally wrong with those arbitrary tag names, but one thing is for certain - IIS6 and HTTP.SYS is running just fine on this system. There are some other memory-hungry drivers running on the system, and if you get fixes for them such that your system will stay running longer, IIS6 will likely keep running as well

Note: Don't shoot the messenger (HTTP.SYS)! Always strive to figure out the root of your problem and address that. The rest of the problems should naturally fall away.

//David