As part of the IIS/ASP.net team, you always have a competitive advantage in troubleshooting if you know how to troubleshoot the related technologies that are involved with the web server, for example security protocols like NTLM and Kerberos. In my 3+ years of experience I have dealt with many Kerberos issues, the most common being authentication failures when accessing content on a file share or calling web services. They typically manifest themselves with a number: *401*.
But that's not what I want to talk about today. You will in fact find umpteen TechNet articles, MS KB articles, and blogs that tell you how to resolve these issues.
What I want to talk about here today is one of the unpopular issues that come into our group, simply because it would appear that IIS has stopped functioning. This typically occurs on Windows Server 2003 operating systems that also have a role as web server. In many cases they will have Exchange Services installed or be playing multiple roles within the network such as a Domain controller, Web server, Exchange Server etc.
The classic symptom in this case will be as follows. Over a period of time clients will fail to get a response from the web server and IE would show
Page cannot be displayed
Rebooting the web server usually resolves the issue, but this is a very annoying situation when you are in production. So let's see what can be done to track down the problem. Note that this post is not an exhaustive guide to resolving Page cannot be displayed errors.
Before I continue, let me also remind you that it's a very bad idea to run a server with multiple roles, especially Domain Controller, Exchange Services, Web Server, SQL Server etc. On a 32 bit platform each process is limited to 2 GB of user-mode address space by default, no matter how much RAM your server holds, and without PAE the operating system will use no more than 4 GB of RAM. Exchange and SQL Server are critical services that require huge amounts of memory, so don't put all of them together. Let a DC be a DC. It gets pounded by authentication requests. For a small web application used by a few users, hosting the web server on a DC may be alright, but you should consider adding a new box if the user base grows.
So on with the problem...
So when users begin to see Page cannot be displayed errors, one of the first things you want to do to diagnose the problem is ask yourself a few questions. For example: Wasn't this server serving pages after its last restart? Most likely your answer would be a yes. So why would it fail now? This is what we want to find out.
These days, most customers run ASP and ASP.net applications. I typically ask customers if they can browse simple HTML pages when the problem condition occurs. Their typical response would be a No. Note that HTML and ASP/ASP.net follow different processing pipelines in IIS. IIS serves HTML itself, but ASP and ASP.net are implemented as ISAPI extensions. So testing whether HTML pages can be browsed helps figure out if the issue is ISAPI related or affects IIS in general. Another thing you may want to find out is whether new updates were installed on the machine.
I pay a lot of attention to the page that clients see in Internet Explorer and the complete error message. They are your first and, most of the time, your most important clues, and they define the path to take to resolve the problem. In this case, you would typically see this:
The page cannot be displayed.
The page you are looking for is currently unavailable. The Web site might be experiencing technical difficulties, or you may need to adjust your browser settings.
Optionally towards the end of the page you will also see
Cannot find server or DNS Error.
So... if the server cannot be found, then it is likely that there are networking issues or you were unable to get a connection from the web server. Immediate things to check will be if the server is up and running and if there are DNS resolution issues. If you are on an intranet, then trying to hit a share on the web server is a good test to see if the network packets are making it to the web server. Once you determine that traffic is reaching the web server, you can then focus your troubleshooting more on the server side of things.
In Windows Server 2003, HTTP.SYS is the kernel-mode HTTP driver responsible for managing client connections and sending the response. IIS sits on top of this driver. When an HTTP request arrives on port 80, the HTTP.SYS driver establishes the connection with the client and then routes the request to IIS. Any errors/events that occur in the HTTP layer are logged in a text file called HTTPErr.log. The default path for this log is C:\Windows\System32\Logfiles\HTTPErr. Your troubleshooting starts here. If you see an entry in here for the affected client (use the IP of the client), there will be an associated reason for it. The various reasons and their meanings are documented here.
The one you are most likely to see here is "Connections_Refused". The reason for this as documented in this KB is "The kernel NonPagedPool memory has dropped below 20MB and http.sys has stopped receiving new connections". So when HTTP.SYS is no longer accepting connections, no requests will make it to IIS, and therefore IIS would appear to be down. But this is not correct. The services would usually be running and idle because no connections are reaching them.
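To make the log check concrete, here is a minimal Python sketch that filters HTTPErr-style lines by client IP and by reason phrase. The sample lines are invented for illustration, and the function name filter_httperr is hypothetical. The only assumption about the layout is that fields are space separated and header lines start with "#", so treat it as a starting point rather than a parser for every HTTPErr configuration.

```python
def filter_httperr(lines, client_ip=None, reason=None):
    """Return log lines that match an optional client IP and/or reason phrase.

    Assumes space-separated fields and '#' header lines (an assumption,
    not a guarantee about every HTTPErr.log configuration).
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '#Fields:' style headers
        fields = line.split()
        if client_ip is not None and client_ip not in fields:
            continue
        # Substring match so that a prefixed value such as
        # '2_Connections_Refused' is still found.
        if reason is not None and not any(reason in f for f in fields):
            continue
        hits.append(line)
    return hits


# Invented sample lines, for illustration only.
SAMPLE = [
    "#Fields: date time c-ip c-port s-ip s-port cs-version cs-method cs-uri sc-status s-siteid s-reason s-queuename",
    "2008-01-15 10:02:11 10.0.0.25 3312 10.0.0.5 80 HTTP/1.1 GET /default.aspx - 1 Timer_ConnectionIdle DefaultAppPool",
    "2008-01-15 10:05:40 - - - - - - - - - 2_Connections_Refused -",
]

refused = filter_httperr(SAMPLE, reason="Connections_Refused")
from_client = filter_httperr(SAMPLE, client_ip="10.0.0.25")
print(len(refused), len(from_client))
```

In practice you would feed it the lines of the log files from the HTTPErr folder mentioned above. Searching by reason is worth doing even when the client IP search comes back empty, since a refused connection may be logged without the client's details.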
Refusing connections when less than 20 MB of NonPaged Pool (NPP) memory is available is the default behavior of HTTP.SYS, meant to keep the system stable. However, this can be overridden using the steps from KB 934878 and restarting the HTTP services. The override allows the HTTP driver to accept connections as long as the available kernel Non Paged pool memory is over 8 MB. Note that this is a painkiller type of fix that buys us some time while we address a much broader problem. It is typically done in production scenarios (where this usually occurs) to restore services while the root cause is being determined.
Once you have isolated the problem this far, you should focus on finding what is exhausting the NPP. This is where you get into the Windows side of troubleshooting. Usually our platforms engineers can assist with this, but like I said in the beginning, it really helps to have a basic understanding of how to troubleshoot further, even when the root cause lies outside the component you know and you are merely the one who reported the problem.
NPP is usually used by system drivers. They use NPP because this is a section of kernel memory that will never be paged to disk. It always stays in memory to deliver high performance. Non Paged pool memory size cannot be configured. However, using the /3GB switch in boot.ini halves the NonPaged Pool's maximum. As many of you know, on a 32 bit platform, the virtual address space for any application is 4 GB - 2 GB of user mode and 2 GB of kernel mode. By using this switch you take away 1 GB from the kernel and give it to user mode, so the kernel's share drops from 2 GB to 1 GB. Many programs such as Exchange Server recommend using this setting, so your affected machine probably has it in its Boot.ini file.
The maximum Non Paged pool size is 128 MB with the /3GB switch and 256 MB without. Conversely, Paged Pool size can often be raised to around its maximum manually via the PagedPoolSize registry setting from KB 304101.
To troubleshoot this further, start by taking a snapshot of the paged and nonpaged pool memory. The easiest way to find out how much NPP is being used is to look at the Performance tab in Windows Task Manager. In this tab, under Kernel Memory (K), note the value against Nonpaged. Subtract this value from 128 MB if the system has the /3GB switch enabled, or from 256 MB if it does not, to see how much NPP is left.
We also have a utility called Poolmon.exe that is available in the Windows Support Tools folder on the installation CD. The exe that is part of the Win XP Support Tools has also worked for me. To get a snapshot, launch a command prompt and run Poolmon.exe from there. Once Poolmon.exe is running, you will see a whole lot of information that may overwhelm some of you, but it's really easy to review. I usually think of what I am suspecting and try to get data to prove my theory. There is no need to memorize the commands, as you can always refer to the help files.
In this case, we know that connections were refused, probably because we are below the 20 MB NPP limit that HTTP.SYS needs in order to accept a connection from the client. Naturally I want to find out if we are at or below this limit. We also know that NPP tops out at 128 MB with the /3GB switch and 256 MB without.
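The subtraction is simple, but it is easy to mix up KB and MB when reading the numbers off the screen, so here is a small Python sketch of it. The function names are hypothetical. The 128/256 MB limits and the 20 MB and 8 MB thresholds are the figures quoted in this post, and the 114776 KB input is the Pool N value from the poolmon sample in this post.

```python
def npp_headroom_mb(used_kb, three_gb_switch):
    """Remaining NonPaged Pool in MB, given usage in KB.

    Uses the limits quoted in this post: 128 MB with /3GB, 256 MB without.
    """
    limit_mb = 128 if three_gb_switch else 256
    return limit_mb - used_kb / 1024


def http_sys_refuses(headroom_mb, threshold_mb=20):
    """True when headroom is below the threshold at which HTTP.SYS stops
    accepting connections (20 MB by default, 8 MB with the override)."""
    return headroom_mb < threshold_mb


used_kb = 114776  # Nonpaged value in KB, from the poolmon sample in this post
headroom = npp_headroom_mb(used_kb, three_gb_switch=True)  # about 15.9 MB

print(round(headroom, 1))
print(http_sys_refuses(headroom))                  # default 20 MB threshold
print(http_sys_refuses(headroom, threshold_mb=8))  # with the 8 MB override
```

With these inputs the server is under the default threshold but still above the relaxed one, which is why the override buys time without fixing anything.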
Click anywhere within Poolmon and press the letter B on the keyboard. This essentially sorts the output by the maximum bytes usage. The top of the output will be something like this:
Memory: 6290512K Avail: 2371728K PageFlts: 2162 InRam Krnl: 3604K P:42372K
Commit: 538908K Limit:11245092K Peak: 609576K Pool N:114776K P:54740K
System pool information
Observe the value against "Pool N" in the second line of the above output. You can see that we are at about 112 MB (114776K). With the /3GB switch, that leaves roughly 16 MB (128 - 112), which is below the 20 MB limit. Just below this header, you will have the top memory consumers.
Tag   Type     Allocs            Frees             Diff     Bytes               Per Alloc
Thre  Nonp      1431458 (   7)    1322591 (   4)   108867   67933008 ( 1872)    624
MmSt  Paged     2101067 (  11)    2098428 (  11)     2639    5060040 (-6136)   1917
ISil  Nonp      1408366 (  56)    1327427 (  58)    80939   36678632 ( -872)    453
I100  Nonp     11048877 ( 217)   10967968 ( 219)    80909   14886928 ( -368)    183
and so forth.
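If you save a poolmon snapshot to a text file, picking out the nonpaged consumers can be scripted. The following Python sketch parses the rows shown above. The regular expressions and function names are my own assumptions based on this one sample, and real poolmon output may differ in spacing or columns, so adjust as needed.

```python
import re

# Rows taken from the snapshot above.
SNAPSHOT = """\
Memory: 6290512K Avail: 2371728K PageFlts: 2162 InRam Krnl: 3604K P:42372K
Commit: 538908K Limit:11245092K Peak: 609576K Pool N:114776K P:54740K
Tag Type Allocs Frees Diff Bytes Per Alloc
Thre Nonp 1431458 ( 7) 1322591 ( 4) 108867 67933008 (1872 ) 624
MmSt Paged 2101067 ( 11) 2098428 ( 11) 2639 5060040 (-6136) 1917
ISil Nonp 1408366 ( 56) 1327427 ( 58) 80939 36678632 ( -872) 453
I100 Nonp 11048877 ( 217) 10967968 ( 219) 80909 14886928 ( -368) 183
"""

# One data row: tag, pool type, then counters with per-interval deltas
# in parentheses. This pattern is an assumption derived from the sample.
ROW = re.compile(
    r"^\s*(?P<tag>\S{1,4})\s+(?P<type>Nonp|Paged)\s+"
    r"(?P<allocs>\d+)\s*\(\s*-?\d+\s*\)\s+"
    r"(?P<frees>\d+)\s*\(\s*-?\d+\s*\)\s+"
    r"(?P<diff>\d+)\s+(?P<bytes>\d+)\s*\(\s*-?\d+\s*\)\s+"
    r"(?P<per_alloc>\d+)\s*$"
)
POOL_N = re.compile(r"Pool N:\s*(\d+)K")


def nonpaged_pool_kb(text):
    """Return the 'Pool N' value in KB, or None if the header is missing."""
    match = POOL_N.search(text)
    return int(match.group(1)) if match else None


def top_nonpaged_tags(text, limit=3):
    """Return (tag, bytes) pairs for nonpaged rows, largest first."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match and match.group("type") == "Nonp":
            rows.append((match.group("tag"), int(match.group("bytes"))))
    return sorted(rows, key=lambda row: row[1], reverse=True)[:limit]


print(nonpaged_pool_kb(SNAPSHOT))
print(top_nonpaged_tags(SNAPSHOT))
```

Filtering on the Nonp type matters here: the MmSt row is paged pool, so it does not count against the NPP limit even though it shows up near the top of the list.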
The first column in here represents a "Pool Tag". This tag is present within the driver file and on Windows 2003 systems, pool tagging is enabled by default. This helps in identifying a driver file that corresponds to this tag, but this is not guaranteed to be unique. Our HTTP.sys driver tag starts with "UL", also known as Universal Listener.
So the obvious next question would be: How to identify the owner of unknown tags?
For 32-bit versions of Windows, use poolmon /c to create a local tag file that lists each tag value assigned by drivers on the local machine (%SystemRoot%\System32\Drivers\*.sys). The default name of this file is Localtag.txt.
KB 298102, "How to Find Pool Tags That Are Used By Third-Party Drivers", can also help.
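The idea behind hunting for a third-party tag is to search the driver binaries for the four tag characters. Here is a rough Python sketch of that search. The function name is hypothetical, and the assumption is that the tag appears in the .sys file as plain ASCII bytes. A match is only a hint, because the same four bytes can occur in a file by coincidence.

```python
from pathlib import Path


def drivers_containing_tag(tag, drivers_dir):
    """List .sys files under drivers_dir whose bytes contain the pool tag.

    Assumes the tag is stored as plain ASCII in the binary. A hit is a
    lead to follow up on, not proof of ownership.
    """
    needle = tag.encode("ascii")
    matches = []
    for path in sorted(Path(drivers_dir).glob("*.sys")):
        try:
            data = path.read_bytes()
        except OSError:
            continue  # skip files that cannot be read (locked, no access)
        if needle in data:
            matches.append(path.name)
    return matches
```

You would call it with the tag from poolmon and the drivers folder, for example drivers_containing_tag("brcm", r"C:\Windows\System32\Drivers"), and then check the version information of any file it reports to find the vendor.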
Most of the pool tags used by Windows are also documented in a file called pooltag.txt that is available with the Windows Resource Kits. So if we see MmSt as the top tag, for instance, we can determine that it belongs to the memory manager. You will also see that the topmost tag is "Thre". If you search for this tag in pooltag.txt you will notice that it corresponds to
nt!ps - Thread objects.
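Looking tags up by hand gets tedious when you have a dozen of them, so here is a small Python sketch that turns pooltag.txt style text into a lookup table. It assumes each entry has the form "tag - owner - description", which matches the two entries quoted in this post. The function name is hypothetical, and lines that do not fit that shape are simply skipped.

```python
import re


def parse_pooltag(text):
    """Map each tag to (owner, description).

    Assumes lines of the form 'Tag - owner - description'. Anything
    else (comments, blank lines) is ignored.
    """
    table = {}
    for line in text.splitlines():
        parts = re.split(r"\s+-\s+", line.strip(), maxsplit=2)
        if len(parts) == 3 and parts[0]:
            table[parts[0]] = (parts[1], parts[2])
    return table


# The two entries quoted in this post, in the assumed file layout.
SAMPLE_POOLTAG = """\
Thre - nt!ps - Thread objects
MmCm - nt!mm - Calls made to MmAllocateContiguousMemory
"""

tags = parse_pooltag(SAMPLE_POOLTAG)
owner, description = tags["Thre"]
print(owner, description)
```

Tags that are missing from the table are the ones worth chasing through the third-party driver files.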
This is Windows kernel's tag for Thread objects. So it is very likely here that there is a program that is creating many threads or handles. Handles reference memory locations and therefore it would make sense to look for a process that has too many handles/threads being created. You can use the Windows task manager to find the process with a high handle count.
Bring up Task Manager and select the Processes tab. From the View menu, choose Select Columns and then check Handle Count and Thread Count. Now it should be easier to find the program with the largest thread or handle counts. Check if shutting down that program/service resolves the issue. If it is a Microsoft component, you should contact Microsoft for further assistance. If it is a non-Microsoft component, follow up with the vendor of that driver/program.
I recently got an interesting case on this issue. While looking at the poolmon output arranged by the biggest consumers of NPP, I found:
A Driver with the tag MmCm is taking 61990296 Bytes.
A Driver with the tag brcm is taking 9338880 bytes
Usually, the top few entries may be related.
MmCm is: nt!mm - Calls made to MmAllocateContiguousMemory (from pooltag.txt)
A colleague who was working with me (Shawn Jarrett from the Exchange team) also matched this data with some entries in the event logs, and together we figured out that this was a Broadcom network adapter. Shawn made a great call: since this server was on SP2, TCP Chimney offloading would be enabled, so it would be a good idea to disable it and test. Windows Server 2003 SP2 enables it by default. If you are interested in reading more about TCP Chimney offloading, please read this article.
Essentially this allows for better network performance, but you will also need the latest driver for your NIC for compatibility reasons.
When we disabled TCP Chimney offloading (from a command prompt, run: netsh int ip set chimney DISABLED), the NPP usage dropped by more than 30 MB and stabilized. I believe this command sets the registry keys mentioned in the article.
NOTE: No reboot or restart is required for this to take effect. It will take effect immediately.
In our case, immediately, the problem was resolved and we were able to browse web pages and access Outlook Web Access.
Problems manifest themselves in many different ways. The symptoms may suggest that the problem lies where you see it, but you may be completely wrong. Always strive to figure out the root cause of the problem and address that. When you do that, everything else will fall into place.