We recently dealt with an interesting issue that I would like to share, hope you enjoy. - Jason
Customer reports that within their Citrix server farm (running on Windows Server 2003), when a user logs into or out of a session (seems more pronounced on logoff), ALL users connected to the server experience a ~5-20 second hang. The hang is described as follows:
- Application in the session (i.e. Outlook and Word) stop accepting keyboard input. When the hang subsides, the typed characters show up all at once.
- If applications are moved, they do not redraw
- Cannot click the start menu
- If the user was running performance monitor (in an attempt to diagnose), there would be a gap in the perfmon data that directly correlated to the hang duration
Customer has found that during the timeframe of the hang, Winlogon.exe (in session 0) is showing 25% CPU usage.
Troubleshooting & Analysis
Where to start… to begin, given the details above, we can assume we are not experiencing a hard hang where the server is completely unresponsive. The user can still move windows and the connection to the server is still active. Keyboard input is working, but seems to be buffered during the issue. The windows not redrawing indicates the message pump for each of the affected applications is stuck waiting for something.
In this scenario, I focused on the CPU usage for Winlogon under the premise that the other symptoms were side effects of whatever the CPU was working so hard on. On to the 25%; this is an interesting number to spike at. Generally speaking a thread has work to do, or it doesn't. If there was work to do it would normally take the CPU to 100%. So why 25%? In this scenario, each of the servers within the server farm had 4 processors. So we pegged 1 out of the 4 @ 100% resulting in task manager displaying 25% CPU utilization for the Winlogon in question.
So now we have a CPU spike in Winlogon. Why only the Winlogon in session 0? Going back to the issue summary, the customer reported that if ANY user logs off, ALL users experience a hang, and Winlogon in session 0 spikes. First, let's talk about what winlogon does for a living.
Sidebar: Winlogon on Windows Server 2003.
Any time you ask what a feature is or how it works, you should begin your research with the Windows Technical Reference (try a Live search like "winlogon site:http://technet2.microsoft.com"). In this case I focused on the "How Interactive Logon Works" article.
This article starts with:
The Windows Server 2003 interactive logon architecture includes the following components:
- Graphical Identification and Authentication (GINA) dynamic-link library (DLL)
- Local Security Authority (LSA)
- Authentication packages (NTLM and Kerberos)
So Winlogon has something to do with Interactive logon.
What else can we find out about the responsibilities of Winlogon
- Registers the SAS (the secure attention sequence, a.k.a CTRL-ALT-DEL) during the boot process to keep other programs and processes from using it
- SAS routine dispatching - When Winlogon recognizes a SAS event or the GINA delivers a SAS, Winlogon calls one of the SAS processing functions of the GINA.
- Desktop lockdown - Winlogon helps prevent unauthorized user's from gaining access to system resources by locking down the computer desktop. At any time, Winlogon is in one of three possible states: logged on, logged off, or workstation locked
- User profile loading - After a successful logon, Winlogon loads user profiles into the HKEY_CURRENT_USER registry key.
- Screen saver control - Winlogon monitors keyboard and mouse activity to determine when to activate screen savers.
- Multiple network provider support - If there are multiple network providers installed on a Windows-based system, they can be included in the authentication process and in password-updating operations
Looking at this list of tasks performed by Winlogon, we need to try to determine which task aligns to our symptoms (hang during logon and logoff). I decided to focus on User Profiles due to the relationship between profiles and login and logoff.
What is a User Profile? Back to Live with "user profiles site:http://technet2.microsoft.com". Out of the results, I went with the "User Profile Structure" link. The page provides the following detail:
A user profile consists of:
- A registry hive. The registry is a database used to store computer- and user-specific settings. Portions of the registry can be saved as files, called hives. These hives can then be reloaded for use as necessary. User profiles take advantage of the hive feature to provide roaming profile functionality. The user profile registry hive is the NTuser.dat in file form, and is mapped to the HKEY_CURRENT_USER portion of the registry when the user logs on. The NTuser.dat hive maintains the users environment preferences when the user is logged on. It stores those settings that maintain network connections, Control Panel configurations unique to the user (such as the desktop color and mouse), and application-specific settings. The majority of the settings stored in the registry are opaque to user profiles settings are owned and maintained by individual applications and operating system components.
- A set of profile folders stored in the file system. User profile files are stored in the file system in the Documents and Settings directory, in a per user folder. The user profile folder is a container for applications and other operating system components to populate with subfolders and per-user data, such as shortcut links, desktop icons, startup applications, documents, configuration files and so forth. Windows Explorer uses the user profile folders extensively for special folders such as the users desktop, start menu and my documents folder.
With this we can now look at the Winlogon that is spiking the CPU. My step when looking at a CPU spike for a process is to determine if the time spent is in User-mode or Kernel mode. In task manager on the Performance tab you can monitor CPU usage. If you select the View menu and chose to "Show Kernel Times" you can get additional detail showing you if the time is associated with a User-mode module or a Kernel module. In this case the Winlogon spike showed the spike to be in Kernel time. This means the application (Winlogon) asked the OS to do something and it is trying to get it done.
My second step is to determine what user-mode request led to the spike in Kernel time. To answer this on a production system I chose to use Sysinternal's Process Explorer. I downloaded the tool and set it up so it would be able to obtain symbols from the public MS symbols server. During the repro I monitored the thread CPU time in Winlogon. Here is a screen shot of what Process Explorer looks when looking at the thread activity of an idle Winlogon.
Figure 1 – Process Properties in Process Explorer
When the CPU spike occurred we looked at the stack for the spiked thread (double click on the thread with the high CPU time or click the "Stack" button with the thread highlighted):
Just to make sure, we repro'd again and looked at the call stack
We can see here that both call stacks for the thread in question are RPC threads waiting for incoming requests. The request that came in was from userenv module running in Winlogon associated with the session of the user logging off and came into the Winlogon in session 0. The request is to unload the User Profile (highlighted above). This led to a call to RegUnloadKey call which removes a hive from the registry but does not modify the file containing the registry information.
We now know the issue is related to unloading the user profile and specifically the registry portion of the user profile. Now we need to determine where we are spending the time in kernel mode.
Kernrate is a tool included with the Windows 2003 resource kit. This will let us dig into the kernel mode side of what is going on. I collected the following data from kernrate during the hang condition (Summarized):
Parsing the output, we see that 72.66% of the Kernel time was spent in the idle process. This is attributed to the 3 idle threads that are on the 3 processors not being used by the Winlogon thread. We see that the Winlogon process accounts for 20.39% of the kernel time (the CPU spike on the 4th processor). If we move into the module summary, you can see the correlated CPU times for the modules, INTELPPM (CPU driver) running on the idle processors and NTOSKRNL running on the 4th processor (this is where we will focus). In the final drill down, we see the function summary for NTOSKRNL and we can identify the function in use 82% of the time was CmpCleanUpKCBCacheTable. The Cm function prefix lets us know this is related to Configuration Manager (the Registry – see Table 2-7 from Microsoft Windows Internals 4th Edition for Commonly used Kernel Prefixes).
What do we know? We know the issue manifests most during user logoff. We know during the logoff we are trying to unload the profile. We know that in the process of doing this we are trying to unload a registry hive. We know that leads us to spend a lot of CPU time in the Kernel doing CmpCleanUpKCBCacheTable.
Why does this hang the machine? The registry is protected / synchronized with an ERESOURCE named CmpRegistryLock. While this is held exclusively during this cleanup function, all registry access is blocked (both read and write). This explains all of our symptoms. Applications freeze and do not redraw due to operations that need registry access being done on the GUI thread of a process. Perfmon cannot report due to its dependency on the registry.
Now that we know the exact behavior, we could align it to a known issue that was introduced with the release of Windows Sever 2003 SP1 (applies to both SP1 and SP2)
From the KB cause section:
This problem occurs because of the way that Windows Server 2003 Service Pack 1 (SP1) cleans up user registry information. When a user logs off, there is a five-second delay before Windows Server 2003 SP1 flushes the user registry hive. If the terminal server experiences heavy registry activity, the cached registry information on the terminal server may increase significantly. For example, this behavior may occur when a program scans the user profile during the logoff process. When the terminal server experiences heavy registry activity, it takes Windows longer to flush the data.
The fix applies to Windows Server 2003 with SP1 or SP2.