DNS Slow to Start?

Hi my name is Jason. I’m an escalation engineer on the Microsoft critical problem resolutions platform team. We recently dealt with an interesting issue that I would like to share with you and show how we went about discovering the cause. 

The customer stated that when his server (Windows Server 2003 SP1) boots up the DNS Server service does not become availabe/responsive for ~5 minutes. Customer also stated he has other DC's in this domain that are the same model server (same specs too) that do not have the problem. So I ask, "What do you mean not responsive?". During this 5 minute period, the administrator cannot use the DNS MMC to administer the server, and DNS lookups to this server fail with a timeout error. The customer can: Ping the server, connect to file shares, etc.

Troubleshooting

So my initial question was "What is DNS doing for 5 minutes?". In order to determine where dns.exe was hung up or spinning his wheels, I replaced dns.exe with the checked build.

Walkthrough:

1. Downloaded the checked build of SP1

2. Extracted and expanded the checked build of dns.exe (What is a checked build?)

3. Worked through the process of replacing the free build of DNS.exe (Booting into safe mode in order to bypass WFP (Windows file protection))

4. Created the dnsdebug file (a text file with no extension under %windir%\system32\dns). The checked build of DNS looks for this file, and reads the contents to determine the level of output. 

5. In the file I set the debug flags for DNS_DEBUG_INIT (0x00000010 ) & DNS_DEBUG_INIT2 (0x01000000) and set the output to go to the debugger (0x00000002), net result (0x01000012). The image below shows contents of the file.
 

6. Instead of using the debugger I decided to use DebugView from Sysinternals to view the output. I could have attached a debugger to DNS.exe to see the output, but in this case the DebugView was sufficient.

 

With all this in place we rebooted. We launched DebugView and did a "net start dns". Here is a screen scrape of what the output looked like:

image

 

I have highlighted the debug flag I have set.

From this output we were able to determine that the Server was going through the following process:

1. DNS gets started by Service Control Manger (a.k.a. SCM) and then DNS communicates back to SCM (across the NtControlPipe)

2. DNS loads the zones from the registry (HKLM\Software\Microsoft\Windows NT\Current Version\DNS Server\Zones

3. DNS pulls this zones from the Active Directory (as this is an AD integrated zone)

 

During these phases the ability to respond to administrative requests and name lookup requests is blocked (this is by design). So based on this I needed to figure out where we were spending the most of our 5 minutes.

1. For the SCM portion, I used Filemon monitoring only Named Pipes for dns.exe and services.exe.

2. For the registry portion, I used Regmon with a filter for dns.exe

3. For the AD portion, I used perfmon with the NTDS>LDAP Searches/Sec counter

With these running we were able to isolate the delay to the point where DNS was pulling the zones from AD. Here is a scrape of me running this test on my VM:

image

You can see (top left) the service communicating back to SCM across the NtControlPipe. Lets take a moment here and have a short discussion how a service starting works.

1. SCM goes to a registry value that will be used as a seed value to build a named pipe. (HKLM\System\CurrentControlSet\Control\ServiceCurrent\(default) = <a number>
2. The result is incremented by 1 and then returned back to SCM
3. SCM then creates a named pipe using NtControlPipe as the prefix followed by the numerical value from the registry. So when starting the first service, the pipe name would be NtControlPipe1 (To get a peek at these pipes download PipeList from Sysinternals)
4. The SCM then starts the service process
5. The service then goes to the same registry value and uses this seed to know which named pipe he will need to talk to in order to convey progress back to SCM.
 

We can then see (top right) the service pulling in the dns zone information from the registry. Finally we see (bottom) the small spike in LDAP searches/sec (highlighted) where the service was pulling the zone information from the AD. Until the pull of the zones from the registry and AD is complete, the DNS service does not respond to MMC administration or DNS requests.

The majority of the 5 minute delta fell into the last bucket (the LDAP queries to pull the zone data). So begins our next discussion, What causes LDAP to be slow?

So LDAP lives in LSASS.EXE. There are two LDAP interfaces in AD. The entire directory is available of TCP port 389, and then the subset of AD that is published (a.k.a. Global Catalog) over TCP port 3268. The DNS services is pulling the DNS zones from the "Entire Directory" interface.

Lsass memory usage on domain controllers has two components:

  • Data structures, which are like those in other processes (i.e. threads, heaps, and stacks)
  • Database buffer cache, which consists of database pages and index pages for the directory ß This is where a "In-RAM" copy of AD database would be

In Windows Server 2003, there is no limit to how large the database buffer cache can grow (outside of the limits of the 2GB of virtual address space that can be modified via the /3GB switch). This helps explain the behavior of lsass.exe normally being the largest value in the "Mem Usage" category in task manager on a DC.

Q. What can affect the amount of Database buffer cache that LSASS is maintaining "In-RAM?
A. Memory pressure. This could be caused by configuration (pagefile settings, etc.) or another application(s) using and paging a lot of memory on the machine.

lsass.exe on the affected machine was using 300mb in the "Mem Usage" column in task manager. The machine that was not affected was closer to 450mb. This means there was 150mb more data in the Database buffer cache, and due to this, the process is able to respond a lot faster to LDAP requests since he can pull from the RAM cache more and this limits the amount of time spent pulling this data into the cache from the DIT file on the disk.

The solution is to tune the pagefile settings to match the better performing server.

By simply increasing the pagefile from ~1 gig to ~3gig the DNS Server can be available within ~30 seconds

So what are we (Microsoft) doing about this?

Excerpt from the "Book of Longhorn"

What new DNS functionality is provided by this feature in Windows Server "Longhorn"?

The DNS Server service in Windows Server "Longhorn" includes a number of new and enhanced features compared to the DNS Server service that was available in the Microsoft Windows NT® Server, Windows 2000 Server, and Windows Server® 2003 operating systems. The following sections describe these features.

 

Background zone loading

Very large organizations with extremely large zones that store their DNS data in AD DS sometimes discover that restarting a DNS server can take an hour or more while the DNS data is retrieved from the directory service. The result is that the DNS server is effectively unavailable to service client requests for the entire time that it takes to load AD DS-based zones.

A DNS server running Windows Server "Longhorn" now loads zone data from AD DS in the background while it restarts so that it can respond to requests for data from other zones. When the DNS server starts, it:

  • Enumerates all zones to be loaded.
  • Loads root hints from files or AD DS storage.
  • Loads all file-backed zones, that is, zones that are stored in files rather than in AD DS.
  • Begins responding to queries and remote procedure calls (RPCs).
  • Spawns one or more threads to load the zones that are stored in AD DS.

Because the task of loading zones is performed by separate threads, the DNS server is able to respond to queries while zone loading is in progress. If a DNS client requests data for a host in a zone that has already been loaded, the DNS server responds with the dat (or, if appropriate, a negative response) as expected. If the request is for a node that has not yet been loaded into memory, the DNS server reads the node's data from AD DS and updates the node's record list accordingly.

Why is this functionality important?

The DNS server can use background zone loading to begin responding to queries almost immediately when it restarts, instead of waiting until its zones are fully loaded. The DNS server can respond to queries for the nodes that it has loaded or that can be retrieved from AD DS. This functionality also provides another advantage when zone data is stored in AD DS rather than in a file: AD DS can be accessed asynchronously and immediately when a query is received, while file-based zone data can be accessed only through a sequential read of the file.

I hope you find this information helpful.