DC fails logons or experiences LDAP timeouts
This was an interesting one which rolled by recently, and it’s a looong post so I apologize ahead of time.
Let’s start with the end user experience and move on from there: User(s) cannot send or retrieve mail from the Exchange 2010 server.
Well that’s pretty simple. :)
So when we looked at the Exchange servers, we saw a whole slew of events from various Exchange roles and servers. I have changed the target DC name, but the DC failing to respond in a timely manner also varied.
At first we wondered if the queries were in fact “bad” (expensive or inefficient) queries, but after looking at them we found they were all very targeted queries.
“CN=ExchSrv-01 MBX Store 013,CN=Databases,CN=Exchange Administrative Group(BAR),CN=Administrative Groups,CN=XCH,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=DC1,DC=FOO,DC=COM”
This path only had a few objects under it, and the DCs should have been able to handle massive amounts of these calls without even breathing hard. Sure enough, when the perf data was reviewed, the CPU for LSASS was not very high (IIRC something around 20%) during the timeframe in which the Exchange servers were falling over.
Which brings us to performance data. There was no baseline for anything in this environment, so in order to determine what “normal” was, we gathered a day or so’s worth of data.
The list below may not be the exact names as I am typing them from memory, but you get the picture (I hope).
· Overall System CPU
· LSASS CPU
· Process LSASS Private bytes
· NTDS (all)
· Logical Disk
· Physical Disk
However, before we arrived none of this data had been captured, either as a baseline or from a DC during an outage.
So, I wrote a rudimentary LDAP stress tool which can spin up an arbitrary number of threads in order to hit the DCs with various LDAP searches. We wanted to use this during off hours to determine the impact on the DCs, because at the time we did not have any perf data from an actual outage and wanted to gather more data before the next day.
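The original tool isn’t published, but a minimal sketch of that kind of multi-threaded stress driver might look like the following. Everything here is hypothetical: `do_search` is a placeholder that, in practice, would issue a real targeted LDAP query via your client library of choice.

```python
# Minimal sketch of a multi-threaded LDAP stress driver (hypothetical;
# the original tool described above is not published).
import time
import threading
from concurrent.futures import ThreadPoolExecutor

def do_search():
    """Placeholder for a real targeted LDAP search.
    Here we just simulate a fast, inexpensive query."""
    time.sleep(0.001)
    return 1

def stress(threads=8, searches_per_thread=50):
    """Spin up `threads` workers, each issuing `searches_per_thread` searches."""
    results = []
    lock = threading.Lock()

    def worker():
        done = 0
        for _ in range(searches_per_thread):
            done += do_search()
        with lock:
            results.append(done)

    with ThreadPoolExecutor(max_workers=threads) as pool:
        for _ in range(threads):
            pool.submit(worker)
    # The with-block waits for all workers before we tally.
    return sum(results)

if __name__ == "__main__":
    print("completed", stress(threads=8, searches_per_thread=50), "searches")
```

Running several instances of something like this from multiple clients is how you multiply the load beyond what a single process can generate.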
I use Perfmon to trend and analyze high-level data – think of it as day-to-day trending and analysis – and use something like SPA (in 2k8 it’s in Perfmon under “Data Collector Sets”). I think of SPA more like a microscope, for when you want to zoom in on a period of time where you know, or suspect, something is going on. You would never (IMO) want to take a day’s worth of SPA data to plow through.
So, we ran some LDAP stress against the DCs, trying to make them cry uncle and give us a timeout, but they did not. Perhaps we just weren’t generating enough load (even at 800 threads from multiple instances on multiple clients).
What we did see was interesting. The stress tool was indeed driving some of the perf data up, and some of the more interesting counters are the ATQ (Asynchronous Thread Queue) counters:
· \NTDS\ATQ Outstanding Queued Requests
· \NTDS\ATQ Request Latency
· \NTDS\ATQ Estimated Queue Delay
NOTE: ‘normal’ data is on the left (it may not be entirely normal, as this was still a Saturday afternoon, with lower-than-expected numbers).
We looked at ATQ Outstanding Queued Requests to see how many requests were queued at the domain controller. A high level of queuing indicates that requests are arriving at the domain controller faster than they can be processed, which can also lead to high latency in responding to requests. ATQ Estimated Queue Delay is the estimated time (in milliseconds) the next request will spend in the queue prior to being serviced by the directory service; here it was about 1.265 seconds.
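To see why queuing translates directly into delay, here is some back-of-the-envelope queueing arithmetic. The numbers are illustrative assumptions, not measured values, and this is not how the DC computes the counter internally.

```python
# Rough queueing arithmetic (illustrative; not the DC's internal calculation).
# A pool of `threads` workers, each taking `service_s` seconds per request,
# drains the queue at threads / service_s requests per second.
def estimated_queue_delay(queued_requests, threads, service_s):
    throughput = threads / service_s       # requests serviced per second
    return queued_requests / throughput    # seconds the next arrival waits

# Example with assumed figures: 10 queued requests, 4 worker threads,
# ~0.5s average service time gives a delay in the same ballpark as the
# ~1.265 seconds observed above.
print(estimated_queue_delay(10, 4, 0.5))  # 1.25
```

The point of the sketch: delay scales with queue depth and per-request service time, and inversely with the number of worker threads, which is exactly the knob we turn next.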
So as the stress load increased, we see that the delays in processing went up – perhaps to the point of failure?
By default there are 4 threads per processor allocated to the LDAP thread pool. We can change that via the LDAP policies, specifically MaxPoolThreads:
· MaxPoolThreads = Maximum number of threads created by the domain controller for query execution (4 per processor)
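The arithmetic behind the pool size is straightforward; a tiny helper (hypothetical, just to make the numbers in the next step concrete):

```python
# MaxPoolThreads is expressed per processor; the default is 4.
def atq_pool_threads(processors, max_pool_threads=4):
    return processors * max_pool_threads

print(atq_pool_threads(1))     # 4  -> a single-proc DC at the default policy
print(atq_pool_threads(2, 8))  # 16 -> two procs after raising the policy to 8
```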
The theory at this point was that we were clogging up the thread pool and that more throughput here would improve performance, as I did not believe we were constrained by disk, CPU, or memory on these systems. The processor and disk I/O counters, along with the database cache counter (\Database(lsass)\Database Cache % Hit), give some insight into my reasoning here.
We increased the threads to 8 per proc on a test DC and re-ran the LDAP stress.
CPU usage is still within normal ranges ~20%.
We lowered the queue delay and nearly doubled DS search operations (the previous stress run is on the left, the post-8-threads run is on the right; note we also had the VM at one proc previously, and it now has two).
At this point we wait for the next morning to hit, with Perfmon ready to go, SPA ready to go, and ready to get some LSASS.EXE dumps via procdump.exe if the issue occurs again.
Sure enough, Exchange goes down in spectacular flames…
We got the SPA data, LSASS dumps, etc., and in hindsight, some of the DCs had not been rebooted after making the LDAP policy change and so were still at the default 4 threads per processor.
We examined the actual outage perf data and, sure enough, the queue delay was quite high, up to ~4 seconds, as seen in the perf graph below.
But why? Why can this DC not handle the throughput here? Typically I see CPU increase when a DC is under serious load, and these DCs weren’t breathing hard.
The SPA data also showed a large number of LDAPPing requests which took ~2.2 seconds:
In addition, the raw SPA data showed more information (client addresses).
Now this was a bit interesting…Let’s look at the LSASS process dump.
I saw all 8 ATQ threads and all of them were in this state:
Ahh very interesting.
Netlogon, of course, processes DC locator and DsGet* APIs. So, let’s recap what we have here:
· Exchange servers fall over from LDAP delays
· DCs have a default of four threads per processor for processing these requests
· DCs are not CPU constrained and LDAP queries are not expensive or inefficient
· DCs have some clients doing a DC discovery and sending the standard “LDAPPING” as documented in the [MS-ADTS]: Active Directory Technical Specification, you can see this in a network trace as well
· DC threads are all gummed up waiting on name resolution?
Upon further investigation we see the following in the DCs’ netlogon logs:
At the end of the day, after some additional code review and proving it out in a lab: the DC tries to resolve the client-to-site mapping, and if it does not find a subnet for the client and cannot resolve the name in DNS, it falls back to NetBIOS, which takes ~2.5 seconds to time out. The number of calls from clients which fail in this manner, combined with the number of LDAP requests Exchange needed just to handle the morning logons, plugs the few default ATQ threads with client LDAP pings stuck working through NetBIOS name resolution.
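The thread-starvation arithmetic above can be made concrete. This is illustrative only, under the assumption that every failing LDAP ping holds a worker thread for the full NetBIOS timeout:

```python
# Why a ~2.5s NetBIOS timeout starves a small pool (illustrative arithmetic).
# If every failing LDAP ping blocks an ATQ worker for the full timeout,
# the pool can service at most threads / timeout such pings per second.
def blocked_pings_per_second(threads, netbios_timeout_s=2.5):
    return threads / netbios_timeout_s

print(blocked_pings_per_second(4))  # 1.6 -> single-proc DC at the default of 4
print(blocked_pings_per_second(8))  # 3.2 -> still trivially exceeded at logon time
```

With morning-logon LDAP pings arriving faster than one or two per second, the whole pool is blocked almost continuously, and Exchange’s own (perfectly healthy) queries sit behind them in the queue.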
The workaround in this case was to disable NetBIOS on the DCs, but this may not be an option for everyone; fixing the site/subnet mapping or DNS name resolution should also resolve this kind of issue.
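Either fix amounts to making the client-to-site lookup succeed before the slow fallback path is ever reached. A simplified sketch of that decision (the real logic lives in Netlogon and is more involved; the subnet and site names here are made up):

```python
import ipaddress

# Hypothetical subnet-to-site table, as you would configure it in
# AD Sites and Services (names and ranges invented for illustration).
SUBNET_TO_SITE = {
    ipaddress.ip_network("10.1.0.0/16"): "HQ",
}

def client_site(ip):
    """Map a client IP to a site; None means no subnet covers it."""
    addr = ipaddress.ip_address(ip)
    for net, site in SUBNET_TO_SITE.items():
        if addr in net:
            return site
    # No subnet match: the DC falls back to DNS and then NetBIOS name
    # resolution, and the NetBIOS timeout is what plugged the ATQ threads.
    return None

print(client_site("10.1.2.3"))   # HQ   -> fast path, no name resolution
print(client_site("192.0.2.9"))  # None -> slow fallback path
```

Adding the missing subnets (or making DNS resolution succeed) keeps every ping on the fast path, which is why either change clears the queue.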
KEYWORDS: LDAP search hang delay netbios ATQ LDAP timeout