Introducing the problem scenario
Here is yet another Kerberos authentication failure troubleshooting scenario. This one keeps paying me a visit every now and then, always with the same type of symptoms:
- SQL Server named instance configured to listen on a static port.
- The logged-on user (on a client machine) and the SQL Server machine are in two different domains.
- A domain controller has been upgraded to Windows Server 2012.
- The domain trust is the same as it was before.
- The SPNs exist as before, as in the working scenario.
Yet, when the user tries to connect to the SQL Server instance from SQL Server Management Studio, they get the familiar Kerberos authentication failure error message:
"The target principal name is incorrect. Cannot generate SSPI context"
Server: xxxxx.DomainB.com (this is a standalone SQL Server)
SQL Server Instance: xxxxx\SQLx (version 2014)
SQL Server Domain: DomainB
Domain controllers: currently Windows Server 2012. This was an upgrade, and it is the only change.
Let us collect some data to see what went wrong
As I typically do, after verifying the error message from SSMS, I first wanted to make sure we had the correct SPNs. So I ran the following command (on the SQL Server machine, where the tool was available, so it was fine to run it there):
SETSPN -L <SQL Server Instance Service Account>
This showed that we had the required SPNs. Since the SPNs were there, I did not bother with Kerberos Configuration Manager, which is another way to verify SPNs and add any that are missing. This sure saves a lot of time.
I also use LDIFDE to make sure we do not have duplicates. So I ran the following command against both the user DC and the server DC:
LDIFDE -f check_SPN.txt -t 3268 -d "" -l servicePrincipalName -r "(servicePrincipalName=MSSQLSvc/XXXXXXXX*)" -p subtree
Here XXXXXXXX is the SQL Server machine where instance SQLx is running.
When we ran it against the SQL Server machine's domain (DomainB), we saw the same SPNs as in the SETSPN -L output, as we would expect. However, when we ran it against the user domain (DomainA), there was no output.
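The duplicate check on the LDIFDE export can also be scripted. The following is a minimal sketch, not part of the original troubleshooting: it assumes the export is plain LDIF text with unfolded "dn:" and "servicePrincipalName:" lines (no base64-encoded values), and the function name, account names, host name, and port in the sample are hypothetical.

```python
from collections import defaultdict


def find_duplicate_spns(ldif_text):
    """Return {spn: [dn, ...]} for every SPN held by more than one object.

    Assumption: simple LDIF with one attribute per line; folded lines and
    base64 ("::") values are not handled in this sketch.
    """
    owners = defaultdict(set)
    current_dn = None
    for raw in ldif_text.splitlines():
        line = raw.strip()
        lowered = line.lower()
        if lowered.startswith("dn:"):
            current_dn = line[3:].strip()
        elif lowered.startswith("serviceprincipalname:") and current_dn:
            spn = line.split(":", 1)[1].strip()
            # SPNs are compared case-insensitively, so normalize the key.
            owners[spn.lower()].add(current_dn)
    return {spn: sorted(dns) for spn, dns in owners.items() if len(dns) > 1}


# Hypothetical sample: two accounts holding the same SPN.
sample = """\
dn: CN=svc_sql,CN=Users,DC=DomainB,DC=com
servicePrincipalName: MSSQLSvc/sqlhost.DomainB.com:50000

dn: CN=old_svc,CN=Users,DC=DomainB,DC=com
servicePrincipalName: MSSQLSvc/sqlhost.DomainB.com:50000
"""
print(find_duplicate_spns(sample))
```

An empty result means no SPN in the export is registered on more than one object, which is what we want to see.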
Another tool I commonly use is SSPIClient, and I leveraged it in this scenario to get more information. I ran the tool from the client machine and tried a connection test against the SQL instance xxxxx\SQLx. Two messages were of interest:
SubStatus=0xc000005e -> There are currently no logon servers available to service the logon request
SPN MSSQLSvc/XXXXXXXX.DomainB.com:yyyy not found anywhere in Active Directory
Here yyyy is the port number SQL Server is currently listening on, configured as a static port for the instance SQLx.
The client was looking for the SPN @DomainA.
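To make the SPN the client asks for concrete, here is a small sketch that builds the MSSQLSvc SPN strings for a host and compares them against the text printed by SETSPN -L. It is illustrative only: the helper names, host name, port, and service account below are hypothetical, and it assumes SETSPN -L lists one SPN per line.

```python
def expected_sql_spns(fqdn, port=None, instance=None):
    """Build the MSSQLSvc SPN strings for a SQL Server host.

    The port-based form (MSSQLSvc/<fqdn>:<port>) is the one seen in the
    SSPIClient message; the instance-name form is included for completeness.
    """
    spns = []
    if port is not None:
        spns.append(f"MSSQLSvc/{fqdn}:{port}")
    if instance:
        spns.append(f"MSSQLSvc/{fqdn}:{instance}")
    return spns


def spn_registered(expected, setspn_output):
    """Case-insensitive check that an SPN appears as a line of SETSPN -L output."""
    registered = {line.strip().lower() for line in setspn_output.splitlines()}
    return expected.lower() in registered


# Hypothetical values standing in for XXXXXXXX, yyyy and the service account.
spns = expected_sql_spns("sqlhost.DomainB.com", port=50000, instance="SQLx")
setspn_text = (
    "Registered ServicePrincipalNames for CN=svc_sql,CN=Users,DC=DomainB,DC=com:\n"
    "\tMSSQLSvc/sqlhost.DomainB.com:50000\n"
)
for spn in spns:
    print(spn, "registered" if spn_registered(spn, setspn_text) else "missing")
```

This only confirms that the string exists on the account. It says nothing about which domain the client searches, which was the actual problem here.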
Lastly, I used netmon as well to double-check. It also shows that the client is looking for the SPN "MSSQLSvc/XXXXXXXX.DomainB.com:yyyy" in DomainA, which typically it should not.
I also verified that the netlogon server for the machine where the user is logged on is different from the netlogon server for the SQL Server. This was important because, when the two talk to different netlogon servers, those servers may not be in sync on AD objects.
Based on the above findings, it made sense to me that we were likely dealing with a domain trust issue, since I believed a successful Kerberos flow needs a two-way trust. I also had the "two different logon servers" clue in mind, which could be an issue. Since I am not an expert in Active Directory matters, I sought some help to get to the bottom of this behavior.
First attempt at resolving the issue
With help from our Active Directory expert, we first thought it was a permission issue. The following KB helped in our first attempt at resolving it:
We granted "Read all properties" and "Write all properties" to SELF on the SQL Server service account. This was done on the domain controller for the SQL Server machine's domain (DomainB).
This seemed to help. We were now able to connect to SQL Server from the test client machine with one user.
Note: I found that with only an outbound trust, the authentication flows fine as long as the proper permissions are in place. Maybe I need to dig a little more into this part.
The issue came back though
Unfortunately, the stubborn Kerberos error persisted for other users, or came back after some time.
Renewed effort to find the root cause and resolve it for all
We continued our digging to find the root cause of the issue.
We first ran the command nltest /verify:xxxxx.DomainB.com to verify trust on the client.
This gave us:
...no such domain exists/no logon servers were available, with status code 1355.
This is similar to the message we saw in our SSPIClient trace test (SubStatus=0xc000005e -> There are currently no logon servers available to service the logon request).
We looked at the locally cached Kerberos tickets (by running "klist" from the command line on the client); there was no Kerberos ticket for our SQL Server instance.
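Scanning the klist output for a SQL Server service ticket can be automated when it has to be repeated for several users. The sketch below is an assumption-laden illustration: it expects English klist output in which each cached ticket has a line of the form "Server: <service> @ <REALM>", and the function names, host name, port, and sample text are hypothetical.

```python
def cached_service_principals(klist_output):
    """Collect the value of every 'Server:' line from klist output."""
    principals = []
    for line in klist_output.splitlines():
        line = line.strip()
        if line.startswith("Server:"):
            principals.append(line[len("Server:"):].strip())
    return principals


def has_sql_ticket(klist_output, fqdn):
    """True if a ticket for MSSQLSvc/<fqdn> (any port or instance) is cached."""
    wanted = f"mssqlsvc/{fqdn}".lower()
    for principal in cached_service_principals(klist_output):
        service = principal.split("@", 1)[0].strip().lower()  # drop the realm
        if service.split(":", 1)[0] == wanted:  # drop the port or instance
            return True
    return False


# Hypothetical klist output with only a TGT cached, as in the failing case.
sample_klist = """\
Cached Tickets: (1)

#0>     Client: someuser @ DOMAINA.COM
        Server: krbtgt/DOMAINA.COM @ DOMAINA.COM
"""
print(has_sql_ticket(sample_klist, "sqlhost.DomainB.com"))
```

In practice the text would come from running klist on the client right after a connection attempt.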
To formulate an SPN we need forward and reverse lookups of the host, so the DNS queries need to be tracked. We therefore took a netmon trace on the client machine, filtered for DNS, while testing the connection to SQL Server, after first running the following commands from the command line:
- ipconfig /flushdns
- klist purge
The trace shows that a DNS query for the SQL Server was triggered and reached the DC, but the DC could not resolve the name.
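A quick way to reproduce this kind of resolution check without a network trace is to do the forward and reverse lookups directly from the client. This is a minimal sketch using Python's standard socket module; the function name is hypothetical, it uses whatever resolver the client OS is configured with, and it only covers IPv4 forward lookups.

```python
import socket


def check_forward_reverse(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Resolve hostname to an IP, resolve that IP back, and compare the names.

    The resolver functions are parameters so the check can be exercised
    with stand-ins; by default the system resolver is used.
    """
    result = {"hostname": hostname, "ip": None, "reverse_name": None,
              "consistent": False, "error": None}
    try:
        ip = forward(hostname)               # forward (A record) lookup
        result["ip"] = ip
        reverse_name = reverse(ip)[0]        # reverse (PTR) lookup, primary name
        result["reverse_name"] = reverse_name
        result["consistent"] = (
            reverse_name.lower().rstrip(".") == hostname.lower().rstrip(".")
        )
    except OSError as exc:                   # gaierror and herror derive from OSError
        result["error"] = str(exc)
    return result
```

Calling check_forward_reverse with the SQL Server's fully qualified name from the client machine returns a dictionary. A non-empty "error" corresponds to the failed resolution seen in the trace, while "consistent" being False points at mismatched forward and reverse records.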
Checking the msdcs for the SQL Server, we found that a wrong entry had been populated for the SQL Server machine. So we corrected the entry on the logon server DC (for the client) and then replicated the change to all involved DCs. At first we had a problem getting the zone information, so we re-created the zone and received the zone updates. This finally did the trick: the issue was completely resolved at this point, and all users are now able to connect over Kerberos.