Troubleshoot Connectivity/Login failures (18456 State x) with SQL Server

I know that there are lot of articles over the internet world with this topic... But this article is from my perspective which has detailed troubleshooting steps...

The main problem with troubleshooting connectivity issues with SQL Server is because the error message returned to client is almost generic and incase of login failures, the state is always 1

So let me give you step-by-step approach in troubleshooting connectivity issues with SQL Server.

Here is a flow of how connection to SQL Server is made:

1. Protocol used for making connections to SQL Server

Connecting is made using a protocol based on the "Client Protocols" order specified in your local box.

In the case of screenshot given above, the connection made to any SQL Server from this client machine will use "Shared Memory" protocol first and if that connection fails, it will try to use "TCP/IP" protocol and if that fails, connection request will be made using "Named Pipes" protocol. This is because I have all three protocols "Enabled" and I have specified the order in this way.

Shared Memory protocol can be used only for local server connections whereby SQL Server should be running in the same box where you are trying connect. This protocol will not help you to connect to a Clustered SQL Server instance because the IP address of SQL Server is different from your local node.

You can also force to use specific protocol using syntax:

TCP:SQLSRVRNAME\INSTANCE for forcing connections to use TCP/IP protocol

NP:SQLSRVNAME\INSTANCE for forcing connections to use Named Pipe protocol instead you can also use \\.\pipe\instancename\sql\query

LPC:SQLSRVNAME\INSTANCE for forcing connections to use Shared Memory protocol. LPC stands for Local Procedure Call.

If you force these protocols, then connectivity will happen only using specific protocol and if that fails, connection will fail without trying with further protocols. If you are connecting using a SQL Server alias created in the local client, then the protocol specified in the alias will only be used.

2. Connecitivity flow when using TCP/IP Protocol

While making connections using TCP/IP protocol, the client driver will check whether the instance is DEFAULT instance (MSSQLSERVER is the instance name for default instance) and if yes, the connection is made directly to port 1433 using TCP protocol to the target SQL Server Instance.

For ex: If SQLMOSS is the instance name and if I connect from SQL Server Management Studio to SQLMOSS, a connection request is sent to TCP port 1433 for the IP Address returned by DNS to the hostname SQLMOSS.

If this connection fails, a request is sent to port 1434 over UDP protocol and this is the port and protocol in which SQL Server Browser will be listening for incoming requests in the target SQL Server instance SQLMOSS. Now SQL Server Browser would have already read the port in which requested  SQL Server Instance SQLMOSS\MSSQLSERVER is listening from the registry. So SQL Server Browser knows the TCP port is say 1488, it will return this information back to the requested client that SQLMOSS instance is listening on port 1488.

Now the client will reconnect to SQLMOSS instance using the TCP port 1488 and provided there is not firewall blocking, this connection will succeed.

There are other things like authentication and authorization to complete a login successfully.

3. Authentication:

It means that the username and password you specified in the connection is valid. If you are using Windows authentication and SQL Server server running with any of two mode "Mixed Mode" or "Windows-only" authentication, SQL Server will request Local Security Authority Subsystem Service (LSASS) to verify the credentials of the specified user. Again if the windows account is a local machine account, it is verified against local system security database and if it is a domain account, LSASS then requests Active Directory to authenticate the user.

Once this user is authenticated, a token is then passed to SQL Server that authentication is successfull. There are again two things in this authentication, they are NTLM or KERBEROS. KERBEROS needs SQL Server Service Principal Name to be registered in Active Directory agains the SQL Server node name or the SQL Server Virtual Server Name. If SPN's are not registered, then connection to failback to NTLM depending on your environment settings. For more info about NTLM & Kerberos, refer this article

If the login is a SQL Server login, then SQL Server takes care of authenticating the user because SQL Server stores the username and password. If target SQL Server instance is running in Windows-Only authentication mode then though you are specifying correct password for the SQL login, the connection will fail because SQL Server will not allow Non Windows login. You can change this to make SQL Server to allow Mixed Mode Authentication (Both SQL logins and Windows logins) from SSMS -> Instance -> Properties -> Security -> Server Authentication. Please note that you need to restart SQL Server if you make this change.

4. Authorization:

Once authentication succeeds, authorization phase starts whereby SQL Server checks the permission for accessing the target database based on the TokenPerm cache built and if yes, the login succeeds and if not the connection will fail with State 16.

Once authorization completes, you can submit your queries to execute in the SQL Server instance.

So by now, hope you have the complete flow of how connectivity happens from a client to SQL Server.

Here are some Troubleshooting tips:

1. To make sure SQL Server Browser is able to start (Port 1434 is available for SQL Server Browser to grab), you can try start SQL Server Browser from command line :

C:\Program Files\Microsoft SQL Server\90\Shared>sqlbrowser.exe -c

You should see something like:

 SQLBrowser: starting up in console mode

SQLBrowser: starting up SSRP redirection service

SQLBrowser is successfully listening on ::[1434]

SQLBrowser is successfully listening on 0.0.0.0[1434]

[9088]: Waiting for next request...

This step will ensure that SQL Browser works fine. If you are able to start SQL Server Browser from command line and not as a service, check the service account configured for SQL Server Browser. To shutdown SQL Browser started from command line, you have to press CTRL + C.

2. If SQL Server Browser failed to start, run TCPView from https://technet.microsoft.com/en-us/sysinternals/bb897437 to make sure that no other process is already listening on 1434 UDP. For example if you have SQL Server 2000 already installed on the same node where SQL Server Browser fails to come online, it could be very well because SQL 2000 SSRP service already listening on 1434 and TCPView will help you to troubleshoot this.

3. If connectivity to SQL Server instance fails, first thing to check is whether it is a Clustered SQL Server instance or a Standalone.

If it is a clustered, try connecting from Active node of SQL Server to the SQL Server Instance. If this works then it could be a firewall blocking connections from passive node and other clients.

To confirm that firewall is blocking the connectivity, use PortQry from https://www.microsoft.com/downloads/en/details.aspx?FamilyID=8355e537-1ea6-4569-aabb-f248f4bd91d0&displaylang=en

Here is a sample output when I connect to a server name SQLMOSS (This should be a node name if it is a standalone instance and virtual server name if it is clustered instance)

  =============================================
Starting portqry.exe -n SQLMOSS
-e 1434 -p UDP ...

Querying target system called:   SQLMOSS

 Attempting to resolve name to IP address...

Name resolved to 192.168.100.3

querying...

UDP port 1434 (ms-sql-m service): LISTENING or FILTERED

Sending SQL Server query to UDP port 1434...

Server's response:  

ServerName SQLMOSS
InstanceName MSSQLSERVERIsClustered No
Version 9.00.4035.00
tcp 1433

==== End of SQL Server query response ====

UDP port 1434 is LISTENING
portqry.exe -n SQLMOSS -e 1434 -p UDP exits with return code 0x00000000.
 

If you don't get any instance names returned in the result set, it should be firewall blocking the traffic. It will be a good step to stop Firewall service and give it a try. If you are interested in adding Exception to firewall, add exception for Port 1434-UDP for SQL Browser and Port xxxx-TCP for SQL Server. You can get the port in which SQL Server is listening from SQL Server Errorlog.

If you don't have permission to install/download PortQry, try using telnet program available inbuilt in Windows (For Win > 2008, you need to add Telnet Client Feature from Server Manager). If SQL Server is listening on port 1488 then when I run telnet SQLMOSS 1448 from command prompt, I got a blank screen like below which indicates that SQL Server is listening to my request:

 If this step fails, contact your system administrator with this data so that he can check the firewall logs.

4. If you get login failures, first thing to check is the default trace because default trace gives you the reason for login failure in plain english. Query below will help you to get all Login failures recorded in the current default trace. This information is captured by default in any SQL Server 2005, 2008 instances.

DECLARE @trcpath nvarchar(256)

SELECT  @trcpath=CAST(value as nvarchar(256)) FROM  fn_trace_getinfo(default)  WHERE  property = 2

 SELECT  * FROM fn_trace_gettable (@trcpath,default ) WHERE  EventClass= 20 ORDER BY starttime DESC -- Change "default" to 1 if you want to read information only from current trace file

 In my case, I got TextData like this

Login failed for user 'del'. Reason: Failed to open the database specified in the login properties. [CLIENT: <local machine>]
Login failed for user 'sa'. Reason: Password did not match that for the login provided. [CLIENT: <local machine>]

5. Use the information recorded in Ring_Buffer (Available in SQL 2005 SP2 and higher versions) to find details about the login failure. You can run the query (Source) below to get the error code, time of login failure, API name under the context, Error code returned by Windows API. Once you have the error code, download the err.exe and translate this error code into a meaningful error message. You can also convert the hex error code into a decimal value and use NET HELPMSG <ERRORNUMBER> to get description about the error. Refer my example after this code on how I used err.exe.

 SELECT dateadd (ms, (a.[Record Time] - sys.ms_ticks), GETDATE()) as [Notification_Time], a.* FROM 
(SELECT 
x.value('(//Record/@id)[1]', 'bigint') AS [Record_ID], 
x.value('(//Record/Error/ErrorCode)[1]', 'varchar(30)') AS [ErrorCode], 
x.value('(//Record/Error/APIName)[1]', 'varchar(255)') AS [APIName], 
x.value('(//Record/Error/CallingAPIName)[1]', 'varchar(255)') AS [CallingAPIName], 
x.value('(//Record/Error/SPID)[1]', 'int') AS [SPID], 
x.value('(//Record/@time)[1]', 'bigint') AS [Record Time] 
FROM (SELECT CAST (record as xml) FROM sys.dm_os_ring_buffers 
WHERE ring_buffer_type = 'RING_BUFFER_SECURITY_ERROR') AS R(x)) a 
CROSS JOIN sys.dm_os_sys_info sys 
ORDER BY a.[Record_ID] DESC

The output of above query showed me an errorcode 0x8C2. I ran Err.exe 0x8C2 and this is what I got:

# for hex 0x8c2 / decimal 2242 :
NERR_PasswordExpired lmerr.h

I used NET HELPMSG 2242 (2242 is the decimal equivalent of 0x8C2) and this is what I got:

The password of this user has expired

EXPLANATION

Your password has expired. You will not be able to perform any network tasks unt
il you change your password.

ACTION

To change your password, press Ctrl+Alt+Del and then select Change Password.

Other ring buffer error codes I have seen:

0x89B - NERR_BadPassword

0x8 - ERROR_NOT_ENOUGH_MEMORY (Not enough storage is available to process this command)

0x534 - ERROR_NONE_MAPPED (No mapping between account names and security IDs was done),

For CAPI errors (Error codes like 0x800900xx), Search in SQL Server Security forum for references

Note: You can use TF 4612 to disable ring buffer logging and TF 4613 to generate a minidump file whenever an entry is logged into the ring buffer. (Source)

6. Enable Trace flag 4029 in the startup parameter (Not DBCC TRACEON) which records detailed error message for login failures. For Ex:

Network error code 0x40 occurred while establishing a connection; the connection has been closed. This may have been caused by client or server login timeout expiration. Time spent during login: total 10438 ms, enqueued 1 ms, network writes 0 ms, network reads 10438 ms, establishing SSL 0 ms, negotiating SSPI 0 ms, validating login 0 ms. [CLIENT: x.x.x.x]
...
Network error code 0x2746 occurred while establishing a connection; the connection has been closed. This may have been caused by client or server login timeout expiration. Time spent during login: total 5038 ms, enqueued 1 ms, network writes 1 ms, network reads 5038 ms, establishing SSL 5038 ms, negotiating SSPI 0 ms, validating login 0 ms. [CLIENT: x.x.x.x]

We can then converted the highlighted hexadecimal code to decimal -> Run NET HELPMSG <decimal> value to find out the exact reason for login failure. Also section explaining time spent will give some hints.

7. You can use the Extended Events infrastructure to collect the login failures. I have attached a T-SQL to help you with this.

So it is that simple. You can also use traditional approach of decoding state information manually. For this, you have to check whether "Failed logins" auditing is enabled in SQL Server (SSMS -> Instance > Properties -> Security -> Login Auditing) and if yes, review the SQL Server error log to find the state number in the login failure message. If this Login auditing is not capturing the failed logins, you can modify it to enable failed login auditing. Now try to login again and once it failes, refer to the SQL Server error log, find the state information and  decode them using details provided in https://blogs.msdn.com/b/sql_protocols/archive/2006/02/21/536201.aspx

For your reference, I'm giving some of the most familiar ones:

Error: 18456, Severity: 14, State: 1 Reason: General error message returned to the client. Refer SQL error log or SQL Server Event log or SQL Server default trace for specific state number.

Error: 18456, Severity: 14, State: 2 Reason: Invalid user ID

Error: 18456, Severity: 14, State: 5 Reason: Could not find a login matching the name provided.

Error: 18456, Severity: 14, State: 6 Reason: Attempting to use an NT account name with SQL Server Authentication

Error: 18456, Severity: 14, State: 7 Reason: The account is disabled

Error: 18456, Severity: 14, State: 8 Reason: Password did not match that for the login provided.

Error: 18456, Severity: 14, State: 9 Reason: Invalid Password

Error: 18456, Severity: 14, State: 10 Refer https://support.microsoft.com/kb/925744.

Error: 18456, Severity: 14, State: 11 Reason: Valid login but server access failure. Refer this article for detailed troubleshooting steps (You might also get this error when the windows login is not added to SQL Server or if UAC is enabled )

Error: 18456, Severity: 14, State: 12 Reason: Valid login but server access failure

Error: 18456, Severity: 14, State: 13 Reason: SQL Server service is paused. No new connections can be accepted at this time.

Error: 18456, Severity: 14, State: 16 Reason: User does not have permissions to log into the target database

Error: 18456, Severity: 14, State: 18 Reason: Password Expired. Need to change the password for the user

Error: 18456, Severity: 14, State: 23 Refer https://support.microsoft.com/kb/937745

Error: 18456, Severity: 14, State: 38 Reason: Failed to open the explicitly specified database.

Error: 18456, Severity: 14, State: 40 Reason: Failed to open the database specified in the login properties (Database unavailable)

Error: 18456, Severity: 14, State: 46 Reason: Database explicitly specified in the connection string is not accessible

Error: 18456, Severity: 14, State: 58 Reason: An attempt to login using SQL authentication failed. Server is configured for Windows authentication only.

If the state value you are looking for is missing here, check https://blogs.msdn.com/b/psssql/archive/2010/08/03/how-it-works-error-18056-the-client-was-unable-to-reuse-a-session-with-spid-which-had-been-reset-for-connection-pooling.aspx to see whether you get an answer for your state.

Please note that functionalities and terminologies explained in this article are not so detailed. For detailed architecture on how connectivity works, refer BOL.

Login_Failures_using_XEVENTS.sql