Breaking Down 18056

We have had two blog posts on this blog regarding the 18056 error.  Two from Bob Dorr (and part 2) and another from Tejas Shah.  However, we still see a lot of questions about this error message. This error message can show up for different reasons.  After those two blog posts were made, we released the following:

FIX: Errors when a client application sends an attention signal to SQL Server 2008 or SQL Server 2008 R2
https://support.microsoft.com/kb/2543687

This fix was specific to the following message and having to do with Attentions:

Error: 18056, Severity: 20, State: 29. The client was unable to reuse a session with <SPID>, which had been reset for connection pooling. The failure ID is 29. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.

Since this was released, there has still continued to be confusion over this error.  The intent of the fix above was to limit the amount of noise in the ERRORLOG.  And, this was specific to receiving the State 29 with 18056 when an Attention was received.  The Attention is the important part here.  If an Attention occurred during a reset of a connection, we would normally log that to the ERRORLOG under the State 29.  However, with this fix applied, if the Attention occurs during the reset of a connection, you should no longer see the error within the ERRORLOG.  This does NOT mean that you will no longer see a State 29. 

I will use this post to explain further how we handle these errors to give you a better understanding.  To do that, I will expand on Bob Dorr's blog post that I linked above which lists out the states.  

States

Default = 1,
GetLogin1, 2
UnprotectMem1, 3
UnprotectMem2, 4
GetLogin2, 5
LoginType, 6
LoginDisabled, 7
PasswordNotMatch, 8
BadPassword, 9
BadResult, 10
FCheckSrvAccess1, 11
FCheckSrvAccess2, 12
LoginSrvPaused, 13
LoginType, 14
LoginSwitchDb, 15
LoginSessDb, 16
LoginSessLang, 17
LoginChangePwd, 18
LoginUnprotectMem, 19
RedoLoginTrace, 20
RedoLoginPause, 21
RedoLoginInitSec, 22
RedoLoginAccessCheck, 23
RedoLoginSwitchDb, 24
RedoLoginUserInst, 25
RedoLoginAttachDb, 26
RedoLoginSessDb, 27
RedoLoginSessLang, 28
RedoLoginException, 29 (Kind of generic but you can use dm_os_ring_buffers to help track down the source and perhaps –y. Think E_FAIL or General Network Error)
ReauthLoginTrace, 30
ReauthLoginPause, 31
ReauthLoginInitSec, 32
ReauthLoginAccessCheck, 33
ReauthLoginSwitchDb, 34
ReauthLoginException, 35

**** Login assignments from master ****

LoginSessDb_GetDbNameAndSetItemDomain, 36
LoginSessDb_IsNonShareLoginAllowed, 37
LoginSessDb_UseDbExplicit, 38
LoginSessDb_GetDbNameFromPath, 39
LoginSessDb_UseDbImplicit, 40 (We can cause this by changing the default database for the login at the server)
LoginSessDb_StoreDbColl, 41
LoginSessDb_SameDbColl, 42
LoginSessDb_SendLogShippingEnvChange, 43

**** Connection String Values ****

RedoLoginSessDb_GetDbNameAndSetItemDomain, 44
RedoLoginSessDb_IsNonShareLoginAllowed, 45
RedoLoginSessDb_UseDbExplicit, 46 (Data specified in the connection string Database=XYX no longer exists)
RedoLoginSessDb_GetDbNameFromPath, 47
RedoLoginSessDb_UseDbImplicit, 48
RedoLoginSessDb_StoreDbColl, 49
RedoLoginSessDb_SameDbColl, 50
RedoLoginSessDb_SendLogShippingEnvChange, 51

**** Common Windows API Calls ****

ImpersonateClient, 52
RevertToSelf, 53
GetTokenInfo, 54
DuplicateToken, 55
RetryProcessToken, 56
LoginChangePwdErr, 57
WinAuthOnlyErr, 58

**** New with SQL 2012 ****

DbAuthGetLogin1, 59
DbAuthUnprotectMem1, 60
DbAuthUnprotectMem2, 61
DbAuthGetLogin2, 62
DbAuthLoginType, 63
DbAuthLoginDisabled, 64
DbAuthPasswordNotMatch, 65
DbAuthBadPassword, 66
DbAuthBadResult, 67
DbAuthFCheckSrvAccess1, 68
DbAuthFCheckSrvAccess2, 69
OldHash, 70
LoginSessDb_ObtainRoutingEnvChange, 71
DbAcceptsGatewayConnOnly, 72

Pooled Connections

An 18056 error can only occur when we are trying to reset a pooled connection. Most applications I see these days are setup to use pooled connections. For example, a .NET application will use connection pooling by default. The reason for using pooled connections are to avoid some of the overhead of creating a physical hard connection.

With a pooled connection, when you close the connection in your application, the physical hard connection will stick around. When the application then goes to open a connection, using the same connection string as before, it will grab an existing connection from the pool and then reset the connection.

When a connection is reset, you will not see sp_reset_connection over the wire. You will only see the "reset connection" bit set in the TDS Packet Header.

Frame: Number = 175, Captured Frame Length = 116, MediaType = ETHERNET
+ Ethernet: Etype = Internet IP (IPv4),DestinationAddress:[00-15-5D-4C-B9-60],SourceAddress:[00-15-5D-4C-B9-52]
+ Ipv4: Src = 10.0.0.11, Dest = 10.0.0.130, Next Protocol = TCP, Packet ID = 18133, Total IP Length = 102
+ Tcp: [Bad CheckSum]Flags=...AP..., SrcPort=59854, DstPort=1433, PayloadLen=62, Seq=4058275796 - 4058275858, Ack=1214473613, Win=509 (scale factor 0x8) = 130304
- Tds: SQLBatch, Version = 7.3 (0x730b0003), SPID = 0, PacketID = 1, Flags=...AP..., SrcPort=59854, DstPort=1433, PayloadLen=62, Seq=4058275796 - 4058275858, Ack=1214473613, Win=130304
- PacketHeader: SPID = 0, Size = 62, PacketID = 1, Window = 0
PacketType: SQLBatch, 1(0x01)
Status: End of message true, ignore event false, reset connection true, reset connection skip tran false
Length: 62 (0x3E)
SPID: 0 (0x0)
PacketID: 1 (0x1)
Window: 0 (0x0)
- TDSSqlBatchData:
+ AllHeadersData: Head Type = MARS Header
SQLText: select @@version

In the above example, we are issuing a SQL Batch on a pooled connection. Because it was a pooled connection, we have to signal that we need to reset the connection before the Batch is executed. This is done via the "reset connection" bit.

After the above SQLBatch is issued, the app could then turn around and issue an Attention to cancel the request. This is what resulted in the 18056 with State 29 in the past under the condition of an attention.

Frame: Number = 176, Captured Frame Length = 62, MediaType = ETHERNET
+ Ethernet: Etype = Internet IP (IPv4),DestinationAddress:[00-15-5D-4C-B9-60],SourceAddress:[00-15-5D-4C-B9-52]
+ Ipv4: Src = 10.0.0.11, Dest = 10.0.0.130, Next Protocol = TCP, Packet ID = 18143, Total IP Length = 48
+ Tcp: [Bad CheckSum]Flags=...AP..., SrcPort=59854, DstPort=1433, PayloadLen=8, Seq=4058275858 - 4058275866, Ack=1214473613, Win=509 (scale factor 0x8) = 130304
- Tds: Attention, Version = 7.3 (0x730b0003), SPID = 0, PacketID = 1, Flags=...AP..., SrcPort=59854, DstPort=1433, PayloadLen=8, Seq=4058275858 - 4058275866, Ack=1214473613, Win=130304
- PacketHeader: SPID = 0, Size = 8, PacketID = 1, Window = 0
PacketType: Attention, 6(0x06)
Status: End of message true, ignore event false, reset connection false, reset connection skip tran false
Length: 8 (0x8)
SPID: 0 (0x0)
PacketID: 1 (0x1)
Window: 0 (0x0)

In this case, we would still be in the process of doing the connection reset which would be a problem. Bob Dorr's Part 2 blog that is linked above goes into good detail for how this actually occurs.

So, no more State 29?

The thing to realize about State 29 is that it is a generic state just indicating that an exception has occurred while trying to redo a login (Pooled Connection). This exception was not accounted for in any other logic to produce a different state that is listed above. Something similar to like an E_FAIL or General Network Error.

Going forward, assuming you the above fix applied, or are running on SQL 2012 which has it as well, if you get a State 29, it will not be because of an Attention because we are not logging the 18056 any longer for the Attention, however, if you look at dm_os_ring_buffers, you will still see the actual Attention (Error 3617). We just don't log the 18056 any longer to avoid noise.

<Record id= "3707218" type="RING_BUFFER_EXCEPTION" time="267850787"><Exception><Task address="0x52BDDC8"></Task><Error>3617</Error><Severity>25</Severity><State>23</State><UserDefined>0</UserDefined></Exception><Stack

There are things that occur in the course of resetting a login that could trigger a State 29. One example that we have seen is a Lock Timeout (1222).

In the Lock Timeout scenario, the only thing logged to the ERRORLOG was the 18056. We had to review the dm_os_ring_buffers DMV to see the Lock Timeout.

<Record id= "3707217" type="RING_BUFFER_EXCEPTION" time="267850784"><Exception><Task address="0x4676A42C8"></Task><Error>1222</Error><Severity>16</Severity><State>55</State><UserDefined>0</UserDefined></Exception><Stack

The Lock Timeout was a result of statements issuing "SET LOCK_TIMEOUT 0" which affects the connection itself. When the connection is "reset", the SET statements are carried forward. Then based on timing, and whether an exclusive lock is taken based on what the Login logic is looking for, it could end up affecting Logins off of a Pooled Connection when that connection is reused. The default lock timeout for a connection is -1.

Now what?

If you receive a State 29, you should follow that up by looking in the dm_os_ring_buffers. You will want to look at the RING_BUFFER_EXCEPTION buffer type.

select cast(record as XML) as recordXML
from sys.dm_os_ring_buffers
where ring_buffer_type = 'RING_BUFFER_EXCEPTION'

The error that you find should help explain the condition, and/or allow you to troubleshoot the problem further. If you see 3617, then you will want to look at applying the hotfix above to prevent those messages from being logged. If you see a different error, then you may want to collect additional data (Profiler Trace, Network Trace, etc…) to assist with determining what could have led to that error.

 

Adam W. Saxton | Microsoft Escalation Services
https://twitter.com/awsaxton