These are distinct errors but I have found that many people blend them together. It really takes an entire chapter to explain these conditions so I am going try to do it in a page or two?
An unresolved deadlock is generally a SQL Server product issue. When an unresolved deadlock is detected a message is added to the SQL Server error log and a mini-dump is captured.
The Lock Monitor is responsible for detecting and resolving deadlocks for SQL Server. It does this by building a lock graph (you may have seen partial visualization of this from the deadlock trace event.) As the graph is built the cycles can be detected, indicating a deadlock condition. Once a deadlock condition is detected the Lock Monitor performs the victim selection and signals the abort.
The key to the ‘unresolved deadlock’ error is the detection of a deadlock but a failure while attempting to select the victim to abort.
Why is a target considered unsafe and not eligible to be a victim? The answer is specifically in the term unsafe. Let me provide an example of an unsafe victim. Assume a session:
- Opens a transaction
- Does some work
- Encounters an error
- Begins rollback processing
- Has to wait on a lock that results in a deadlock loop (Should not occur and usually a product bug as the lock should already have been acquired by the transaction.)
The session is already rolling back so it is unsafe to abort the rollback. Any rollback that fails causes SQL Server to take the database offline, killing all active sessions to the database. Once all connections are terminated the database is reopened in order to execute crash recovery. If crash recovery succeeds the database is available for use but if crash recovery fails the database is marked suspect.
Forcing an abort of a command that is already in rollback would not be the behavior you desire on your production server.
It may not be easy but the fastest way to attack an ‘unresolved deadlock’ problem is to reproduce it. Having a reproduction and the associated mini-dump capture allows Microsoft Support and the SQL Server Development to track down the source of the problem.
As I discussed the ‘unresolved deadlock’ topic with some of my peers the ‘scheduler deadlock’ seemed to get intertwined and confused.
The scheduler deadlock is detected by the Scheduler Monitor, not the lock monitor. The scheduler deadlock is designed to detect when the scheduler does not appear to be making forward progress. This is distinctly different from the lock monitor graph. In fact, the algorithm for scheduler deadlock looks at the scheduler’s task queue (work to be processed) and monitors to see if tasks are being picked up in the queue.
The Scheduler Monitor periodically loops over the schedulers (~5 seconds) checking there state while using the number of yields, work processed and other information to determine the health of the scheduler.
The Scheduler Deadlock (17884) check is as follows.
- Does we have tasks waiting in the schedulers queue
- Are all workers in use and created for the scheduler
- Have we processed any new work since the last scheduler check (completed tasks)
After looping over all schedulers, within the node as there is a Scheduler Monitor per node, if all schedulers are declared stuck (not progressing) the Scheduler Deadlock is flagged as a 17884 error condition. If this condition lasts for an unwanted period of time the 17884 is officially declared and logged. If a single resource is responsible for the majority of wait states this information is also logged in the SQL Server error log.
This is usually a sign of long running queries or severe blocking and the fastest way to attack a scheduler deadlock is to login with the dedicated admin connection (DAC) and take a look at sys.dm_exec_requests as well as sys.dm_os_wait* information.
Bob Dorr - Principal Software Engineer SQL Server