SQL Server 2008 R2 New Non-Yield Ring Buffer Information

In 2002 the SQLOS team added specific checks for non-yielding scheduler issues.   You may be familiar with the 178** series of errors like 17883 scheduler non-yield.   Since 2002 the test matrix for SQL Server has flagged these errors and corrected them.   With the evolution of SQL Server 2005, 2008 and now 2008 R2 the number of self-inflicted 178** errors is less than I can count on one hand since 2007.

We are finding that the vast majority of the 178** error conditions are caused by external factors and processes.  So in SQL Server 2008 R2 additional ring buffer entries were added to show system wide information about processes and threads, helping to pinpoint the cause of the problem.

There is now non-yield system information ring buffers that will show things like the following performance values. The data is self-explanatory and is collected with a combination of the performance monitor APIs and Toolhelp.

The data is only output by one scheduler monitor. It does not have to be the Node 0 scheduler monitor but it is protected so only a single scheduler monitor will generate the record and secondary monitors will avoid the production properly.

The frequency of the record collection can increase when a non-yield situation is occurring and each ring buffer currently allows up to 1024 records.

Process information on the system showing common memory usage statistics. If another process is causing excessive memory pressure the information might be revealed by the ring buffer capture.

<NonYieldProcessTable>
  <ProcessID>%d</ProcessID>

  <ProcessName><![CDATA[%ls]]</ProcessName>" ------- Controlled with TRCFLG_RETAIL_COLLECT_PROCESS_NAME (-T1264) to include the EXE file name

<PageFaultCount>%d</PageFaultCount>
<WorkingSetSize>%ld</WorkingSetSize>
<PrivateUsage>%ld</PrivateUsage>
<NonYieldProcessTable>

 Thread level information on the system showing common processing time data points.    
 <NonYieldThreadTable>
 <ProcessID>%d</ProcessID>
 <ThreadID>%d</ThreadID>
 <UserTimeStart>%I64d</UserTimeStart>       ----- user time at the start of the capture

 <UserTimeEnd>%I64d</UserTimeEnd>           ----- user time at the end of the capture   (DIFF of them to see activity during the reporting period)
 <KernelTimeStart>%I64d</KernelTimeStart>
 <KernelTimeEnd>%I64d</KernelTimeEnd>
</NonYieldThreadTable>

If this still does not reveal the cause the mini-dump taken may contain additional information that the Microsoft SQL Server Support team can extract. For example the following data may be available from the time of the non-yield condition.

\PhysicalDisk(_Total)\Avg. Disk Queue Length
\PhysicalDisk(_Total)\% Disk Time
\Memory\Pages/sec
\Memory\Pages Output/sec
\Memory\Pages Input/sec
\Memory\Available Bytes
\Paging File(_Total)\% Usage

Bob Dorr - Principal SQL Server Escalation Engineer