Non-yielding IOCP Listener - Stack Dump analysis

 

Here's something I was looking at yesterday which is worth a quick look. It's quite a common scenario and one that's relatively easy to address. Your SQL Server 2005 instance suddenly throws the following to the error log (apparently without warning):

2008-09-09 08:30:44.810 Server Using 'dbghelp.dll' version '4.0.5'
2008-09-09 08:30:44.960 Server **Dump thread - spid = 0, PSS = 0x0000000000000000, EC = 0x0000000000000000
2008-09-09 08:30:45.000 Server ***Stack Dump being sent to F:\MSSQL.1\MSSQL\LOG\SQLDump0001.txt
2008-09-09 08:30:45.000 Server * *******************************************************************************
2008-09-09 08:30:45.010 Server *
2008-09-09 08:30:45.010 Server * BEGIN STACK DUMP:
2008-09-09 08:30:45.010 Server * 09/09/08 08:30:45 spid 0
2008-09-09 08:30:45.010 Server *
2008-09-09 08:30:45.010 Server * Non-yielding IOCP Listener
2008-09-09 08:30:45.010 Server *
2008-09-09 08:30:45.010 Server * *******************************************************************************
2008-09-09 08:30:45.010 Server * -------------------------------------------------------------------------------
2008-09-09 08:30:45.010 Server * Short Stack Dump
2008-09-09 08:30:45.320 Server Stack Signature for the dump is 0x000000000000020B
2008-09-09 04:47:06.180 spid3s LazyWriter: warning, no free buffers found.                                      

Following this, several hundred lines of output in a format similar to the following are shown:

2008-09-09 08:31:08.750 spid3s
Memory Manager
VM Reserved = 16963064 KB
VM Committed = 16942872 KB
AWE Allocated = 0 KB
Reserved Memory = 1024 KB
Reserved Memory In Use = 0 KB
2008-09-09 08:31:08.750 spid3s
Memory node Id = 0
VM Reserved = 16957304 KB
VM Committed = 16937200 KB
AWE Allocated = 0 KB
SinglePage Allocator = 2519016 KB
MultiPage Allocator = 39632 KB
2008-09-09 08:31:08.750 spid3s
MEMORYCLERK_SQLGENERAL (Total)
VM Reserved = 0 KB
VM Committed = 0 KB
AWE Allocated = 0 KB
SM Reserved = 0 KB
SM Committed = 0 KB
SinglePage Allocator = 48576 KB
MultiPage Allocator = 5136 KB

You may well recognize that this latter output is basically the same as you would get if you run:
DBCC MEMORYSTATUS

You'll also have noted that a small dump file has been produced in the default error log directory of this SQL instance (the path of which is noted in the error log output above).

In normal circumstances you can submit this dump file to us in support, and we can take a look at it and, all being well, tell you what the problem is. The thing to remember here is that the production of a dump file does not necessarily mean you have encountered a bug in SQL Server (as some people believe). In more recent versions of SQL Server there are numerous conditions, well known to us, which the engine detects and handles by producing a dump file automatically.

This particular scenario was caused by an inappropriate SQL Server memory configuration and is easily rectified, so I thought it a good example to walk through.

Although there is normally little that can be gleaned from debugging a SQL Server dump in the public domain, this one does contain a few hints. We don't publish private symbols for SQL Server, meaning that the amount you can do yourself with a dump file is quite limited (since you only have access to retail symbols); however, you can look at the thread stacks and see whether anything of note stands out. Anyone who has read Ken Henderson's fantastic book about SQL Server internals will have used these techniques before. Sometimes it's just interesting to look a little deeper, after all.

I'll assume that you're already familiar with installing and configuring the debugger, but if you're not, here's a good place to start: Debugging Tools and Symbols: Getting Started. This example just uses the retail symbol server address from that page:

SRV*c:\websymbols*https://msdl.microsoft.com/download/symbols
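
If you prefer to set this up from inside the debugger session rather than via the _NT_SYMBOL_PATH environment variable, the equivalent WinDbg commands (a quick sketch, assuming the same c:\websymbols cache folder as above) are:

.sympath SRV*c:\websymbols*https://msdl.microsoft.com/download/symbols
.reload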

If you open this dump, you'll find yourself starting with the context set to the thread that generated the dump, in this case:

0:004> kn
# Child-SP RetAddr Call Site
00 00000000`0474de58 00000000`77d704bf ntdll!NtWaitForSingleObject+0xa
01 00000000`0474de60 00000000`016be0f9 kernel32!WaitForSingleObjectEx+0x130
02 00000000`0474df00 00000000`016bdaeb sqlservr!CDmpDump::DumpInternal+0x4a3
03 00000000`0474dff0 00000000`01c7c7b3 sqlservr!CDmpDump::Dump+0x3b
04 00000000`0474e040 00000000`01d3af67 sqlservr!CImageHelper::DoMiniDump+0x413
05 00000000`0474e1a0 00000000`025f8cf5 sqlservr!stackTrace+0x6e7
06 00000000`0474f6d0 00000000`01a3b025 sqlservr!SQL_SOSNonYieldIOCPCallback+0x1ef
07 00000000`0476f7e0 00000000`018662b6 sqlservr!SOS_OS::ExecuteNonYieldIOCPCallbacks+0x44
08 00000000`0476f830 00000000`010b9628 sqlservr!alloca_probe+0x220815
09 00000000`0476f990 00000000`029002c9 sqlservr!SchedulerMonitor::Run+0x164
0a 00000000`0476fac0 00000000`010079fe sqlservr!SchedulerMonitor::EntryPoint+0x9
0b 00000000`0476faf0 00000000`01008362 sqlservr!SOS_Task::Param::Execute+0xee
0c 00000000`0476fc00 00000000`01007da1 sqlservr!SOS_Scheduler::RunTask+0xc9
0d 00000000`0476fc90 00000000`014787a7 sqlservr!SOS_Scheduler::ProcessTasks+0xb4
0e 00000000`0476fd00 00000000`014789ca sqlservr!SchedulerManager::WorkerEntryPoint+0xe7
0f 00000000`0476fda0 00000000`01478911 sqlservr!SystemThread::RunWorker+0x59
10 00000000`0476fde0 00000000`014795d9 sqlservr!SystemThreadDispatcher::ProcessWorker+0x130
11 00000000`0476fe80 00000000`781337a7 sqlservr!SchedulerManager::ThreadEntryPoint+0x128
12 00000000`0476ff20 00000000`78133864 msvcr80!_callthreadstartex+0x17
13 00000000`0476ff50 00000000`77d6b6da msvcr80!_threadstartex+0x84

This is the thread that is actually doing the monitoring and generating the dump. You can see in frame 02 the name of the function doing this. If you had the private symbols at this point, you could look through some of the earlier frames and determine which thread caused this condition to occur, but I can't show that here :-)

The thread in question here was actually thread 8; we can switch to it and take a look.
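
In WinDbg the switch itself is just the thread-selector command; after running it the prompt changes to 0:008>, as in the output below:

~8 s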

0:008> kn 50
# Child-SP RetAddr Call Site
00 00000000`04f6f008 00000000`01750bd0 sqlservr!NumaNode::RampUp
01 00000000`04f6f010 00000000`012f1523 sqlservr!alloca_probe+0xbf420
02 00000000`04f6f110 00000000`0103f136 sqlservr!BPool::Steal+0x3ec
03 00000000`04f6f1c0 00000000`0103f32e sqlservr!SQLSinglePageAllocator::AllocatePages+0x25
04 00000000`04f6f200 00000000`0103f1ce sqlservr!MemoryNode::AllocatePagesInternal+0xce
05 00000000`04f6f4a0 00000000`0103f7ba sqlservr!MemoryClerkInternal::AllocatePages+0x6e
06 00000000`04f6f500 00000000`010d4bb4 sqlservr!CMemThread::PbGetNewPages+0x6a
07 00000000`04f6f540 00000000`01068bba sqlservr!CFixedPageMgr::PbAllocate+0xcd
08 00000000`04f6f5b0 00000000`01004d37 sqlservr!CMemThread::Alloc+0x1d7
09 00000000`04f6f640 00000000`0147dc3a sqlservr!operator new+0x26
0a 00000000`04f6f680 00000000`01045589 sqlservr!SOS_Node::EnqueueTaskDirectInternal+0x587
0b 00000000`04f6f730 00000000`01045b03 sqlservr!SOS_Node::EnqueueTask+0x49
0c 00000000`04f6f780 00000000`01046581 sqlservr!CNetConnection::EnqueuePacket+0x8f
0d 00000000`04f6f890 00000000`01045cab sqlservr!TDSSNIClient::ReadHandler+0x171
0e 00000000`04f6f900 00000000`010459bc sqlservr!SNIReadDone+0xb7
0f 00000000`04f6f960 00000000`010079fe sqlservr!SOS_Node::ListenOnIOCompletionPort+0x251
10 00000000`04f6faf0 00000000`01008362 sqlservr!SOS_Task::Param::Execute+0xee
11 00000000`04f6fc00 00000000`01007da1 sqlservr!SOS_Scheduler::RunTask+0xc9
12 00000000`04f6fc90 00000000`014787a7 sqlservr!SOS_Scheduler::ProcessTasks+0xb4
13 00000000`04f6fd00 00000000`014789ca sqlservr!SchedulerManager::WorkerEntryPoint+0xe7
14 00000000`04f6fda0 00000000`01478911 sqlservr!SystemThread::RunWorker+0x59
15 00000000`04f6fde0 00000000`014795d9 sqlservr!SystemThreadDispatcher::ProcessWorker+0x130
16 00000000`04f6fe80 00000000`781337a7 sqlservr!SchedulerManager::ThreadEntryPoint+0x128
17 00000000`04f6ff20 00000000`78133864 msvcr80!_callthreadstartex+0x17
18 00000000`04f6ff50 00000000`77d6b6da msvcr80!_threadstartex+0x84
19 00000000`04f6ff80 00000000`00000000 kernel32!BaseThreadStart+0x3a

So again, just by looking at the function names, we can make a reasonable guess that internal memory routines related to the buffer pool are running here, particularly from frames 02 and 03. If you ever attach a debugger to a running SQL Server instance (obviously never in production!) you'd see functions like these occurring regularly as SQL Server manages the buffer pool pages.

There's not much more that can be seen from this stack, so we'll take a look through the rest of the threads, in this case by running:

~* kn

When browsing through these threads, we notice many that also appear to be involved in memory-related operations, just one example of which is:

  53 Id: 554.a6c Suspend: 1 Teb: 000007ff`fffd9000 Unfrozen
# Child-SP RetAddr Call Site
00 00000000`1521e140 00000000`012f1523 sqlservr!BPool::ReplenishFreeList+0x10d
01 00000000`1521e240 00000000`0103f136 sqlservr!BPool::Steal+0x3ec
02 00000000`1521e2f0 00000000`0103f32e sqlservr!SQLSinglePageAllocator::AllocatePages+0x25
03 00000000`1521e330 00000000`0103f1ce sqlservr!MemoryNode::AllocatePagesInternal+0xce
04 00000000`1521e5d0 00000000`0103f676 sqlservr!MemoryClerkInternal::AllocatePages+0x6e
05 00000000`1521e630 00000000`0103f737 sqlservr!CVarPageMgr::PviNewVarPage+0x56
06 00000000`1521e6a0 00000000`0100ca58 sqlservr!CVarPageMgr::PbAllocate+0x3ce
07 00000000`1521e710 00000000`01004d37 sqlservr!CMemObj::Alloc+0xcc
08 00000000`1521e780 00000000`0109b203 sqlservr!operator new+0x26
09 00000000`1521e7c0 00000000`014607a1 sqlservr!operator new[]+0x23
0a 00000000`1521e800 00000000`0145d4c8 sqlservr!CErrorReportingManager::CwchFormatToBuffer+0x180
0b 00000000`1521e970 00000000`0128d202 sqlservr!CErrorReportingManager::CwchCallFormatToBuffer+0x18
0c 00000000`1521e9b0 00000000`0128d0f8 sqlservr!CLoginClientInfo::CLoginClientInfo+0x71
0d 00000000`1521ea30 00000000`01285265 sqlservr!IssueLoginSuccessReport+0x67
0e 00000000`1521eb40 00000000`0128494f sqlservr!login+0x584
0f 00000000`1521f480 00000000`01285610 sqlservr!process_login+0xee
10 00000000`1521f840 00000000`010079fe sqlservr!process_commands+0x40d
11 00000000`1521faf0 00000000`01008362 sqlservr!SOS_Task::Param::Execute+0xee
12 00000000`1521fc00 00000000`01007da1 sqlservr!SOS_Scheduler::RunTask+0xc9
13 00000000`1521fc90 00000000`014787a7 sqlservr!SOS_Scheduler::ProcessTasks+0xb4

We note that sqlservr!BPool::ReplenishFreeList is common to a lot of the threads, and other BPool functions crop up frequently as well.
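
Incidentally, rather than scanning the ~* kn output by eye, a quicker way (assuming the standard Debugging Tools extensions, which are loaded by default) is the !findstack command, which lists the threads whose stacks contain a particular symbol:

!findstack sqlservr!BPool::ReplenishFreeList 0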

At this point there is not much else to be gleaned from the dump without private symbols; however, it's clear that a significant number of threads are waiting in what appear to be internal memory routines. In this case, we also have the fact that SQL Server itself has concluded that memory allocations were the cause of the problem, which is why it has dumped out the DBCC MEMORYSTATUS output to help us. The error log specifically states that it has "no free buffers".

If you were to look through the whole of the DBCC MEMORYSTATUS output and decode some of the later items using this article as a guide

https://support.microsoft.com/kb/907877

then you would note that the buffer pool has recently grown, and that it is issuing a notification that it wishes to grow further.
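
If you want to see the same kind of pressure on a live system without decoding DBCC MEMORYSTATUS by hand, one rough check (a sketch, not part of the original dump analysis) is to compare the buffer pool's target and total memory counters; a target well above the total means the buffer pool is still trying to grow:

SELECT counter_name, cntr_value AS value_kb
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Memory Manager%'
  AND (counter_name LIKE 'Target Server Memory%'
    OR counter_name LIKE 'Total Server Memory%');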

So the question is: what next? You've noted that SQL Server appears to be having memory management problems, so maybe you think you should add some more memory to alleviate this? Well, in a word - "NO".

The answer to this problem is that we first have to look at the configuration of the hardware and the configuration of SQL Server (which is always a good idea in any troubleshooting scenario).

What we saw here at the time of the dump was:

MemoryLoad = 95%                    
Total Physical = 16382 MB           
Available Physical = 792 MB    

And in fact, over the minutes preceding the dump we saw even lower memory availability than this, with physical memory utilization up to 98% (this is something we can extract from the dump with private symbols).

So we do appear to be short of physical memory, but when we run sp_configure on the server, we see this:

name                                minimum     maximum     config_value run_value
----------------------------------- ----------- ----------- ------------ -----------
max server memory (MB)                       16  2147483647   2147483647  2147483647
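
For reference, 'max server memory (MB)' is an advanced option, so if you want to reproduce this output yourself you need to enable 'show advanced options' first:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)';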

Therefore the max server memory setting has been left at its default, i.e. it has not been set. Although SQL Server will attempt to manage its memory dynamically, in large-memory installations where it runs uncapped it can end up fighting with other processes or the operating system for memory, and situations like this one can occur. This is why we recommend setting "max server memory" in this type of scenario, to limit the size of the buffer pool to an amount lower than the total memory on the server itself. The most referenced article on this is probably here:

https://blogs.msdn.com/slavao/archive/2006/11/13/q-a-does-sql-server-always-respond-to-memory-pressure.aspx

We see numerous cases revolving around this scenario and derivatives of it, and many can simply be avoided by limiting the memory of the SQL Server buffer pool with this setting, so that you never get into this memory-bound state in the first place. So the bottom line here is to change the max server memory setting to something around the 13000 MB mark, then monitor your memory usage to see whether you're still encountering this type of problem, and make further adjustments if necessary.
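
As a sketch of that fix on this particular 16 GB server (the 13000 figure is just the guideline above; tune it to your own environment and keep monitoring afterwards):

-- requires 'show advanced options' = 1, as enabled in the earlier snippet
EXEC sp_configure 'max server memory (MB)', 13000;
RECONFIGURE;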

If you have any questions surrounding this topic, feel free to post them.