SQL High CPU scenario troubleshooting using sys.dm_exec_query_stats and RING_BUFFER_SCHEDULER_MONITOR ring buffer in sys.dm_os_ring_buffers

I ran into a scenario a while ago that I thought I would share, where we were troubleshooting high CPU on  SQL Server.  The sporadic High CPU ( 90+ % )  pretty much brought the box down to its knees for a period of 5-10 minutes minutes or so, and then cleared itself up. We determined from performance monitor that SQL was the one consuming all of that CPU. Initially I tried to draw a correlation to times that it was happening, but wasn’t able to as the times were random. My initial thought was to capture a profiler and get the culprit and  tune the query in question. In this case though even running a profiler capturing batch level events was pretty expensive as it was a very chatty server and  we had about 350 MB trace files created every minute. Given that we did not know the timeframe the problem would occur, and given the problem period was pretty small running a profiler continually to try to capture an occurrence of the problem was not an option.

We then opted to try the Performance Dashboard for SQL 2005 which is an awesome tool (downloadable from the SQL 2005 Feature Pack), but the primary problem we were facing was that we did not know when the problem would occur, and when we did, by the time we got on any tools the problem would clear itself out. It was random and short lived.

The other thing we knew was that there weren't any long running statements per say as we tried to run some filters to capture statements that took longer than some amount of CPU time and did not come up with much, so this was more the effect of multiple executions of a particular statement during that timeframe than a single execution of a statement.

I reverted to the DMV’s in SQL 2005 to help collect the necessary data and was able to solve the problem just using the DMV's itself . Given that I did not know when the problem could occur and I did not want to capture the data all the time,  I had a script to conditionally capture data based on the CPU information in the Scheduler Monitor ring buffer.  I used the sys.dm_os_ring_buffers DMV and specifically the records of type 'RING_BUFFER_SCHEDULER_MONITOR' which  are captured once every 60 seconds and give you the CPU utilization from the SQL process perspective besides other information. Using the script below, I was able to see if the last 2 ring buffer records had CPU time greater than a particular threshold and if so, I captured the top statements and plans in a table and database that I created. Capturing the DMV information such as this was much less impactful to the system and could be kept on for hours/days unlike profiler in this case.

 

The script below should work for both SQL 2005 and SQL 2008.

-- This script runs in an infinite loop every 30 seconds ( can be invoked with SQLCMD or management Studio)

-- @SQLCPUThreshold_Percent specifies what CPU threshold we are monitoring

-- If for 2 successive Ring buffer snapshots ( 2 minutes), the CPU is above the threshold we specified,

-- We then collect the top 25 Statements along with their plans in the Database and table we created.

-- Database/Table creation script not included.

Declare @SQLCPUThrehold_Percent int

Set @SQLCPUThrehold_Percent = 75

WHILE (1 = 1)

BEGIN

      SELECT TOP 2

      CONVERT (varchar(30), getdate(), 126) AS runtime,

                record.value('(Record/@id)[1]', 'int') AS record_id,

                record.value('(Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') AS system_idle_cpu,

                record.value('(Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') AS sql_cpu_utilization

      into #tempCPU

      FROM sys.dm_os_sys_info inf CROSS JOIN (

      SELECT timestamp, CONVERT (xml, record) AS record

      FROM sys.dm_os_ring_buffers

      WHERE ring_buffer_type = 'RING_BUFFER_SCHEDULER_MONITOR'

      AND record LIKE '%<SystemHealth>%') AS t

      ORDER BY record.value('(Record/@id)[1]', 'int') DESC

 

-- If the IF statement is satisfied, last 2 Ring buffer records had CPU > threshold so we capture the plans

if ( (select count(*) from #tempCPU) = (select count(*) from #tempCPU where sql_cpu_utilization >@SQLCPUThrehold_Percent))

begin

-- insert top 25 Statements and plans by CPU into the table

 

      insert into Tbl_troubleshootingPlans

      SELECT TOP 25 getdate() as runtime,

                        qs.Execution_count as Executions,

                        qs.total_worker_time as TotalCPU,

                        qs.total_physical_reads as PhysicalReads,

                        qs.total_logical_reads as LogicalReads,

                        qs.total_logical_writes as LogicalWrites,

                        qs.total_elapsed_time as Duration,

  qs.total_worker_time/qs.execution_count as [Avg CPU Time],

                                    substring (qt.text,qs.statement_start_offset/2,(case when qs.statement_end_offset = -1 then len (convert (nvarchar(max), qt.text)) * 2

                                    else qs.statement_end_offset end - qs.statement_start_offset)/2) as query_text,

                        qt.dbid as DBID,

                        qt.objectid as OBJECT_ID,

                                    cast ( query_plan as xml) as XMLPlan

      FROM sys.dm_exec_query_stats qs

      cross apply sys.dm_exec_sql_text(qs.sql_handle) as qt

      CROSS APPLY sys.dm_exec_query_plan(plan_handle)

      ORDER BY TotalCPU DESC

 

end

drop table #tempCPU

waitfor delay '0:00:30'

end

go

 

Once we got that output data from the script in question , we were able to figure out the TOP CPU statement in the problem time with a query such as  the one below substituting the correct times. The time consumed by this one statement was significantly larger than the others during that period and had to be our suspect as during “good” run times, that statement never made it to the top CPU list.

select runtime,Executions,TotalCPU,LogicalReads,Duration,[Avg CPU Time], query_text from tbl_TroubleshootingPlans

where runtime between '2008-11-03 06:00:00.530' and '2008-11-03 06:06:00.530'

order by TotalCPU desc

Note the times in sys.dm_exec_query_stats are generally in Microseconds ( https://msdn.microsoft.com/en-us/library/ms189741(SQL.90).aspx )

If you notice here, the average CPU time was 165 ms in a "bad" run, and on a good run was more like 5 ms so individually you would never see the statement on the radar unless you look at cumulative executions.

Executions

TotalCPU

LogicalReads

Duration

Avg CPU Time ( Microseconds)

DBID

ObjectID

query_text

4333

719013385

661450233

10485137538

165938

6

750625717

select s.SipNotifyCseq     as SipNotifyCseq,   ………..

And then were able to also get the Plan in question from the XML Plan that we saved in our table ( which in 2005 can be saved as a .sqlplan file and opened with Management Studio to get the graphical showplan).

The issue ended up being a Parameter sniffing issue (Parameter Sniffing: https://blogs.msdn.com/queryoptteam/archive/2006/03/31/565991.aspx )

I was able to reproduce the good and bad plan without even having the full database backup just using the Clone database ( https://blogs.msdn.com/psssql/archive/2009/07/08/attach-of-the-clone-databases.aspx ). Also from the XML Plan, we were able to get  the list of compile parameters  and from the statistics we could see that there was a data skew in the table.

Bad Plan

Now this particular plan was compiled for a particular set of parameters and was the right plan for those parameters ( aka @Authz = NULL)  . However due to a data skew that existed, the highlighted index scan was detrimental to performance if executed with any a Non-NULL @AuthZ parameter. In fact the join order in the plan changed as well.

 

|--Compute Scalar(DEFINE:([Expr1009]=(1), [Expr1010]=NULL, [Expr1011]=datediff(second,[@CurrentTime],[RTC].[dbo].[BatchSubParent].[ExpiresAt]), [Expr1012]=NULL, [Expr1013]=(1)))
|--Sort(ORDER BY:([f].[Fqdn] ASC))
|--Parallelism(Gather Streams)
|--Nested Loops(Inner Join, OUTER REFERENCES:([e].[FrontEndId]))
|--Nested Loops(Inner Join, OUTER REFERENCES:([RTC].[dbo].[BatchSubParent].[SubscriberId], [RTC].[dbo].[BatchSubParent].[Epid]))
| |--Nested Loops(Inner Join, OUTER REFERENCES:([RTC].[dbo].[BatchSubParent].[BatchId]))
| | |--Filter(WHERE:([RTC].[dbo].[BatchSubParent].[ExpiresAt]>[@CurrentTime])
                      | | | |--Nested Loops(Inner Join, OUTER REFERENCES:([RTC].[dbo].[BatchSubParent].[BatchId])
| | | |--Index Scan(OBJECT:([RTC].[dbo].[BatchSubParent].[UQ_BatchSubParent]) WHERE:([RTC].[dbo].[ BatchSubParent].[EventId] as [sp].[EventId] = @EventID)
| | | |--Clustered Index Seek(OBJECT:([RTC].[dbo].[BatchSubParent].[PK_BatchSubParent]), SEEK:([RTC].[dbo].[BatchSubParent].[BatchId]=[RTC].[dbo].[BatchSubParent].[BatchId]) LOOKUP ORDERED FORWARD) | | |--Clustered Index Seek(OBJECT:([RTC].[dbo].[BatchSubChild].[PK_BatchSubChild] AS [sc]), SEEK:([sc].[SubscribedId]=(24915) AND [sc].[BatchId]=[RTC].[dbo].[BatchSubParent].[BatchId]), WHERE:([@_AuthZ] IS NULL OR [RTC].[dbo].[BatchSubChild].[AuthZ] as [sc].[AuthZ]=[@_AuthZ) ORDERED FORWARD)
| |--Clustered Index Seek(OBJECT:([RTC].[dbo].[Endpoint].[PK_Endpoint] AS [e]), SEEK:([e].[OwnerId]=[RTC].[dbo].[BatchSubParent].[SubscriberId] AND [e].[Epid]=[RTC].[dbo].[BatchSubParent].[Epid]), WHERE:([RTC].[dbo].[Endpoint].[ExpiresAt] as [e].[ExpiresAt]>[@InvalidEndpointTime]) ORDERED FORWARD)
|--Clustered Index Seek(OBJECT:([RTC].[dbo].[FrontEnd].[PK_FrontEnd] AS [f]), SEEK:([f].[FrontEndId]=[RTC].[dbo].[Endpoint].[FrontEndId] as [e].[FrontEndId]) ORDERED FORWARD)

 

Bad Plan Compile Parameters:

<ParameterList>

             <ColumnReference Column="@_AuthZ" ParameterCompiledValue="NULL" />

              <ColumnReference Column="@_EventId" ParameterCompiledValue="(1)" />

              <ColumnReference Column="@_SubscribedId" ParameterCompiledValue="(14467)" />

</ParameterList>

Good Plan

We also captured a “good” plan when things were fine for this statement in question and as you see the plan is different and we do a seek instead of a scan. Also examining the parameters that the plan was compiled with you clearly see that @Authz is not a NULL.

|--Sort(ORDER BY:([f].[Fqdn] ASC))
|--Nested Loops(Inner Join, OUTER REFERENCES:([e].[FrontEndId]))
|--Nested Loops(Inner Join, OUTER REFERENCES:([sp].[SubscriberId], [sp].[Epid]))
                 | |--Nested Loops(Inner Join, OUTER REFERENCES:([sc].[BatchId]))
| | | --Clustered Index Seek(OBJECT:([RTC].[dbo].[BatchSubChild].[PK_BatchSubChild] AS [sc]), SEEK:([sc].[SubscribedId]=(24915)), WHERE:([RTC].[dbo].[BatchSubChild].[AuthZ] as [sc].[AuthZ]='A') ORDERED FORWARD)
| | | --Clustered Index Seek(OBJECT:([RTC].[dbo].[BatchSubParent].[PK_BatchSubParent] AS [sp]), SEEK:([sp].[BatchId]=[RTC].[dbo].[BatchSubChild].[BatchId] as [sc].[BatchId]),   WHERE:([RTC].[dbo].[BatchSubParent].[ExpiresAt] as [sp].[ExpiresAt]>[@CurrentTime] AND [RTC].[dbo].[BatchSubParent].[EventId] as [sp].[EventId]=(0)) ORDERED FORWARD)
                | |--Clustered Index Seek(OBJECT:([RTC].[dbo].[Endpoint].[PK_Endpoint] AS [e]), SEEK:([e].[OwnerId]=[RTC].[dbo].[BatchSubParent].[SubscriberId] as [sp].[SubscriberId] AND [e].[Epid]=[RTC].[dbo].[BatchSubParent].[Epid] as [sp].[Epid]), WHERE:([RTC].[dbo].[Endpoint].[ExpiresAt] as [e].[ExpiresAt]>[@InvalidEndpointTime]) ORDERED FORWARD)
|--Clustered Index Seek(OBJECT:([RTC].[dbo].[FrontEnd].[PK_FrontEnd] AS [f]), SEEK:([f].[FrontEndId]=[RTC].[dbo].[Endpoint].[FrontEndId] as [e].[FrontEndId]) ORDERED FORWARD)

 

Good Plan Compile Parameters:

<ParameterList>

              <ColumnReference Column="@_AuthZ" ParameterCompiledValue="'A'" />

              <ColumnReference Column="@_EventId" ParameterCompiledValue="(0)" />

              <ColumnReference Column="@_SubscribedId" ParameterCompiledValue="(24915)" />

</ParameterList>

    

The general solutions to parameter sniffing class of problems include

- Option Recompile  with or without an additional hint specifying typical parameters for compilation

- Dynamically executing the statement with Exec or sp_executesql

- Having Nested Sub procedures based on values of the parameters and/or logic in the proc

- Plan Guides

Plan Guides to the rescue

Given that the code of the procedure could not change as it was an application that did not support any schema changes, we had to find a way external to the procedure and try to get it to compile with its typical parameters. SQL 2005  and 2008  have a feature called Plan Guides where we can Freeze the plan of this statement to be the “good” plan that is based on hinting typical parameter values at compile time.

https://msdn2.microsoft.com/en-us/library/ms190417.aspx

https://msdn2.microsoft.com/en-us/library/ms189296.aspx

-- Create the Plan Guide

sp_create_plan_guide

@name = N'MyStoredProcName_Guide',

@stmt = N'select s.SipNotifyCseq     as SipNotifyCseq,

               s.SipCallId         as SipCallId,

……..', -- Statement within the proc causing high CPU

@type = N'OBJECT',

@module_or_batch = N'dbo.MyStoredProcName',

@params = NULL,

@hints = N'OPTION (OPTIMIZE FOR (@_SubscribedId=24915,@_EventId=0,@_AuthZ=''A''))'

Of course longer term that stored procedure in question would have to be changed to account for typical parameters, but in the interim, we got the system back on its feet!

 

-Denzil Ribeiro, SQL Dedicated Premier Field Engineer