I ran into a scenario a while ago that I thought I would share, where we were troubleshooting high CPU on SQL Server. Sporadic high CPU (90+%) pretty much brought the box to its knees for 5-10 minutes or so, and then cleared itself up. We determined from Performance Monitor that SQL Server was the one consuming all of that CPU. Initially I tried to draw a correlation to the times it was happening, but wasn't able to, as the times were random. My initial thought was to capture a Profiler trace, find the culprit, and tune the query in question. In this case, though, even a Profiler trace capturing batch-level events was pretty expensive: it was a very chatty server, and we had about 350 MB of trace files created every minute. Given that we did not know when the problem would occur, and given that the problem period was pretty small, running Profiler continually to try to catch an occurrence was not an option.
We then opted to try the Performance Dashboard for SQL Server 2005, which is an awesome tool (downloadable from the SQL Server 2005 Feature Pack), but the primary problem we were facing remained: we did not know when the problem would occur, and by the time we got onto any tools the problem had cleared itself out. It was random and short lived.
The other thing we knew was that there weren't any long-running statements per se: we tried filters to capture statements that took longer than some amount of CPU time and did not come up with much. So this was more the effect of multiple executions of a particular statement during that timeframe than of a single execution of a statement.
I reverted to the DMVs in SQL Server 2005 to help collect the necessary data, and was able to solve the problem using the DMVs alone. Given that I did not know when the problem could occur and did not want to capture data all the time, I wrote a script to conditionally capture data based on the CPU information in the scheduler monitor ring buffer. I used the sys.dm_os_ring_buffers DMV, specifically the records of type 'RING_BUFFER_SCHEDULER_MONITOR', which are captured once every 60 seconds and give you, among other information, the CPU utilization from the SQL Server process's perspective. Using the script below, I checked whether the last 2 ring buffer records had CPU above a particular threshold, and if so, captured the top statements and plans into a table and database that I created. Capturing DMV information this way was much less impactful to the system and could be kept running for hours or days, unlike Profiler in this case.
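Before leaving a collector running, it can help to eyeball the most recent scheduler monitor record directly. This is a one-off sketch against the same DMV and record type described above, returning the latest SQL process CPU utilization and system idle percentage:

```sql
-- One-off look at the most recent scheduler monitor snapshot
SELECT TOP 1
    record.value('(Record/@id)[1]', 'int') AS record_id,
    record.value('(Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') AS sql_cpu_utilization,
    record.value('(Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') AS system_idle_cpu
FROM (SELECT CONVERT (xml, record) AS record
      FROM sys.dm_os_ring_buffers
      WHERE ring_buffer_type = 'RING_BUFFER_SCHEDULER_MONITOR'
        AND record LIKE '%<SystemHealth>%') AS t
ORDER BY record.value('(Record/@id)[1]', 'int') DESC
```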
The script below should work for both SQL 2005 and SQL 2008.
-- This script runs in an infinite loop every 30 seconds (it can be invoked with SQLCMD or Management Studio).
-- @SQLCPUThreshold_Percent specifies the CPU threshold we are monitoring for.
-- If, for 2 successive ring buffer snapshots (2 minutes), the CPU is above the threshold we specified,
-- we collect the top 25 statements along with their plans into the database and table we created.
-- Database/table creation script not included.
DECLARE @SQLCPUThreshold_Percent int
SET @SQLCPUThreshold_Percent = 75

WHILE (1 = 1)
BEGIN
    SELECT TOP 2
        CONVERT (varchar(30), getdate(), 126) AS runtime,
        record.value('(Record/@id)[1]', 'int') AS record_id,
        record.value('(Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') AS system_idle_cpu,
        record.value('(Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') AS sql_cpu_utilization
    INTO #tempCPU
    FROM sys.dm_os_sys_info inf CROSS JOIN (
        SELECT timestamp, CONVERT (xml, record) AS record
        FROM sys.dm_os_ring_buffers
        WHERE ring_buffer_type = 'RING_BUFFER_SCHEDULER_MONITOR'
          AND record LIKE '%<SystemHealth>%') AS t
    ORDER BY record.value('(Record/@id)[1]', 'int') DESC

    -- If the IF condition is satisfied, the last 2 ring buffer records had CPU > threshold, so we capture the plans
    IF ((SELECT COUNT(*) FROM #tempCPU) =
        (SELECT COUNT(*) FROM #tempCPU WHERE sql_cpu_utilization > @SQLCPUThreshold_Percent))
    BEGIN
        -- insert the top 25 statements and plans by CPU into the table
        INSERT INTO Tbl_TroubleshootingPlans
        SELECT TOP 25
            getdate() AS runtime,
            qs.execution_count AS Executions,
            qs.total_worker_time AS TotalCPU,
            qs.total_physical_reads AS PhysicalReads,
            qs.total_logical_reads AS LogicalReads,
            qs.total_logical_writes AS LogicalWrites,
            qs.total_elapsed_time AS Duration,
            qs.total_worker_time / qs.execution_count AS [Avg CPU Time],
            SUBSTRING (qt.text, (qs.statement_start_offset / 2) + 1,
                ((CASE WHEN qs.statement_end_offset = -1
                       THEN DATALENGTH(qt.text)
                       ELSE qs.statement_end_offset
                  END - qs.statement_start_offset) / 2) + 1) AS query_text,
            qt.dbid AS DBID,
            qt.objectid AS OBJECT_ID,
            qp.query_plan AS XMLPlan
        FROM sys.dm_exec_query_stats qs
        CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
        CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
        ORDER BY TotalCPU DESC
    END

    DROP TABLE #tempCPU
    WAITFOR DELAY '0:00:30'
END
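The creation script for the destination table isn't included above. A minimal sketch matching the columns the INSERT produces might look like the following; the data types here are my assumptions, not part of the original script:

```sql
-- Hypothetical creation script for the capture table; column names match
-- the INSERT above, but the data types are assumptions
CREATE TABLE Tbl_TroubleshootingPlans (
    runtime        datetime,
    Executions     bigint,
    TotalCPU       bigint,        -- microseconds
    PhysicalReads  bigint,
    LogicalReads   bigint,
    LogicalWrites  bigint,
    Duration       bigint,        -- microseconds
    [Avg CPU Time] bigint,        -- microseconds
    query_text     nvarchar(max),
    DBID           smallint,
    OBJECT_ID      int,
    XMLPlan        xml
)
```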
Once we had the output data from the script, we were able to figure out the top CPU statement during the problem window with a query such as the one below, substituting the correct times. The time consumed by this one statement was significantly larger than the others during that period and had to be our suspect, as during "good" run times that statement never made it to the top CPU list.
SELECT runtime, Executions, TotalCPU, LogicalReads, Duration, [Avg CPU Time], query_text
FROM Tbl_TroubleshootingPlans
WHERE runtime BETWEEN '2008-11-03 06:00:00.530' AND '2008-11-03 06:06:00.530'
ORDER BY TotalCPU DESC
Note that the times in sys.dm_exec_query_stats are in microseconds (http://msdn.microsoft.com/en-us/library/ms189741(SQL.90).aspx).
Notice that the average CPU time was about 165 ms in a "bad" run versus more like 5 ms in a good run, so individually you would never see the statement on the radar unless you look at the cumulative executions.
[Chart: Avg CPU Time (microseconds) for the statement during "good" vs. "bad" runs]
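Since the stored values are in microseconds, dividing by 1000 gives milliseconds. A hypothetical variation of the analysis query above that reports the per-execution average in milliseconds:

```sql
-- Report average CPU per execution in milliseconds
-- ([Avg CPU Time] is stored in microseconds)
SELECT runtime, query_text,
       [Avg CPU Time] / 1000.0 AS avg_cpu_ms
FROM Tbl_TroubleshootingPlans
WHERE runtime BETWEEN '2008-11-03 06:00:00.530' AND '2008-11-03 06:06:00.530'
ORDER BY avg_cpu_ms DESC
```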