Realistic troubleshooting example of extended events (XEvent) usage in SQL Server 2008 – Part 1

Here’s an example of how to collect and consume extended events in SQL Server 2008. It’s something that we in CSS are starting to use more and more frequently with the uptake of SQL 2008, and I believe it’s something that everyone who troubleshoots SQL can gain benefit from.

Here’s a relatively simple example which is based upon real world experiences of looking at this bug in Connect, which I was looking at recently. The main thing I want to show here is how to review the results of collection yourself, as obviously the setup of the actual event collection is already described on the connect site.

The scenario is relatively simple, and a typical one for extended event usage. You have a process which produces an arithmetic overflow at an unknown point. In the example of this bug, it’s within the SSIS package managing data upload for the performance data warehouse. It could be any large process application or tool though, which would prove more difficult to trace with something like profiler or any of the other troubleshooting tools.

From the connect site, a colleague in the product group has asked the reporting customer to configure and run the following event session:

 -- XEvent Session Definition Arithmetic error, Log additional info
  
 -- Sample Query: select convert(bigint,21568194 * 100)
 if exists(select * from sys.server_event_sessions where name='ArithmeticError')
      DROP EVENT SESSION ArithmeticError ON SERVER
 go
  
 CREATE EVENT SESSION ArithmeticError ON SERVER
 ADD EVENT sqlos.exception_ring_buffer_recorded (
 ACTION (sqlserver.tsql_stack, sqlserver.sql_text)
                  WHERE (error = 8115 ))
 ADD TARGET package0.ring_buffer,
 ADD TARGET package0.asynchronous_file_target(
 SET filename='c:\temp\ArithmeticErrors.xet', metadatafile='c:\temp\ArithmeticErrors.xem')
 WITH (MAX_MEMORY = 4096KB, 
 EVENT_RETENTION_MODE = ALLOW_MULTIPLE_EVENT_LOSS, 
 MAX_DISPATCH_LATENCY = 1 SECONDS, 
 MAX_EVENT_SIZE = 0KB, 
 MEMORY_PARTITION_MODE = NONE, 
 TRACK_CAUSALITY = OFF, 
 STARTUP_STATE = OFF)
  
 go
  
 ALTER EVENT SESSION ArithmeticError ON SERVER
 state=start
 go
  
 -- Cleanup the session 
 -- drop EVENT SESSION ArithmeticError ON SERVER
  
  

You can run this code as is on a test server to try out the demo. You’ll need to ensure that you have a c:\temp directory for the results to output to a file. There are several different ways that you can consume or record the output from an extended event, but here we are writing to a file as said files can then easily be uploaded to us here at Microsoft. You can read about some of the other alternatives in the Books Online link above if you want.

Once you have the event set up, you can trigger the 8115 error, by un-commenting the “sample query” line at the top of the script. When you run it in SSMS, you should encounter the following error:

Msg 8115, Level 16, State 2, Line 1

Arithmetic overflow error converting expression to data type int.

If you then look in the c:\temp directory, you ought to notice the files the session is writing to have been created, however at this stage, the .xet file (which actually holds the event data as opposed to the xem file which holds the event meta data required to decode the event itself) is actually empty. This is because the session is still active and the data has not been flushed to the file. If you run the following to stop (but not drop) the event

  
 ALTER EVENT SESSION ArithmeticError ON SERVER
 state=stop
 go

Then the data will be written to the file. You’ve now capture enough data for the purposes of this test, so you can drop the event session, the syntax for which is

 drop EVENT SESSION ArithmeticError ON SERVER

So the question then comes, what to do with the files. I have to admit it took me a while to work this out, as it wasn’t obviously documented. I also needed the help of my good friend and pool partner, Spike, to write some XQuery in a nice way, as I don’t really like XQuery. The function is documented in Books Online though to be fair (although not the XQuery!)

There are 2 different ways you can go about reading this type of file in my opinion. Firstly you can just run the following query in SSMS and you’ll get an XML column in the results pain, if you output to grid. You can then just click on the column and you’ll get to look at the XML document in a new SSMS window. Quick and undeniably dirty yes, but it works quite effectively for a simple scenario like this:

 -- straight select, view the XML manually
 select CAST(event_data as XML) , * 
 from sys.fn_xe_file_target_read_file
 (
 'C:\temp\ArithmeticErrors_0_128981030796910000.xet', 
 'C:\temp\ArithmeticErrors_0_128981030796920000.xem', 
 null, 
 null
 )

Note that the actual file names have been modified automatically by the event engine to guarantee uniqueness. When you run this, you should get an output grid which looks like this:

image

If you then click on the first XML column, you’ll open up a new tab with the XML document for viewing, which will look something like this:

 <event name="exception_ring_buffer_recorded" package="sqlos" id="21" version="1" timestamp="2009-09-22T14:33:06.537Z">
   <data name="id">
     <value>0</value>
     <text />
   </data>
   <data name="timestamp">
     <value>0</value>
     <text />
   </data>
   <data name="task_address">
     <value>0x00000000000ffdc8</value>
     <text />
   </data>
   <data name="error">
     <value>8115</value>
     <text />
   </data>
   <data name="severity">
     <value>16</value>
     <text />
   </data>
   <data name="state">
     <value>2</value>
     <text />
   </data>
   <data name="user_defined">
     <value>false</value>
     <text />
   </data>
   <data name="call_stack">
     <value />
     <text />
   </data>
   <action name="tsql_stack" package="sqlserver">
     <value>&lt;frame level='1' handle='0x01000100FC030E3850E97786000000000000000000000000' line='1' offsetStart='0' offsetEnd='0'/&gt;</value>
     <text />
   </action>
   <action name="sql_text" package="sqlserver">
     <value>select convert(bigint,21568194 * 100)</value>
     <text />
   </action>
 </event>

Most of this XML ought to be self explanatory. The bit that is most interesting to us here is the “sql_text” section which shows the actual text being executed. In a real world scenario where we didn’t know what code was causing our error, we’ve now tracked it down.

The tsql_stack section is also useful. You can pass the handle you get back from this section into a DMV to get further details about the object or statement being executed (if it’s part of a stored procedure for example). A typical syntax for this would be

 SELECT * FROM sys.dm_exec_sql_text(0x01000100FC030E3850E97786000000000000000000000000)

However, you should note that in this very simple example, this will return nothing. This is because a plan has not been cached for our simple select statement that only does a mathematical  calculation. This is of course possible in other scenarios where the plan which was used at the time the event captured it, is no longer cached, due to it being aged out, or the server being re-started.

If you want to follow this example and be able to look at the sql_text plan handle, then you need to write an actual query which fetches data and has an arithmetic overflow. For example, simple procedure like this

 create procedure mybadproc
  
 as
  
 select *
 from sys.servers
  
 select convert(bigint,999999999 * 100)

will cache a plan when executed, and will be captured by the same event session we set up above. If you execute this, or other more complex statements / procedures, you can get back multiple items in the tsql_stack, each with separate handles which you can then pass into DMV based queries which accept plan handles. If you capture the execution of the above procedure in this event session, you’ll get back a slightly different XML document, where the tsql_stack has multiple values and the sql_text is now the name of the procedure that we executed. You can then pass any of the handles into DMV queries which accept such parameters.

 <action name="tsql_stack" package="sqlserver">
    <value>&lt;frame level='1' handle='0x03000D00E7A4787D959020018C9C00000100000000000000' line='8' offsetStart='132' offsetEnd='-1'/&gt;
           &lt;frame level='2' handle='0x01000D00D8E5380F60E77786000000000000000000000000' line='1' offsetStart='0' offsetEnd='-1'/&gt;</value>
    <text />
  </action>
  <action name="sql_text" package="sqlserver">
    <value>mybadproc</value>
    <text />
  </action>
 /event>

So to summarise part 1 of this post, we’ve looked at how to setup an extended event session and capture the output to a file. We’ve then seen a very simple way of looking at the output to manually identify a condition captured by our session. In the next part I’ll show some XQuery to review the output in more detail.