Issues with the system_health session in SQL Server 2012

The ever alert Jonathan Kehayias (Blog | Twitter) sent me a question recently asking about the Extended Event UI used for showing predicates. In particular, he was wondering about the predicate for the wait_info event that is defined in the system_health session and was wondering what was going on.

wait_info_pred

It’s obvious what peaked Jonathan’s interest – all those error icons seem like there is a problem with the predicate. This post will explain what’s happening, why it’s happening and also document a problem that I discovered while investigating this and for which I will give you a solution.

Bug or Artifact – You make the call

I’d actually call this an artifact because the rows that have error icons are caused because of a change made somewhere else in SQL Server that result in the Filter UI not being able to resolve the integer value (map_key) we specify in the predicate to the name of an actual wait (map_value). The reason that the map_key cannot be resolved is because of changes made within SQL Server to the enumeration of our wait types. These changes don’t impact you directly, but it turns out they do mess with how our UI displays this filter. This is a little complex, but I’ll attempt to give you the short version…

In order to save space we’ve created a special data type in extended events called a map, which allows us to use an integer to represent some known value. (OK, this probably isn’t really that complex for all you DBAs since you do it all the time with PK/FK relationships and lookup tables.) We store these mappings in dm_xe_map_values and the UI uses this DMV to resolve the map_value when it displays the filter. The wait_type field in the wait_info event is one of these map data types. Let’s take a simple example, we might write a filter for wait_info such as:

wait_type = 120

A quick check of dm_xe_map_values shows us that 120 is equivalent to the SOS_SCHEDULER_YIELD wait type, so the UI would display:

wait_type | equal_uint64 | SOS_SCHEDULER_YIELD

In the case of wait_type we get this mapping directly from within the SQL Server engine (this is true of many maps) so any changes to the list of wait_types are automatically represented in our map DMV. The reason you see blanks in the Value field in some cases is because the list of wait_types is not always contiguous; in other words, there are gaps in the map_key list because there are some numbers that have not been used yet. The change I referenced earlier, is that the unused regions used to be filled with place holder values (PADDING_##) but are now just missing in SQL Server 2012. You can convince yourself of this very easily by running the following query:

 SELECT TOP 50 * FROM sys.dm_xe_map_value

What you’ll see is the following:

map_keys

values

0 UNKNOWN
1-21 Lock waits
22-31 Not used
32-37 Latch waits
38-47 Not used
48-50 Pagelatch waits

In the system_health session we specify a filter that selects for the waits that we want to collect, but but instead of specifying each wait independently we specify ranges. So for example, to collect all the PAGELATCH* waits, we’d use this predicate:

 (wait_type > 47 and wait_type < 54)

But from the table earlier in the post we know that wait_type 47 is not used, so when the UI looks up the value for 47, it comes back empty, which it perceives to be an error and marks it as such. Logically, this in not an error because 48 > 47 whether there happens to be a 47 in the list or not. The predicate still works, it’s just that the UI doesn’t know how to display it.

You said there was an “issue”

Yes I did, and thank you for reminding me.

While I was investigating the UI artifact I noticed that there have been several additions to the wait_type list, and a few deletions, in SQL Server 2012 that has resulted in significant renumbering of the wait_types map. This has had the unfortunate result of changing the meaning of the filter that we specified for both wait_info and wait_info_external, to the point where system_health is not longer collecting the waits we consider important. This is definitely a bug, but one that we won’t be able to address until after RTM. Since we can’t get the fix in for RTM, I’m providing a script that will correct the system_health session here.

This script will do the following:

  • If the system_health session exists, it will remove the wait_info & wait_info_external events and replace them with the corrected filter. Any modifications you’ve made to these events in your system_health session will be lost.
  • If the system_health session does not exist, it will created it.
  • If the system_health session is not running, it will be started.

Download the fix for the system health session