Normalizing query text

A conversation with a friend over email got me to thinking that this info might be useful to a few of you.

It’s common for SQL Server application developers to want to “normalize” the query text returned in a Profiler trace so that executions of the same query with different parameters can be easily compared with one another and aggregated together.  This makes it easier to track a query’s performance over time and to quantitatively measure the consequences of running it with different parameter values.

Normalizing the text of a query amounts to replacing the literals in it with parameter markers.  This flattens the parameter values out of the query so that the statistics (e.g., duration, CPU used, reads, and writes) from different executions can be aggregated together and compared against one another.  So, take this query for example:

SELECT * FROM foo WHERE id=4

Here, the parameter is “4”, so the parameterized query would look something like this:

SELECT * FROM foo WHERE id=@p1

Then, every other execution of this query, regardless of the value on the right side of the equal sign, reduces down to this same basic form.  It’s also the form you might see in the procedure cache if you viewed dm_exec_cached_plans and dm_exec_sql_text.
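
For instance, a quick sketch against those DMVs will show you the text SQL Server itself has cached (which columns you pull back is up to you):

SELECT cp.usecounts, cp.objtype, st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
ORDER BY cp.usecounts DESC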

There are a couple of approaches commonly used to normalize queries.  The first, and simplest, is to take a substring of the query in question and try to trim it down to its bare essence.  I’d guess most people looking at this sort of thing take this approach.  The problem with it is that parameters can appear virtually anywhere, especially when you consider subqueries, derived columns, and the like.  There’s no set substring length that you can always take to accurately normalize a query.  Thus, this approach is extremely error-prone and of limited usefulness in truly normalizing the queries from a trace.

Another approach is to parse the T-SQL text using your own parser.  This solves the problems of the simplistic substring approach, but unfortunately introduces some of its own.  First, parsers are notoriously difficult to write.  Even with the help of tools like Lex and Yacc, you still have a fair job on your hands to create a language parser, especially for a language as unwieldy as Transact-SQL.  Second, most SQL parsers available don’t handle the extensions Microsoft has made to Transact-SQL that well.  There are many free parsers out there, as well as some available for minimal charge, and most, if not all, have problems with Transact-SQL’s extensions.  Third, even a parser that fully supports Transact-SQL at a particular point in time can quickly become outdated; it has to keep pace with the evolution of SQL Server in general and the T-SQL language in particular, and it may not be able to do that.  Consider, for example, the kind of monkey wrench supporting CTEs would throw into such a facility.  Last, but not least, it takes time to parse code.  With a trace of any size, even a highly efficient parser would need significant time to work through it.  Full-blown parsing is a bit overkill, anyway.  What you really need is a literal identifier.

Fortunately, SQL Server 2005 offers a facility that partially solves the problem.  The sp_get_query_template stored proc takes the text of a query as a parameter, then returns the normalized text and the parameter list as output parameters.  Although it was not designed for general query normalization, it’s as close as you can currently get to invoking SQL Server’s own code to normalize your queries.
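
Here’s a minimal example of calling it against the query above (the variable names are mine; the proc just takes the query text plus two nvarchar(max) output parameters):

DECLARE @template nvarchar(max), @params nvarchar(max)
EXEC sp_get_query_template
    N'SELECT * FROM foo WHERE id=4',
    @template OUTPUT,
    @params OUTPUT
SELECT @template AS normalized_text, @params AS parameter_list

Run against a real table, this returns something along the lines of “select * from foo where id = @0” with a parameter list of “@0 int”.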

Unfortunately, the proc has several limitations: it doesn’t handle multi-statement batches, will throw a 207 error if your query text references a non-existent column in an existing table (just as a CREATE PROC will), and will throw an error when a statement doesn’t have any parameters (e.g., SELECT * FROM sys.objects).  But, it’s part of the way there.  For simple query parameterization, it’s probably the best way to go for now.  In concert with some substring trickery, this might get you most of what you need to do some basic query performance aggregation.
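
One way to live with those error cases is to wrap the call in TRY/CATCH and fall back to the raw text whenever normalization fails.  A rough sketch:

DECLARE @text nvarchar(max), @template nvarchar(max), @params nvarchar(max)
SET @text = N'SELECT * FROM sys.objects'   -- no literals, so the proc would normally raise an error
BEGIN TRY
    EXEC sp_get_query_template @text, @template OUTPUT, @params OUTPUT
END TRY
BEGIN CATCH
    SET @template = @text   -- couldn’t be normalized; keep the original text
END CATCH
SELECT @template AS normalized_text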

You can use sp_get_query_template against the TextData column returned by fn_trace_gettable to generate some useful query performance aggregations over time.  You can easily identify your N most expensive queries in terms of CPU used, reads, writes, duration, etc.  And you can then drill into individual executions from the trace to see which sets of parameters are most likely to be causing the query to perform poorly.  This can then inform your query and indexing design decisions.
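
For example, a sketch along these lines (the trace path is a placeholder, and the per-row normalization uses the TRY/CATCH fallback from above):

-- Pull the trace into a work table
SELECT IDENTITY(int, 1, 1) AS RowId,
       CAST(TextData AS nvarchar(max)) AS TextData,
       Duration, CPU, Reads, Writes
INTO #trace
FROM fn_trace_gettable(N'C:\traces\mytrace.trc', DEFAULT)
WHERE TextData IS NOT NULL

ALTER TABLE #trace ADD NormalizedText nvarchar(max)

-- Normalize each statement (the proc has to be called row by row)
DECLARE @id int, @text nvarchar(max), @template nvarchar(max), @params nvarchar(max)
DECLARE c CURSOR LOCAL STATIC FOR SELECT RowId, TextData FROM #trace
OPEN c
FETCH NEXT FROM c INTO @id, @text
WHILE @@FETCH_STATUS = 0
BEGIN
    BEGIN TRY
        EXEC sp_get_query_template @text, @template OUTPUT, @params OUTPUT
    END TRY
    BEGIN CATCH
        SET @template = @text
    END CATCH
    UPDATE #trace SET NormalizedText = @template WHERE RowId = @id
    FETCH NEXT FROM c INTO @id, @text
END
CLOSE c
DEALLOCATE c

-- The ten most expensive normalized queries by total CPU
SELECT TOP 10 NormalizedText,
       COUNT(*) AS Executions,
       SUM(CPU) AS TotalCPU,
       SUM(Reads) AS TotalReads,
       SUM(Duration) AS TotalDuration
FROM #trace
GROUP BY NormalizedText
ORDER BY SUM(CPU) DESC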