Modern NAV/SQL troubleshooting

This post is the first in a planned series to describe various "modern" methods for troubleshooting performance problems with Microsoft Dynamics NAV on SQL Server. The idea is also to make the best out of information that already exists, either in other places on this blog or anywhere else.

Modern Troubleshooting:
The idea about "Modern Troubleshooting" is:

  • To take as much of the hard work out of troubleshooting as possible.
  • To make the system itself (SQL Server and NAV) help you finding out what is wrong with it.
  • To collect as little and as specific data as possible, to avoid having to read through for example gigabytes of trace files.
  • To do the troubleshooting remotely as much as possible, rather than spending days on an individual system on site.

I want to make it clear that with "Troubleshooting", I mean "Trouble". And the methods I describe here will be quick-fixes for when something has gone wrong already. I don't mean to reduce the need for thorough performance- or scalability reviews, which can still require lots of work and understanding of the customers business. So I definitely don't want the "Modern Troubleshooting" to replace specialist consulting and ongoing (especially pro-active) performance tuning. Of course it is always better to avoid Trouble in the first place.

Identifying Trouble-queries:
One of the most useful queries I have used for performance troubleshooting recently, is the "TOP 30"- query described here:
Simple query to check the recent performance history

It returns the queries that are currently in SQL Server's plan cache, sorted with the queries causing most reads at the top. Every time SQL Server generates a query plan, then this plan is stored in this cache. And every time SQL Server (re)uses this plan it will also update some statistics about the plan, like execution_count, number of reads, etc. It is this information that the "TOP 30" query gives you.

SQL Server's plan cache is changing all the time, depending on what queries SQL Server runs, so you may get different results depending on what time of the day you run the query. The plan cache is also reset when SQL Server restarts, or you can reset the cache by running DBCC FREEPROCCACHE.

The result of the "TOP 30"-query is easy to copy into an Excel spreadsheet for further analysis, or to send to someone else. So it is simple data to collect, as long as the customer is on SQL2005 or later (Being based on Dynamic Management Views (DMW) which was introduced in SQL Server 2005, it not work for SQL Server 2000).

 

What to look at:
After receiving the result of the query - preferably in an Excel spreadsheet - these are the things I look at:
Being sorted by "max_logical_reads", you have the "worst" query at the top. But also look at execution_count. If a query ran just once, it might have been a batch job, or something else that is not really causing any problems. Of course, queries that have an execution_count in the 100s or 1.000s may be more relevant to look at. Also see if the queries (stetement_text) look similar, or if many of them look to be in the same area (same tables).

The column diff_quota shows max_logical_reads divided with min_logical_reads. If this number is high, it means that the query plan is inconsistent. This can be either because of inconsistent use of NAV (for example users applying different filters on the same table). Or, because a query plan is good for some queries but bad for others. In this case, some times you can affect the way that SQL Server creates query plans, either by adding RECOMPILE hints, plan guides or index hints. Or by upgrading to a newer version of NAV client (for example see the post "SQL Pre-processing in Microsoft Dynamics NAV 5.0 SP1" for how NAV 5 SP1 will cause different query plans).

If diff_quota is low, it means that the query plan is just consistently bad, which means that it is more likely that the query itself is bad. And, you will have to look for reasons why that query consistently causes the number of reads that it does. The "TOP 30"-query can't really help finding out why a query is causing many reads. But at least it can some times identify which queries to investigate first, which can be a very time consuming task otherwise (collecting and analysing profiler traces, etc).

You can also look at max_elapsed_time, but keep in mind that when a query takes a long time to run because it is being blocked, then the real problem is somewhere else (in the blocking query). So, if a query has a high max_elapsed_time, then see if the query contains a lock (WITH UPDLOCK). If it does, then you are most likely looking at a blocking problem which requires a wider look, and which often cannot be solved by the query you see.

So the "TOP 30"-query is always a good place to start. Some times it won't help you solving any problems, but then you will know that without spending much time. Other times, the query tells you right away which queries are causing problems, which can save you a lot of time. So I would always run this query first. And only if it doesn't help, then begin to look at collecting further information.

 

 

Lars Lohndorf-Larsen

Escalation Engineer

 

These postings are provided "AS IS" with no warranties and confer no rights. You assume all risk for your use.