Doctor, this SQL Server appears to be sick….

I thought the PASS Summit was a great success this year. I always enjoy speaking at the PASS Summit (this year I spoke on Wait Types. I’m working now on starting to populate the Wait Type Repository) but I also really enjoy working at the SQL Server Clinic. This year CSS combined its efforts with the SQLCAT team. As part of that effort, I though you might want to see a sampling of some of the questions I received and the solutions provided for them:

1. What is the EXECSYNC wait type?

A customer attended my talk on wait types and just happened to hit an issue in production before the conference where he saw a wait type of EXECSYNC and wanted to know what it meant. If you look at the BOL for sys.dm_os_wait_stats it says

Occurs during parallel queries while synchronizing in query processor in areas not related to the exchange iterator. Examples of such areas are bitmaps, large binary objects (LOBs), and the spool iterator. LOBs may frequently use this wait state.

But what I also found is that this wait type is marked for a task related to a parallel query when the task is waiting for other tasks in the query to abort. He told me that it was true that the state of the request was KILL/ROLLBACK and that in fact they had tried to kill the query. If my memory serves me correctly this was also not just a query but an ALTER INDEX. He also said that several tasks for the session were waiting on EXECSYNC. I asked him what task as part of this parallel query didn’t have this wait type? The customer didn’t know the answer but I explained this is what you should focus on. In this case like CXPACKET, a long EXECSYNC wait is just a symptom of something else. It just indicates coordination among parallel query tasks so find the status of the task that doesn’t have this wait type.

When I get questions like these after I answer the initial one I often go back to the source of the question to ensure the “right” problem is being addressed. So I happened to ask “Why do you need to KILL the ALTER INDEX?”. The customer looked at me and agreed that perhaps the right thing to do was to figure out why the command needed to be killed. He said that this index rebuild normally ran in minutes but sometimes would just “hang”. So the key to this problem is not why EXECSYNC is showing up but rather why is the index rebuild “hanging” in the first place.

2. Why is the model tlog growing?

A customer told me the transaction log size of the model database was growing unexpectedly on his system and he could not figure out why. He doesn’t ever change model or has any application that even accesses model. My first thought was something was accessing model. The tlog just doesn’t grow on its own. Something is logging transactions. We could use SQLTrace or even try to dump out the log but the customer was curious as to my opinion of what could cause this knowing these tracing options were available.

As we both talked about possibilities, something occurred to me. I asked him “Do you backup model?”. He said in fact, yes they back it up nightly. I asked him how they backed it up and he said a full database backup.

Bingo. this is the likely answer. When you perform a full backup a database this itself is a logged operation and the tlog is not truncated. So since model is a FULL RECOVERY database, after time a series of full backups will cause the log to grow.

3. I’m seeing CPACKET causing problems for my server

A customer came in and told me he heard I “knew everything about wait types”. Flattering but I’m not sure anyone can claims this. But I did just give a talk on the topic so offered to help. The customer told me that “CXPACKET was causing him problems on his server” Now if you saw my talk my immediate reaction to this statement is “Are you sure?”. Why do I ask this? Because the CXPACKET wait type is just an indication of coordination among parallel operators, that’s all. It in itself is not a wait on a resource. It is a “synchronization” wait. If you see the wait time for this as “high” it simply means that you have:

a) Parallel queries running on your server

c) Parallel queries that may be taking longer than you expect

The key here is a) Do you expect parallel queries? b) If not, you perhaps need some query tuning c) If so, go investigate one of these with high wait times for CXPACKET and find out what the tasks with the same session_id look like that DON’T have this wait type. If they are RUNNING then you simply have a query that just takes a long time (with perhaps a non-optimal plan). If the other task has a wait type of something like ASYNC_NETWORK_IO you have a problem with processing results from the client that could be slowing down the overall query. In either case,  the appearance of a parallel query may be a sign of possible tuning opportunities on your system.

4. Can’t you tell me what is wrong with my I/O?

Antibiotics won’t cure this one<g> One of the toughest problems I see in CSS is symptoms from SQL Server that some type of delay is occurring with the I/O system on the computer or attached disk system. The reason it is so tough is that it is very hard often to figure out what “layer” of the I/O stack or I/O “path” is causing the problem.

Some common symptoms in SQL Server of I/O delays are:

a) High IO Latch Waits

b) High WRITELOG Waits

c) A message in the ERRORLOG that looks something like this:

SQL Server has encountered X occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [<filename>] in database [<dbname>] (<dbid>).  The OS file handle is <hex value>  The offset of the latest long I/O is: <hex value>

In each of these cases, SQL Server is just a “victim”. The engine is just producing diagnostics that indicate a Windows API call to read or write data is delayed returning back from the Operating System. A Windows programmer has no other data to give you insight into the delay at the kernel level or hardware. The big question is “where is it delayed?”. It is a classic question for which there is likely not one great solution for all.  There are all types of techniques for you to use, but no single way I know of to trace down the layers of the I/O stack like you can over the network with Netmon. I’m not saying don’t call SQL CSS if you have problems like this. But I am saying don’t be surprised if we tell you what you can already know by reading this. This is not caused by SQL Server, but rather a sign of “something else” causing an delay in I/O.

5. I’m seeing the leader blocker have a SPID=-2

In 2008 at PASS, I had a customer ask me why the lead blocker on his server was session_id = –4. I didn’t know off the top of my head but I asked him to email me the details. I quickly found out that this scenario is one in which a task is blocked on a latch but we could not figure out the owner (for performance reasons). This isn’t always the case but if this occurs, we just mark the blocking session id as –4.

Well what do you know but the same customer came to the clinic this year this time blocked by session_id = –2. This one I had seen before and off the top of my head knew it meant an orphaned distributed transaction. In fact there is a fairly lengthy discussion of this on this blog and in the SQL BOL T-SQL KILL command as special syntax is available to free up this situation.

What we don’t know is how he got into this situation. We suspect it is the design of the application which uses a linked server stored procedure which on the remote server calls back into the original server all part of the same execution.

6. Why do I have to wait to get into the pool?

Another wait type issue came up while I was in the clinic. A well-known MVP asked me what I knew about the THREADPOOL wait type. Turns out he came to my session but had to leave early and missed my slide on THREADPOOL. He has a production server where they see wait stats showing this wait type but are not sure why.

THREADPOOL is a special wait type that is used because of a wait on a resource. Why special? Because you will never see this in sys.dm_exec_requests. THREADPOOL waits only occur when a task is waiting to get assigned to a worker thread. In other words, the resource problem is worker threads. Since a request means a task is already bound to a worker, you will only see this wait in sys.dm_os_waiting_tasks. And… since this wait means you are out of worker threads, it is likely your connection to run sys.dm_os_waiting_tasks will not work since you need a worker thread just to process your login or query. So typically I only see this when using Dedicated Admin Connection, in a user dump, or in sys.dm_os_wait_stats after the fact.

Don’t jump out there and think the solution when seeing this is to bump up ‘max worker threads’. In fact, in 2005 and 2008, this value is typically 0 in sp_configure and set to a number based on your CPU architecture and # of CPUs.

In most cases the cause of THREADPOOL waits is that workers are tied up with a long blocking chain or running long running pre-emptive work such as extended stored procedures. Find the original cause of why all workers are tied up before you consider bumping up ‘max worker threads’.

You can read more about Jonathan Kehayias’s tale on his blog at https://sqlblog.com/blogs/jonathan_kehayias/archive/2009/11/12/the-benefits-of-attending-pass-realized.aspx

Thanks for kind remarks Jonathan. We too think the Clinic adds great value to PASS and it is really satisfying to see short-term results from customers like this.

7. What about the TokenAndUserPermStore?

I had a customer who wanted more details on what the TokenAndUserPermStore is used for? I was interested to know how this customer knew what this term meant. Turns out they had an issue with performance and CSS pointed them to this as the problem.

I explained that this term referred to a cache of security credentials for non sysadmins on metadata. Several bugs and architecture problems in SQL Server 2005 with this cache have resulted in high CPU problems involving high spinlock contention for several customers.

Some customers have seen this cache still grow and cause them perf issues on SQL Server 2005 even after applying specific targeted fixes. So I pointed this customer to this blog post on security cache to get the details on what this cache is, how to detect if this is your issue, important bug fixes to make sure you have installed, and trace flags introduced in SQL Server 2005 SP3 to control the cache.

One little nugget I told this customer that I had seen was that the security cache could grow when you don’t expect it based on how the application was developed. I’ve seen cache entries for objects like a stored procedure grow over time because the application used methods that required the provider to get metadata about the procedure. For example, in ADO if you call something like Parameters.Refresh the provider will send across a system stored procedure called sp_procedure_params_rowset. This is used to “describe” the metadata of the procedure including its parameters. If the app makes this call on behalf of a non sysadmin user, the “tokenperm” cache will require entries for that user for the proc. Expand this to many of your procs and the cache can grow quite large just for these types of operations.

It is important to note that in SQL Server 2008, we made several changes to avoid the cache becoming a  performance problem. While the cache still may grow on your system for various reasons it should not cause the spinlock contention as seen in SQL Server 2005. Also, sp_configure options were added to add more fine grain control over the cache vs trace flags in SQL Server 2005.

8. I’m getting login failed in the ERRORLOG every 1 minute

This was an interesting tale. A customer brought up his laptop to me in the clinic with a terminal server session to his production machine running SQL Server 2005. In the Event Log was a bunch of login failed messages that looked like this:

2009-11-24 09:26:44.12 Logon       Error: 18456, Severity: 14, State: 16.
2009-11-24 09:26:44.12 Logon       Login failed for user 'NT AUTHORITY\SYSTEM'. [CLIENT: <local machine>]

I quickly was able to lookup State 16 and found out the failure was an invalid database context. We also noticed that the login was coming from something on the same computer as SQL Server (the local machine). The customer said he understood this but he didn’t know of any SQL application logging on locally to the server? I told him one solution was to use SQLTrace to find out the source of the failed logins. But before he turned this on, I noticed the failed logins were exactly every 1 minute apart. I asked him if he had any SQLAgent jobs set to run every 1 minute. He didn’t think so but we checked anyway.

Sure enough a SQLAgent job was set to run every 1 minute and was configured with a T-SQL job step with a “blank” database context. Now this really puzzled us as there is now way to create a SQL Agent T-SQL job step with a blank database context. The job definition looked something like this

image
The customer noticed in the “Command” window a stored procedure he recognized. So he opened up an IM session with a colleague back at the office and asked him about this proc. Turns out the procedure had moved to a new database and the old database had been deleted. But….alas someone forgot to update the job to the new database name. So when you remove a database, the job is not invalidated. Instead when you look at the job definition in the GUI, the database shows up “blank” because it no longer exists. If you look at the job definition in sysjobsteps in msdb the removed database name still exists for the step context. So a nice lesson learned for a possible enhancement for SQLAgent is to warn the user that the database referenced in the job no longer exists instead of just showing this as “blank”.

Notice what the new error message looks like in SQL Server 2008 in the ERRORLOG for this problem.

2009-11-24 09:16:38.07 Logon       Error: 18456, Severity: 14, State: 38.
2009-11-24 09:16:38.07 Logon       Login failed for user 'NT AUTHORITY\SYSTEM'. Reason: Failed to open the explicitly specified database. [CLIENT: <local machine>]

So no longer do you need to know what the State number means (in fact noticed the State number has changed). The text in the ERRORLOG gives you the reason. This is the case for many of the login failed states but you can only see this in the ERRORLOG as an admin. Another possible enhancement to the context is to put in the Application name along with the client machine. This way we would have known from the ERRORLOG message from the beginning it was a SQLAgent job.

 

Not all of the questions I received in the clinic were answerable “on the spot” as these but for the most part I was able to give some direction to just about every customer that came into the clinic and talked to me directly. I think this was the case for most of the customers that came into the clinic and spoke to CSS or the SQLCAT team. It was not unusual at all for customer to bring their laptop with a terminal server session and let one of our CSS engineers give their opinion about a current production situation fairly quickly. Of course the great thing about this 1;1 interaction is that we were able to teach customers how we approach these kinds of problems. Also you should know that we answered questions on any topic related to SQL Server in the clinic. Analysis Services, Replication, Reporting Services, Connectivity, Clustering just to name some others. And with the SQLCAT team in the room we often “partnered” to talk to a customer about their architecture and design and combine that with CSS knowledge on “what to look out for” or collaborate on a complex problem.

 

Bob Ward
Microsoft