TempDB Monitoring and Troubleshooting: Allocation Bottleneck

This blog continues the discussion on the common issues in TempDB that you may need to troubleshoot. In the blog https://blogs.msdn.com/sqlserverstorageengine/archive/2009/01/05/tempdb-monitoring-and-troubleshooting-io-bottleneck.aspx, we discussed how to identify and troubleshoot IO bottleneck in TempDB. In this blog, I will describe how to indentify allocation bottleneck and to troubleshoot it.

As you may recall, the allocation bottleneck is caused when allocation structures are accessed by concurrent threads in conflicting modes. Please refer to https://blogs.msdn.com/sqlserverstorageengine/archive/2009/01/04/what-is-allocation-bottleneck.aspx for details. In this case, the concurrent thread(s) will need to wait to acquire the pagelatch thereby slowing the operation. The good thing is that the pages containing allocation structures (GAM, SGAM, PFS) are well known and have fixed page numbers in each file. For example, in file-id 1, the allocation pages IDs are

· PFS – 1

· GAM – 2

· SGAM - 3

A PFS page will appear every 8088 pages in a file. The GAM and SGAM will appear every 511232 pages and similarly in other files.

Diagnosing:

You can run the following DMV query to find any latch waits that occur in allocation pages

select   session_id, wait_duration_ms,   resource_description

      from sys.dm_os_waiting_tasks

      where wait_type like 'PAGE%LATCH_%' and

            resource_description like '2:%'

Since the database id of TempDB is 2, the search argument ‘2.%’ represents any page in TempDB across any file. If this page happens to be GAM, SGAM or PFS, it will represent allocation bottleneck. Note, in a concurrent application, some blocking is expected so you will need to baseline the allocation waits when your application is performing normally. Only when the waits exceed the baseline significantly, it signals that you are incurring allocation bottleneck.

Troubleshooting:

SQL Server recommends the following

· Create atleast as many files of equal size as there are COREs/CPUs for SQL Server process. The rationale is that at a given time, the number of concurrent threads is <= number of COREs/CPUs. Don’t confuse this with the number of active sessions/batches.

· Enable TF-1118 if you are encountering bottleneck in SGAM structures.

· If you are still encountering allocation bottleneck, you will need to look at your application and see which query plans are creating/dropping objects in TempDB and if these objects are being cached and take corrective action when possible. Most IT shops have very limited choice here as they don’t own the application code.

Here I want to point out one seemingly harmless step to solve allocation bottleneck that can in fact make it worse. Say you are encountering allocation bottleneck and you decide to add one more file to the mix hoping that allocation will spread further. Well, if you recall proportional fill methodology; all new allocations will favor the newer file because it has the most free space. So suddenly, the allocation problem becomes even worse. It is a good practice to restart SQL Server when you add a new file, clearly it needs to be of the same size as other files, to TempDB.

Thanks

Sunil Agarwal