SharePoint and SQL Server’s Error 8628

Hello! Michael Garstin here. I’m one of the newest members of the SharePoint Escalation Team. I came here after spending more than 7 years on the SQL Server CTS Team. Naturally, my blog posts will have a SQL Server focus and will often be based on real-world cases I have worked on.

Error 8628 should be a very rare occurrence for any SQL Server system; yet I saw two cases for this within a few days of each other. The format for this error message looks like this: A time out occurred while waiting to optimize the query. Rerun the query.

As you can see, the query is not yet running on the server, it is still waiting to be optimized (in other words, get an optimal execution plan compiled) before being sent to the execution engine. The timeout is defined for 6 minutes – an eternity for today’s servers – and is a memory-based error message. It is more likely to occur on 32-bit servers than on 64-bit servers.

A Little Bit Of SQL Server History

In SQL Server 7.0 the User Mode Scheduler (UMS) component was created. It was designed to manage the thread scheduling needs for SQL Server rather than allowing the Windows dispatcher complete control of SQL Server threads. This improved the database engine scalability. In SQL Server 2005, UMS evolved into SQLOS and memory management functions were included with the scheduling controls. UMS and SQLOS create Schedulers to abstract the server CPUs. SQLOS creates MemoryClerks and MemoryBrokers to handle memory management.

In SQL Server 2000, query compilation was limited by your hardware. This can lead to problems such as one complex query compilation stealing most or all of the buffer pool memory. In SQL Server 2005, SQLOS introduced the compile gateways to throttle compilations.

How Do The Compile Gateways Work?

The compile gateways are built around the Windows Semaphore synchronization object. Unlike a Critical Section, the Semaphore can allow more than one thread access to a section of code.

The Query Optimizer will usually “re-write” a query as it considers the myriad ways that it can use to return results. Set-based algebra has similar properties to mathematical algebra (I’m sorry if this brings up nightmares for some of you). For example, consider the following query:

SELECT * FROM A INNER JOIN B ON (A.a = B.b) INNER JOIN C ON (A.a = C.c) INNER JOIN D ON (A.a = D.d)

The inner joins can be evaluated in any order (ABCD, ABDC, ACDB, etc.) without changing the results of the query. As more tables are added, the possible plans to evaluate increases exponentially. Fortunately, the Query Optimizer does not have to consider all the plans and it contains quite a few tricks to make this process as fast as possible. However; this work does require system resources in terms of CPU time and memory.

With complex queries, the compile gateways throttle the number of compiling queries and prevent them from stealing too much memory from the buffer pool. I like to think of each compile gateway as a successively finer-grained sieve.

There are 4 x SQLOS Schedulers number of Small Gateways (4 x 8 = 32 configured units on my customer’s server). The Medium Gateways have a 1:1 ratio with SQLOS Schedulers and there is only 1 Big Gateway per SQL Server instance.

You can view information about these gateways with the command: DBCC MEMORYSTATUS

This command has a verbose output with many record sets so I recommend switching the Management Studio output to Text Mode (CTRL+T) first. The sections that we are interested in are near the end of the results. Here is a sample from my customer’s case during the problem:

Optimization Queue Value
------------------------------ --------------------
Overall Memory 43033559040
Target Memory 33818238976
Last Notification 1
Timeout 6
Early Termination Factor 5

Small Gateway Value
------------------------------ --------------------
Configured Units 32
Available Units 0
Acquires 32
Waiters 63
Threshold Factor 380000
Threshold 380000

Medium Gateway Value
------------------------------ --------------------
Configured Units 8
Available Units 0
Acquires 8
Waiters 22
Threshold Factor 12
Threshold 88068330

Big Gateway Value
------------------------------ --------------------
Configured Units 1
Available Units 1
Acquires 0
Waiters 0
Threshold Factor 8
Threshold 528409984

Most queries do not need these gateways at all; they are free to compile and you are only limited by your CPUs. If a query requires at least 380,000 bytes (on x64 hardware, x86 and IA-64 have different hardcoded values) to build an execution plan, then it must acquire the Small Gateway. You will see the Acquires value increase and the Available Units value decrease appropriately.

In my customer’s case, all 8 Medium Gateways have been acquired and there are currently 22 other queries waiting to acquire a Medium Gateway. If a SQLOS Task has acquired a gateway, it also must acquire any of the coarser-grained gateways. In other words, acquiring the Big Gateway means that the Task has also acquired one of the Medium and Small Gateways too. This also means that 8 of the Small Gateways are acquired by the medium queries. There are 24 other queries that have acquired the Small Gateway and 63 more that are waiting there.

That is a lot of waiting! How can you tell that this happening without digging into DBCC MEMORYSTATUS every few minutes? Here are two queries that you can use to monitor performance data of the gateways:

/* SQL OS Tasks currently waiting for a compile gateway
*/
SELECT GETDATE() AS check_time, session_id, wait_type, wait_duration_ms
FROM sys.dm_os_waiting_tasks
WHERE wait_type = 'RESOURCE_SEMAPHORE_QUERY_COMPILE'

/* Total time since server-start spent waiting for a compile gateway
*/
SELECT GETDATE() AS check_time, wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'RESOURCE_SEMAPHORE_QUERY_COMPILE'

If a SQLOS Task waits to acquire a gateway for more than 6 minutes the Task will be cancelled. Error 8628 will be sent to the application and logged into the SQL Server errorlog file.

What Does SharePoint Do To Cause This Error?

Typically, very complex queries will need to acquire the compile gateways. When you build a list view in SharePoint, the data for each view column is stored in the AllUserData table. The view UserData shows the current versions of a list item in AllUserData.

The AllUserData table contains several nullable columns for storing various data types in your list view. These columns are creatively named: nvarchar1, nvarchar2, nvarchar3…int1, int2, int3, int4…datetime1, datetime2, etc. Here is a summary for you.

DateType

Number Of Columns

bit

16

datetime

8

float

12

int

16

sql_variant

8

ntext

32

nvarchar

64

So what happens if you want to put 17 columns that store integers in your view? SharePoint inserts a second row into AllUserData and increases the tp_RowOrdinal column by 1. Each time one of these data type thresholds is exceeded, a new row is added to the table. When your list view is being rendered, SharePoint builds a dynamic query to join all of these rows back into a single record set.

Perhaps you can already guess how this ties in to the query compile gateways…

If you have too many columns in your view, the SQL Server Query Optimizer will require extra CPU time and memory resources to search the possible execution plans. In my customer’s case, we found a staggering 82 LEFT OUTER JOINs in the dynamic query FROM clause. This was because the query contained 656 datetime columns!

In addition to retrieving data to render a list view, updating a list view requires another dynamic query to be built. This UPDATE query will call the stored procedure proc_UpdateListItem once for every row in AllUserData used by the list view. So this is a second query that needs to acquire the compile gateways.

Summary

As it turned out, my customer could only see 16 columns in the list view. That was a mystery at first, but a little bit of digging showed us that all of those datetime columns were hidden from view. We suspected that a third-party tool added the columns programmatically during a site migration. The site was migrated a second time and the columns did not return.

Be mindful of SharePoint’s capacity recommendations (see section “Column Limits”) when you are building these views. We also published an article about designing large lists.

Additional Reading

I hope you found this helpful.

-Michael Garstin