SQL Server Assertions and Dumps – a typical example and some proactive help

I recently worked with a customer who was seeing the following error in their error log:

SQL Server Assertion: File: <qxcntxt.cpp>, line=1069 Failed Assertion = 'cref == 0'
Could not find an entry for table or index with partition ID 491897996509184 in database 2

As you would expect a mini dump was produced at the same time, which detailed the stored procedure that was being run when the assertion occurred.

In the case of such errors I would always advise you to contact us here at PSS so that we can analyze the dump and provide an explanation or solution. However this particular case reminded me of an interesting side issue where you can proactively help yourself, before contacting PSS.

You should consider running SQL Server Best Practice Analyzer on your server. You can download it from here. We regularly see cases where the running of the tool could have identified known issues in advance. When you run the tool initially, it checks for updates to the tool itself, but more importantly it checks for updated configuration files. For your reference (and if you're nosey like me) you can look at the configuration file it has downloaded here:

C:\Program Files (x86)\Microsoft SQL Server 2005 Best Practices Analyzer\EN\SqlBPA.config.XML

(removing the x86 from the directory path  if you don’t have a 64 bit installation of windows)

Within PSS nowadays we add rules to this tool when we find interesting new issues or where we find common causes of problems that our customers experience. If you run the tool regularly you’ll pick up our latest updates, things that we in PSS think are important you should check for, and check for them automatically. A large number of the support cases that I work on contain known issues of configuration and the like which could be avoided or at least highlighted by this tool.

The reason that I mention this now is that the tool currently contains a rule for a known issue with a driver from Dell which can cause random data corruptions which show up as assertions and errors similar to the one above.

For reference the driver in question is LSI_SAS.SYS where the version is less than 1.21.26.01. If you notice that you are running an environment that has the matching configuration options ( >4GB of system memory and on servers with a SAS backplane, Microsoft Windows 2000 or 2003 and Dell OpenManage Server Administrator Storage Management Software) to be susceptible to this problem, you can download the update which resolves it here.

It actually turned out that this was not the cause in this case, but it’s always good to check these things, and the BPA is a great way of doing it.

The key to actually finding the cause of this particular error was to note the rest of the SQL Server error logs, which were littered with errors 602, 676, and 677. For example:

Error: 602, Severity: 21, State: 30.
Could not find an entry for table or index with partition ID 491897996509184 in database 2. This error can occur if a stored procedure references a dropped table, or metadata is corrupted. Drop and re-create the stored procedure, or execute DBCC CHECKDB.

Error 676
Error occurred while attempting to drop worktable with partition ID 491897996509184

Error 677
Unable to drop worktable with partition ID 491897996509184 after repeated attempts. Worktable is marked for deferred drop. This is an informational message only. No user action is required.

In themselves, these errors are not completely conclusive, as further memory dumps may be produced, which again would also need to be analyzed. However sometimes they are caused by this problem https://support.microsoft.com/default.aspx?scid=kb;EN-US;937343

If you are getting errors such as these, or others like them, you should consider what the errors themselves recommend. You should run DBCC CHECKDB to check your data integrity (a data corruption can cause all manner of different assertions depending upon its nature) and you should check your hardware for possible problems as well.

I guess what I’m trying to say in conclusion is that there doesn’t have to be one single simple cause when you look at particular errors such as these. The basics still always apply;

You should run DBCC CHECKDB regularly (this is another BPA rule it will highlight it if you haven’t)
You should keep an eye on the overall health of your environment, especially the storage
You should keep an eye on your SQL Server error logs and consider all the events together.

The answer to this actual problem? Well it was the hotfix noted above, but we had to collect some dumps to prove this definitively, and that’s something that you do need PSS involvement with.