Subscriptions gone crazy: a case study from the SQL Dev Team

Hey There -

We have a production instance of SQL Server Reporting Services to support the reporting needs of the SQL Business Unit. Jaime Tarquino on my team now owns the server, and a couple of weeks ago we ran into an issue that I worked with him to solve. It's been a while since I blogged, so I thought I would share the issue in hopes you can learn from it, and to let you know that we in the dev team also know what it's like to deal with production system issues that disrupt our business and our regularly scheduled work.

We came to work expecting a busy day of meetings and test work as usual, but on this day we started seeing emails from users asking "why did I get 8 copies of this report in email today?". Usually subscription problems manifest as reports not being delivered to people's inboxes; now users were getting 8 (EIGHT!) copies of the same report. So we knew something new was up.

Our implementation of RS is configured as a 2-node NLB farm with the RS instances installed on separate machines and the catalog database on a third machine. We typically run the latest released version of RS, or something close to it (CTP, RTM, and SP releases), so we upgrade the server several times per year. During such an upgrade we typically follow a process similar to:

  1. Make a backup of the RS catalog database and the encryption key from the production machine (see the sketch after this list)
  2. Restore the catalog database to an instance of SQL Server on another machine
  3. Install the original, production version of RS on the test machine, configure it to use that instance of SQL Server for the catalog, and restore the encryption key
  4. Verify that a few reports and other features are working correctly
  5. Upgrade RS to the new version
  6. Verify again that a few reports and other features are working correctly
  7. If so, perform the upgrade process on the production instance
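
For reference, steps 1 and 2 boil down to a plain SQL Server backup and restore of the catalog database plus a backup of the encryption key. Here is a minimal sketch of the TSQL side, assuming the default catalog database name ReportServer and made-up file paths; the encryption key piece is handled with the rskeymgmt utility (or the Reporting Services Configuration Manager), not TSQL.

    -- Step 1, on the production catalog server: back up the RS catalog database.
    -- ReportServer is the default catalog name; adjust the name and paths for your installation.
    BACKUP DATABASE [ReportServer]
        TO DISK = N'D:\Backups\ReportServer.bak'
        WITH INIT;

    -- Step 2, on the test catalog server: restore it under the same name.
    RESTORE DATABASE [ReportServer]
        FROM DISK = N'D:\Backups\ReportServer.bak'
        WITH MOVE 'ReportServer'     TO N'D:\Data\ReportServer.mdf',
             MOVE 'ReportServer_log' TO N'D:\Data\ReportServer_log.ldf',
             RECOVERY;

    -- The encryption key is backed up and restored on the RS machines themselves, e.g.:
    --   rskeymgmt -e -f D:\Backups\rskey.snk -p <password>   (extract/back up)
    --   rskeymgmt -a -f D:\Backups\rskey.snk -p <password>   (apply/restore)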

As we had just finished up the release of SQL Server 2012, we had followed these steps to upgrade our production server, but we ran into an issue at step 5, "Upgrade RS to the new version". The old version of RS was an internal build and this was only a minor version upgrade, so setup could not upgrade the instance, and we had to postpone the upgrade until the next week.

We figured our current problems were related to something that happened during the upgrade testing, and we started investigating using some of the techniques I've covered in this blog. The first thing we noticed was that there were thousands of subscription events queued up in the Events table waiting to be processed, and there were thousands of entries in the trace log file with this error:

"Schedules, events, and subscriptions processing may be delayed because processing queue reached the maximum item count = <nn>. Items currently being processed: <nn>."

We added the logging of this error in our previous release specifically to help identify when the queue is not being processed.
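
If you want to see how deep that queue is on your own server, a query along these lines against the catalog database will show it. The event queue lives in an undocumented internal table (it is named Event in the catalogs I've looked at, even though we all call it the Events table), so treat this as a sketch and expect names to vary between versions:

    -- How many events are queued, by type, and how old is the oldest one?
    SELECT  EventType,
            COUNT(*)         AS QueuedEvents,
            MIN(TimeEntered) AS OldestEntry
    FROM    dbo.[Event]
    GROUP BY EventType
    ORDER BY QueuedEvents DESC;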

Both of those observations are clear signs that something is not right, but they only explain why reports would not be delivered; they do not explain why 8 copies were being delivered. As an interim fix to stem the "why am I not getting my report?" emails from our users, we deleted the older entries from the Events table; the backlog was so large that RS could not keep up with it and was falling farther and farther behind.
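
For what it's worth, that interim cleanup was essentially a DELETE against the same queue with a cutoff date, with the same caveats: it's an undocumented table, anything you delete will simply never be delivered, and you want a catalog backup first. Something along these lines (the 24-hour cutoff is made up):

    -- Purge queued events older than a cutoff so RS can catch up.
    -- Anything deleted here is gone; those subscriptions will not fire for it.
    DELETE FROM dbo.[Event]
    WHERE  TimeEntered < DATEADD(HOUR, -24, GETDATE());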

Along the way, we'd noticed a few other facts that did not seem relevant or related to the issues at the time:

  • There were entries in the trace log file for successful subscription runs (based on their schedule IDs), but not all of them had the corresponding records we would expect in the execution log from the subscription's report actually executing (see the query after this list)
  • One of the RS servers' trace log files had an excessive number of "Logon failure: unknown user name or bad password" errors
  • One of the rogue subscriptions had been disabled two weeks prior
  • Of all the subscriptions (hundreds) on our production box, only a few were delivering extra copies of the report
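
About that first bullet: the query below is roughly how you can do that cross-check yourself, comparing each subscription's last recorded run against what actually landed in the execution log. Subscriptions and Catalog are undocumented internal tables, and ExecutionLog3 is the supported view in recent versions (older builds have ExecutionLog2), so consider this a sketch:

    -- For each subscription, show its last recorded run and how many subscription
    -- executions of that report the execution log actually saw in the last day.
    SELECT  c.Path          AS ReportPath,
            s.Description   AS Subscription,
            s.LastStatus,
            s.LastRunTime,
            (SELECT COUNT(*)
             FROM   dbo.ExecutionLog3 e
             WHERE  e.ItemPath    = c.Path
               AND  e.RequestType = 'Subscription'
               AND  e.TimeStart  >= DATEADD(DAY, -1, GETDATE())) AS LoggedRunsLastDay
    FROM    dbo.Subscriptions s
    JOIN    dbo.[Catalog] c ON c.ItemID = s.Report_OID
    ORDER BY c.Path;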

Well into the evening, Jaime noticed that one of our instances of RS was configured to use the instance of SQL Server holding the backup copy of the catalog database, not the production catalog database. This was the smoking gun. We have a new person in our IT department who owns and supports these systems, and this was a simple human error. Things started to fall into place; it was pretty clear that we had two instances of RS processing the same subscriptions in two different catalogs.

This would explain one extra copy of the report being delivered, but not 8. And why were the old subscriptions still being processed?

As it turns out, the backup of the catalog database used for testing was fairly old, which explains why the old subscriptions were still firing. And apparently SQL Agent had been running on that machine for a while when RS wasn't, which built up a backlog of entries in the Events table that were not being processed by RS. When the RS service using that database was started, it began processing the backlog of Events table entries for those old subscriptions.
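
In hindsight, running a quick check like the one below against both the production catalog and the backup copy would have exposed the mismatch much sooner, since each catalog records which RS machines have joined it. Keys is another undocumented internal table, so this is a sketch based on the versions I've looked at:

    -- Which RS machines and instances are attached to this catalog database?
    -- The internal bookkeeping rows have no machine name, so filter them out.
    SELECT  MachineName,
            InstanceName
    FROM    dbo.[Keys]
    WHERE   MachineName IS NOT NULL;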

Also, the reports from the rogue subscriptions were all using data sources with SQL Authentication and SQL logins that did not have expiring passwords. The rest of the subscription/report data sources used Windows authentication, and the passwords stored in those data sources had all expired. That was probably a good thing for us in the RS team; otherwise there would have been literally hundreds of users in the SQL Business Unit and elsewhere in Microsoft getting peppered with old and duplicate email subscription deliveries.
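
If you're curious which of your own data source logins fall into that never-expiring bucket, that part is easy to check on the data source SQL Server instances using the documented sys.sql_logins view:

    -- SQL logins that are exempt from password expiration on this instance.
    SELECT  name,
            is_policy_checked,
            is_expiration_checked
    FROM    sys.sql_logins
    WHERE   is_expiration_checked = 0
    ORDER BY name;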

The remaining issue was the cause of the backlog of unprocessed subscriptions in our production catalog database. To be honest, I'm not sure I completely understand the cause of that one; we lost a lot of evidence when we manually deleted the old entries from the Events table, so the rest here is mostly speculation and educated guesses that I'm fairly confident can explain the issue.

We did recall that several of the entries in the Notifications table in the production catalog database - for subscriptions currently being processed - had been sitting there for hours. We've had problems with runaway report queries in the past (basically infinite loops in report dataset TSQL code), so it's likely that several of the processing threads had been tied up for a long time. That part was probably not related to the upgrade process at all.
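
A check like the one below is how you can spot those stuck deliveries; Notifications is an undocumented internal table, and I'm assuming the NotificationEntered, ProcessStart, and ProcessHeartbeat columns as they exist in the versions I've worked with:

    -- Notifications that entered the queue more than two hours ago and are either
    -- still waiting or still being worked on.
    SELECT  n.SubscriptionID,
            n.NotificationEntered,
            n.ProcessStart,
            n.ProcessHeartbeat
    FROM    dbo.Notifications n
    WHERE   n.NotificationEntered < DATEADD(HOUR, -2, GETDATE())
    ORDER BY n.NotificationEntered;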

Our RS installation normally uses, and needs, two instances of RS to keep up with both subscriptions and ad-hoc reports during the busy parts of the day. A single instance can struggle through and keep up, but performance does suffer, and that likely contributed to the problem.

For the subscriptions that were running simultaneously on both instances of RS, the SQL Server instance serving the report data was processing the same dataset queries in parallel, and it likely suffered from locking and contention, which caused those subscription reports to take longer than usual.
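
If you suspect that kind of contention, the documented sys.dm_exec_requests DMV on the data source SQL Server will show it while the subscriptions are running; a nonzero blocking_session_id means a request is waiting on another session's locks:

    -- Requests currently blocked on another session's locks.
    SELECT  session_id,
            blocking_session_id,
            wait_type,
            wait_time,
            command
    FROM    sys.dm_exec_requests
    WHERE   blocking_session_id <> 0;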

And finally, many of the queued entries in the Events table came from data-driven subscriptions. A data-driven subscription puts one entry in the Events table for each row returned by the DDS query, so we likely had some unlucky timing relative to a heavy DDS schedule.
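
If you want to see whether a particular subscription is flooding the queue, grouping the queued events by their payload will show it. In the catalogs I've looked at, EventData holds the subscription ID for TimedSubscription events, but again these are undocumented internals, so treat this as a sketch:

    -- Which event payloads (typically subscription IDs) account for the most queued events?
    SELECT  EventType,
            EventData,
            COUNT(*) AS QueuedEvents
    FROM    dbo.[Event]
    GROUP BY EventType, EventData
    ORDER BY QueuedEvents DESC;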

So in case it's not obvious, the lesson we learned is that when you are running a parallel test system like we do, make sure the production and test RS server instances are each configured to use the correct instance of SQL Server for the catalog database.

I hope you find this interesting and useful.

Thanks, Dean

This posting is provided "AS IS" with no warranties, and confers no rights.