24 hour operation, several thousand mailboxes, high concurrency, heavy profiles – how do you tune Online Maintenance?

For companies that have to support a lot of ‘heavy usage’ clients connecting to their Exchange 2007 platform 24 hours a day, tuning online maintenance is going to be important.

You have to be quite careful about staggering online maintenance to minimise its impact on performance, and it can be a bit of a juggling act.  You might find you have to change your database configuration a number of times before you get it right.  This blog outlines an approach to tuning online maintenance (OLM) and is most relevant to big deployments with several thousand mailboxes per CCR node and a constant high concurrency throughout the day and night.  I’ll break this blog up into 3 bits:

  1. What is OLM and which bits affect what
  2. Best Practice to Follow
  3. What to Monitor

What is OLM and which bits affect what

I’m going to focus on the 3 main parts of OLM which have the greatest impact on performance: Online Defrag (OLD), deleted item\mailbox purges and, if you enable them, database checksumming and page zeroing.

OLD – online defragmentation essentially recovers ‘deleted’ pages as whitespace for reuse.  This leads to an increase in both read and write I\O, which translates into an increase in the RPC latency experienced by an Outlook client or a CAS proxying a request on a client’s behalf, in this instance for both read and write operations.

Database Checksumming – not on by default but should be enabled when using CCR and VSS based backups.  (Checksumming would normally be done during a streaming backup.  If you’re now taking backups of the passive node using VSS, the active database is never being checksummed, risking bitrot.)  The main impact of database checksumming on the performance of a mailbox role server is on read I\O against the database disks.  Similarly to OLD, this will translate into an increase in RPC latency experienced by an Outlook client or a Client Access role server proxying a request on a client’s behalf, in this instance specifically for read operations.  Microsoft IT sees a 10% increase in RPC Averaged Latency as a result of database checksumming, which is not inconsiderable.

Page Zeroing – again not on by default, but it is a security option that prevents recovery of data that has been deleted.  Like database checksumming, this was normally achieved during a streaming backup, so options were introduced in Exchange 2007 RTM and then again in SP1 to provide the ‘Zero Database Pages During Checksum’ option, ensuring that page zeroing was still possible when taking VSS based backups and providing additional security to every copy of your data.  Some examples of the impact of page zeroing can be found here.  One useful piece of guidance is that the impact of the first pass can be considerable, which is why it is recommended to turn page zeroing on when you first deploy your servers, or to make sure that the first pass runs out of hours and that storage group maintenance periods are staggered.

For subsequent passes the impact will be on read I\O; there will be a slight processor increase and you can expect RPC latency to rise by as much as 20%.
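To make the staggering idea a bit more concrete, here is a minimal PowerShell sketch that spreads maintenance window start times across storage groups.  The function name New-StaggeredWindow, its parameters and the storage group names are all hypothetical, and the start hour, window length and offset are illustrative values rather than recommendations.  It only calculates times and does not touch any Exchange settings; it also assumes PowerShell 3.0 or later for [pscustomobject].

```powershell
# Hypothetical helper: calculate staggered maintenance windows (whole hours only).
# It does not change any Exchange configuration.
function New-StaggeredWindow {
    param(
        [string[]]$Database,          # storage group / database names (illustrative)
        [int]$FirstStartHour = 22,    # hour (0-23) at which the first window opens
        [int]$WindowHours = 4,        # length of each window in hours
        [int]$OffsetHours = 2         # gap between consecutive window start times
    )
    for ($i = 0; $i -lt $Database.Count; $i++) {
        # Wrap past midnight with modulo 24
        $start = ($FirstStartHour + $i * $OffsetHours) % 24
        $end   = ($start + $WindowHours) % 24
        [pscustomobject]@{
            Database = $Database[$i]
            Start    = '{0:00}:00' -f $start
            End      = '{0:00}:00' -f $end
        }
    }
}

# Example: three storage groups, 4 hour windows starting 2 hours apart from 22:00
New-StaggeredWindow -Database 'SG01','SG02','SG03' -FirstStartHour 22 -WindowHours 4 -OffsetHours 2
```

The offset is the knob to play with: a smaller offset means more windows overlap and more databases are under maintenance at once, while a larger offset spreads the load but pushes later windows further into the working day.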

Deleted item\mailbox purges - the performance impact of item and mailbox purge is likely to be minimal in relation to OLD and checksumming processes and will largely depend on the rate at which messages and mailboxes are deleted. There are two likely impacts of item and mailbox purge:

  • Increase in directory lookups
  • Increase in disk I\O

The impacts are likely to be minimal in relation to database checksumming and OLD and will take place at the beginning of the OLM window.  Although minimal, the impact is likely to translate into an almost negligible increase in message delivery times, client logons and client response times.

Best Practice to Follow
What best practice guidelines do you need to follow?

OLD - There are two guidelines to ensure that OLD is operating efficiently:

  1. OLD completes a full pass of each database within 2 weeks; and
  2. The ‘database pages read per second’ to ‘database pages freed per second’ ratio remains between 50:1 and 100:1.

If the Read:Freed ratio is greater than 100:1 then the OLD window should be reduced.  If the Read:Freed ratio is less than 50:1 then the OLD window should be increased.  The ratio should remain within 50:1 and 100:1 across each database pass.
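The Read:Freed rule is easy to get the wrong way round, so here is a minimal PowerShell sketch of it.  The function name Get-OldWindowAdvice and its return values are my own invention for illustration, and the inputs are assumed to be averages of the two OLD counters taken over a database pass.

```powershell
# Hypothetical helper: classify the OLD Read:Freed ratio against the 50:1 to 100:1 guideline.
function Get-OldWindowAdvice {
    param(
        [double]$PagesRead,    # averaged pages read per second during OLD
        [double]$PagesFreed    # averaged pages freed per second during OLD
    )
    # Avoid dividing by zero when no pages were freed during the sample
    if ($PagesFreed -le 0) { return 'NoPagesFreed' }

    $ratio = $PagesRead / $PagesFreed
    if ($ratio -gt 100) { return 'ReduceWindow' }      # greater than 100:1
    if ($ratio -lt 50)  { return 'IncreaseWindow' }    # less than 50:1
    return 'WithinGuideline'                           # between 50:1 and 100:1
}

# Example: 7500 pages read for every 100 freed is a 75:1 ratio
Get-OldWindowAdvice -PagesRead 7500 -PagesFreed 100
```

A single sample can be misleading, so it is probably best to feed this averages collected across a full pass rather than a momentary reading.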

Database Checksumming - Database checksumming should be allowed to complete a full pass across each database within 2 weeks.  Bear in mind that database checksumming will consume up to half of the available maintenance window before the OLD process is allowed to start.

Page Zeroing – I think the best practice here would likely come from your security team.  Page zeroing is usually turned on to satisfy a specific security recommendation, so the frequency of zeroing out the deleted data may come from them.  As a rule though I would follow the 2 week checksumming guidance.

Deleted item\mailbox purges - Whilst the deleted item and mailbox purge does have a performance impact (particularly on queries of Active Directory) the process generally completes very quickly in relation to OLD and database checksumming. Ensuring that OLM completes successfully will be sufficient to ensure that the item and mailbox purge is completing.

What to Monitor

A combination of Event Viewer events and Performance Monitor data is required to monitor OLM effectively.  You might want to collect at least a couple of weeks’ data before making any major changes to OLM.

Event Logs - the following events should be monitored:

  • OLD started (Event ID: 700)
  • OLD completed (Event ID: 701 or Event ID: 703)
  • Database checksumming started (Event ID: 717)
  • Database checksumming background task completed (Event ID: 721)
  • Database zeroing background task started (Event ID: 718)
  • Database zeroing background task completed (Event ID: 722)

For example:

Event Type: Information
Event Source: ESE
Event Category: Online Defragmentation
Event ID: 703
Description:
MSExchangeIS (19052) SG01: Online defragmentation has completed the resumed pass on database 'e:\MDB01\priv01.edb', freeing 42764 pages. This pass started on 6/16/2008 and ran for a total of 124918 seconds, requiring 7 invocations over 5 days. Since the database was created it has been fully defragmented 14 times over 73 days.

Event Type: Information
Event Source: ESE
Event Category: Online Defragmentation
Event ID: 721
Description:
MSExchangeIS (6584) First Storage Group: Online Maintenance Database Checksumming background task has completed for database 'J:\sg1\priv1.edb'. This pass started on 6/19/2008 and ran for a total of 208 seconds, requiring 2 invocations over 1 days. Operation summary:
5860768 pages seen
0 bad checksums

Event Type: Information
Event Source: ESE
Event Category: Online Defragmentation
Event ID: 722
Description:
MSExchangeIS (6544) Third Storage Group: Online Maintenance Database Zeroing background task has completed for database 'J:\sg3\priv3.edb'. This pass started on 6/20/2007 and ran for a total of 369 seconds, requiring 1 invocations over 1 days. Operation summary:

5850768 pages seen
0 bad checksums
72681 uninitialized pages
4379723 pages unchanged since last zero
33759 unused pages zeroed
1210764 used pages seen
57214 deleted records zeroed
0 unreferenced data chunks zeroed
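The operation summary lines in the 721 and 722 events above all follow a "count, then label" shape, so they can be turned into a lookup table fairly easily.  This is a minimal PowerShell sketch under that assumption; the function name ConvertFrom-OperationSummary is hypothetical and the pattern is based only on the sample events shown here.

```powershell
# Hypothetical helper: turn 'N label' summary lines into a hashtable keyed by label.
function ConvertFrom-OperationSummary {
    param([string]$Text)

    $summary = @{}
    foreach ($line in ($Text -split "`r?`n")) {
        # Match lines such as '5850768 pages seen' and skip anything else
        if ($line.Trim() -match '^(\d+)\s+(.+)$') {
            $summary[$Matches[2]] = [long]$Matches[1]
        }
    }
    return $summary
}

# Example using a few lines from the Event ID 722 sample above
$sample = "5850768 pages seen`n0 bad checksums`n33759 unused pages zeroed"
$result = ConvertFrom-OperationSummary $sample
$result['bad checksums']
```

The obvious value to keep an eye on is 'bad checksums', since anything other than zero there deserves a closer look.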

You could use PowerShell to make it easier to collect the events.  Use the following to start:

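# Collect the 'OLD completed a full pass' events (Event ID 701) from the Application log
$Event701 = Get-EventLog "Application" | Where-Object {$_.EventID -eq 701}
# Export-Csv writes real CSV; the '>' redirect would only save the formatted console table
$Event701 | Export-Csv c:\temp\Event701.csv -NoTypeInformation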
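Once the events are collected, the description text itself tells you how long each pass took, which is what you need to check the 2 week guideline.  The following is a minimal sketch that pulls those numbers out with a regular expression.  The function name Get-PassSummary is hypothetical, the pattern is based only on the wording of the sample events in this post, and it assumes PowerShell 3.0 or later for [pscustomobject].

```powershell
# Hypothetical helper: extract pass duration details from an OLD, checksumming
# or zeroing completion message and check the 2 week guideline.
function Get-PassSummary {
    param([string]$Message)

    $pattern = 'ran for a total of (\d+) seconds, requiring (\d+) invocations over (\d+) days'
    if ($Message -match $pattern) {
        $days = [int]$Matches[3]
        [pscustomobject]@{
            Seconds        = [int]$Matches[1]
            Invocations    = [int]$Matches[2]
            Days           = $days
            WithinTwoWeeks = ($days -le 14)   # full pass should complete within 2 weeks
        }
    }
    # Nothing is returned when the message does not match the pattern
}

# Example using the Event ID 703 description shown earlier
$msg = "MSExchangeIS (19052) SG01: Online defragmentation has completed the resumed pass on database 'e:\MDB01\priv01.edb', freeing 42764 pages. This pass started on 6/16/2008 and ran for a total of 124918 seconds, requiring 7 invocations over 5 days."
Get-PassSummary $msg
```

Against collected events you would pass in the Message property of each entry, for example by piping the results of Get-EventLog through ForEach-Object.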

Performance Monitor - should be used to monitor the read to freed page ratio of the OLD process to ensure that it remains above 50:1 and beneath 100:1.

  • MSExchange Database -> Online Defrag Pages Freed/sec
  • MSExchange Database -> Online Defrag Pages Read/sec

(Make sure you enable the extended performance monitor counters to see the above – have a look here for more information.)

To track the progress of page zeroing, use the following counters:

  • MSExchange Database -> Online Maintenance (DB Scan) Pages Zeroed
  • MSExchange Database -> Online Maintenance (DB Scan) Pages Zeroed/sec

So in conclusion: understand the impact, monitor the main processes for a couple of weeks and tune the maintenance window until the processes are completing within best practice.  And although this information is most relevant to the bigger deployments, it’s still important for all implementations to monitor database maintenance to make sure your databases are operating efficiently.

Want more information on the above?  ..by the power of live search:

How to Monitor Online Defragmentation
Exchange 2007 Online Maintenance Database Scanning (Part 1)
Exchange 2007 Online Maintenance Database Scanning (Part 2)
Exchange Server 2007 SP1 ESE Changes - Part 1
Exchange 2007 SP1 ESE Changes - Part 2
New Mailbox Features in Exchange 2007 SP1 - Enhanced Monitoring of Online Defragmentation