Best Practices for Maintaining SQL Server SAP Systems

The following recommendations are best practices for maintaining a SQL Server NetWeaver database for an SAP implementation.

The list is weighted most heavily toward Disaster Recovery.  For most of this audience, such critically important items are already well understood.  Unfortunately, that is not true for environments where DBA responsibilities continue to be overlooked for any number of reasons (e.g. job transitions, scoping gaps, work overload).  This post is aimed at helping in those situations.

The goal of this list is to help your SAP-SQL Server environment be more stable and reliable, and your end users to be happier.

Here are the Best Practice recommendations:

{NOTE – The SAP Note hyperlinks are only available if you have an S-User account for access.}

In Summary

1)  Perform regular FULL database backup

2)  Perform frequent Transaction log backups

3)  Validate your Disaster Recovery Plan

4)  Run CHECKDB

5)  Stay Current on Software Patches

6)  Make sure the system performs well

7)  Review and be Familiar with SAP Note 555223

8)  Establish Standard Operational Procedures

9)  Maintain an Operational Log Book

10) Check logs for errors

11) Baseline Performance

12) Know your chosen High Availability (HA) and Disaster Recovery (DR) solutions

13) Truly Mimic PRD with QA and/or Stress Testing Servers

14)  Back up system databases after configuration changes

15) Get End-User Experience Feedback

16) Use Cloud Virtualization

 

Details

1)   Perform regular FULL database backup

  • SAP databases should be backed up online with a FULL database backup; users and batch jobs can continue to use the SAP application without problems.  For ABAP stack NetWeaver systems, everything that is required for SAP (business data, metadata, ABAP applications, etc.) is included in one “<SID>” database.  For Java stack systems there can be some additional configuration information that is maintained at the Operating System level and should be attended to as well.
  • SQL Server backup has varying impact on CPU resources.  When a backup is executed as a single process, without parallel targets, and without compression, the use of CPU resources is quite light.  But when it is executed in parallel, with multiple targets, and using compression to save space, the impact on CPU resources will be quite significant (a minimal sketch of such a backup follows this list).  And there will always be substantial I/O bandwidth consumption, so monitoring resource consumption across backup activities is warranted.
  • To create faster online backups using SAN technology, SQL Server offers interfaces for SAN vendors to perform a Snapshot Backup or to create clones of a SQL Server database.  Backing up terabytes of data every night, however, can overload the backup infrastructure.  Another possibility is to take daily differential backups of the SAP database and to get a full database backup only on the weekend.
  • Microsoft offers a cloud-based backup strategy for your consideration too, as described HERE.
  • System databases are helpful to back up, but not entirely critical for SAP systems.  Should a disaster happen, a fresh SQL Server instance can be installed, the operational <SID> database restored, users manually created, and msdb jobs rebuilt.  It is easier if you have backups of the master database and msdb, but the critical thing is the operational database.  The msdb system database can be important from a maintenance standpoint if you are using SQL Server’s online backup functionality, since it contains the history of the FULL and LOG backup events.  The catalog of backups can be resynthesized manually, but that can call for a large amount of work if there is no systematic operational practice for backup tape management.  In either case it is important to validate your Disaster Recovery plan using whatever backup strategy is in place and to understand what is needed to return to normal operations.
  • More information on Backup Strategies can be found in SAP Note 1878886.
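  • As a minimal sketch, an online FULL backup with parallel targets, compression, and checksum verification might look like the following.  The <SID> name PRD and the backup paths are hypothetical placeholders; adjust them to your environment.

        -- FULL online backup of a hypothetical "PRD" SID database
        BACKUP DATABASE [PRD]
            TO DISK = N'R:\Backup\PRD_full_1.bak',
               DISK = N'S:\Backup\PRD_full_2.bak'  -- parallel targets increase throughput
            WITH COMPRESSION,   -- saves space and backup time, at a CPU cost
                 CHECKSUM,      -- validates page checksums while reading
                 STATS = 10;    -- progress message every 10 percent

    Note that COMPRESSION and multiple targets are exactly the options that raise CPU consumption, as described above, so schedule and monitor accordingly.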

 

 

2)   Perform frequent Transaction log backups

  • SQL Server is a write-ahead logging Relational Database Management System (RDBMS).  All data-modifying statements must be recorded first in the transaction log before the changes are persisted to the data rows inside the data files.  Because of this it is crucial to protect the Transaction Log (TLog) via backup.
  • In case of a disaster on the production server, it is vital that the most recent state can be restored using full or differential database backups plus a series of TLog backups that ideally reach as close as possible to the time of the disaster, as is roughly described here.
  • It is important to perform transaction log backups frequently.  Understand what is meant by Recovery Point Objective (RPO) and Recovery Time Objective (RTO) as they relate to business continuity; Wikipedia has a pretty good starting point on the topic HERE.  If you only create a TLog backup every four hours then, in the case of a disaster, up to four hours of committed business transactions are at risk of not being restorable.  Therefore TLog backups should be performed frequently enough to reduce the risk of losing a large number of committed business transactions and to satisfy the RPO requirements, but not so frequently that a restore event would demand an exceedingly long series of TLog files.  Most productive SAP systems find a time frame of 10-30 minutes acceptable.  In combination with SQL Server Log Shipping, some customers create SQL Server TLog backups even every two or five minutes.  The finest granularity achievable is to perform SQL Server transaction log backups scheduled by SQL Agent every minute (a minimal sketch follows this list).  Stricter RPO and RTO needs (i.e. shorter time periods) can be achieved by High Availability (HA) solutions as mentioned later, but that is outside the scope of this post.
  • Besides reducing the risk of losing business transactions, TLog backups also serve a practical function in keeping the transaction log small.  Backing up the transaction log frees up the internal Virtual Log Files so that the size can be managed and the log file is kept from becoming full.  If the <SID> database is set to the FULL recovery model, as it should be, the only way the transaction log size can be reduced is by backing up the log.
  • Another potent use of TLog backups can arise in the unfortunate and rare event that physical inconsistencies are encountered.  Full database backups copy all of the pages of the data files to the backup image, which means that if a database has physical inconsistencies in it, those inconsistencies will be written to the backup image too.  Most often the TLog is on a different physical device that is not impacted by the causative event.  In such a situation, if the TLog chain is intact, it is possible to go back to an earlier full database backup (from when the database was still completely consistent) and to apply the entire chain of subsequent TLog backups, spanning one or more of the inconsistent FULL database backups.  This can be very helpful to keep in mind.
  • SQL Server Always On Availability Groups offer a means to provide another instance of SQL Server for High Availability purposes.  But for robust DR purposes, FULL database and TLog backups should still be performed.
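  • A minimal sketch of a timestamped TLog backup, of the kind a SQL Agent job might run every 10-30 minutes.  The PRD SID and the path are hypothetical placeholders.

        -- TLog backup with a timestamped file name; schedule via SQL Agent
        DECLARE @file nvarchar(260) =
            N'L:\Backup\PRD_log_'
            + REPLACE(REPLACE(REPLACE(CONVERT(nvarchar(19), GETDATE(), 120),
                      '-', ''), ':', ''), ' ', '_')
            + N'.trn';
        BACKUP LOG [PRD] TO DISK = @file
            WITH COMPRESSION, CHECKSUM;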

 

 

3)   Validate your Disaster Recovery Plan

  • In order to validate your Disaster Recovery Plan (DRP), you need to have one first!  Do you have one?  You should.  A DRP can be created for anything of operational significance to a business, and all critical SAP systems should have DRPs established for what will happen in the event of a disaster.  A disaster can be anything from acts of nature (fire, lightning strike, flooding) to manmade events (terrorists, disgruntled employees, or simply gross human error).
  • Test It.   Testing a DRP is the only way to confirm that it works and to learn of weaknesses in the plan. 
  • Time it.   This is helpful information to be able to communicate when a Disaster is being experienced.   It is very calming to be able to state, “The system should be up and running again in about X minutes.”  That message typically lands much better than, “We aren’t certain, but we’re working hard, and it will be up as soon as it can be.”
  • Check it.  As part of your DRP validation efforts, since there is a viable second instance where the testing is happening, use that system to also run CHECKDB to check the physical consistency of BOTH the source <SID> database AND the backup image (a sketch of such a restore-and-check sequence follows this list).  Doing this on a second, separate system does not impact the I/O subsystem of the production system, which is an additional side benefit.
  • Document it.  Quality DRPs are written documents that move the DR expertise toward something the entire IT organization is responsible for.  KEEP THIS DOCUMENT UP TO DATE.  It needs to be a living document in which all changes are recorded so that all potential DRP users are made aware of them.  The DRP (and its validation) should be driven through the written document.
  • Teach it.  A good DRP can be executed by multiple individuals, so that no one person becomes a single point of vulnerability (vacation, sickness, etc.).  It can also be a great teaching tool for less experienced IT staff to learn operational procedures.
  • What is involved in a complete Disaster Recovery Plan varies along with the extensibility that SAP provides through its software.  Customers who are unfamiliar with this area should pursue outside expertise to assist them.
  • If you are working with Azure, then you can also leverage Azure Site Recovery (ASR) to help you with this effort as Cameron describes HERE.
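  • As a minimal sketch of the restore-and-check portion of a DRP validation, performed on a separate validation server.  The database name, logical file names, and paths are hypothetical; if the FULL backup was striped to multiple files, every file must be listed in the FROM clause.

        -- Restore the FULL backup without recovery so TLog backups can be applied
        RESTORE DATABASE [PRD_check]
            FROM DISK = N'R:\Backup\PRD_full.bak'
            WITH MOVE N'PRDDATA1' TO N'D:\Data\PRD_check_data1.mdf',
                 MOVE N'PRDLOG1'  TO N'E:\Log\PRD_check_log.ldf',
                 NORECOVERY, CHECKSUM, STATS = 10;

        -- Apply the TLog chain in sequence
        RESTORE LOG [PRD_check] FROM DISK = N'L:\Backup\PRD_log_1.trn' WITH NORECOVERY;
        RESTORE LOG [PRD_check] FROM DISK = N'L:\Backup\PRD_log_2.trn' WITH NORECOVERY;

        -- Bring the database online (time the whole sequence for your RTO numbers)
        RESTORE DATABASE [PRD_check] WITH RECOVERY;

        -- Consistency check of the restored image, off the production I/O subsystem
        DBCC CHECKDB (N'PRD_check') WITH NO_INFOMSGS;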

 

 

4)  Run CHECKDB

  • CHECKDB is the only tool to assess the physical consistency of the data inside of a SQL Server database.   More information on consistency and corruption can be found HERE.
  • An FAQ on CHECKDB can be found HERE.
  • Choices of frequency and location vary.  Some customers have a DRP and run CHECKDB as part of the routine validation of their DRP.  This has the benefits of moving the I/O load to another SQL instance and of also checking the backup image itself.  Other customers lack the additional hardware resources, or the system is sufficiently small, or there are periods of light load, such that they can perform the CHECKDB directly on the production instance.
  • In principle it should not HAVE to be run frequently, but the only way to know that you have a clean, consistent backup of your database that you can reliably restore from is to restore the backup image and obtain a clean, full DBCC CHECKDB on that restored image (the basic commands are sketched below).
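  • A minimal sketch of the two common invocations, again using a hypothetical PRD SID.  The full check examines logical and physical consistency; PHYSICAL_ONLY is a lighter pass sometimes used in tight maintenance windows.

        -- Full logical and physical consistency check
        DBCC CHECKDB (N'PRD') WITH NO_INFOMSGS, ALL_ERRORMSGS;

        -- Lighter page/checksum-level check for constrained windows
        DBCC CHECKDB (N'PRD') WITH PHYSICAL_ONLY;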

 

 

5)   Stay Current on Software Patches

  • Stay on reasonably current SQL Server builds.  Some customers only move to the minimum SQL Server version suggested in SAP Note 62988.  Others always apply the most recent Service Pack and Cumulative Update (CU) as they become available.  Most SAP customers have specific “Planned Downtime” periods for system maintenance and rigorous testing procedures that keep them between these two limits.  This is good.  Always stay above the minimum acceptable version.  If you are having symptoms that your own DBA or other support professionals identify as being fixed in a SQL Server build beyond the one you are presently on, move the system to the later build containing the fix, confidently knowing that it will still be supported as SAP Note 62988 describes.  (A quick query to see the build you are on follows this list.)
  • Evaluate security patches (and install them if they are necessary).  For most SAP customers, availability is the critical requirement; especially for those serving a single SAP instance globally, restarting the SAP servers to apply security patches can be unacceptable, and testing in these environments is strongly advised before installing the patches.  Therefore a common scenario for SAP customers is to carefully evaluate patches to reduce the frequency of patch installations.  Filtering unnecessary packages and disabling unnecessary services are normal and good security measures {note, however, that rolling updates with Always On or clustering can help to minimize the impact}.  If you have real-time anti-virus monitoring, keep current on the malware signatures, but it is recommended that you exclude the SQL Server database files (including data files, transaction log files, tempdb and other system database files) from real-time monitoring.  If you perform backups to disk, exclude database backup files as well as transaction log backup files.
  • Evaluate and update hardware drivers and firmware if necessary.  There have been problematic issues in hardware drivers and firmware.  It is sometimes difficult to find this kind of issue within Microsoft, and hardware providers sometimes lack support services for commodity servers without custom support agreements.  In such cases it is the customer’s responsibility to monitor and manage updates to drivers and firmware regularly.  Before updating the drivers on production servers, thorough tests should be conducted on test and sandbox systems.
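  • As a minimal sketch, the running build can be read directly from the instance and compared with the versions listed in SAP Note 62988.

        -- Report the running build, patch level, and edition
        SELECT SERVERPROPERTY('ProductVersion') AS build,
               SERVERPROPERTY('ProductLevel')   AS patch_level,  -- e.g. RTM, SP1
               SERVERPROPERTY('Edition')        AS edition;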

 

 

6)  Make sure the system performs well

  • Of course, faster is better: end users are happier and jobs finish within the expected time.  But additionally, on systems that perform well it is easier to isolate the things that are causing problems.
  • The I/O subsystem should be good: sufficiently fast, compressed, and checksum enabled (see point 16 HERE).  Read latency on data files should be less than 15 ms/read; anything greater than 20 ms/read will be noticed.  Transaction log file average write latency should be less than 5 ms/write.  More information on monitoring can be found HERE.  (A query against the file-statistics DMV is sketched after this list.)
  • CPU should have headroom.  CPU recommendations can be a little hard to establish owing to differing hardware purposes and the diversity of consolidation expectations.  A nice overview of CPUs, sockets, cores, and threads can be found HERE.  On larger systems with many cores, having individual threads intermittently consuming 100% of a CPU can be very appropriate, but as a general rough guideline the total CPU utilization should be targeted to stay below 85%.  SAP sizing tools exist to help with hardware sizing needs.
  • Network Interface Cards (NICs) should not be loaded over 50% of their theoretical capacity.  If needed, add more NICs or upgrade to faster technology.  Strive to use good quality 10 GbE cards and recent Operating System versions (Windows Server 2008 R2 or later) if any networking concerns exist.
  • Memory needs to be sufficient.  Both SAP App Servers and SQL Server database servers enjoy having a lot of physical memory.  For Application Servers, ST02 should show little sign of SWAPPING (ideally none) and the Windows page-in rate should be reasonable.  On a server that is hosting one or more application servers, the peak value of Committed Memory should not exceed the actual physical memory, or else slower paging file bytes will need to be used.  Juergen’s blog post HERE has some information on this that is still relevant.  For database servers, the average time a page stays in cache (Page Life Expectancy) should be greater than 300 seconds (5 minutes).  On virtual machine hosting environments (Hyper-V, VMware, etc.), care should be given to not overcommit memory or use dynamic memory management features, as is described in the blog post HERE.
  • ST03 overall TOTAL Dialog response time should be less than 1 second, and the database contribution should be roughly 50% of the total.
  • DBACockpit…Performance Database can be used to monitor these too.
  • One side topic, peripherally related to both performance stabilization and remote monitoring, can be found in SAP Note 2055981, which leads with the encouragement to remove Internet Explorer from a server.
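  • As a minimal sketch, cumulative per-file latency since instance start can be read from sys.dm_io_virtual_file_stats and compared against the 15-20 ms read and 5 ms log-write guidelines above.

        -- Average I/O latency per file since instance start
        SELECT DB_NAME(vfs.database_id) AS database_name,
               mf.physical_name,
               vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
               vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
        FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
        JOIN sys.master_files AS mf
          ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
        ORDER BY avg_read_ms DESC;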

 

7)   Review and be Familiar with SAP Note 555223

  • This is the main go-to note for SQL Server information from SAP’s perspective.  It has information on:
    • Credible sources of information
    • Installation and Update guides
    • Administration
    • SQL Server Supported Features and Maintenance
    • Troubleshooting Assistance

 

 

8)  Establish Standard Operational Procedures

  • SAP systems generally evolve in complexity over the years.  Without proper governance across multiple systems, things become increasingly complex to manage, and putting governance in place after years without it is far harder than establishing it early on.
  • Standardization can begin in these areas:
    • System Provisioning
    • System Administration and Architecture
    • System Ownership, authorization, and permissions
    • Data Ownership and Control
    • Interoperability
    • Disaster Recovery Planning
    • Change Management
    • Monitoring
    • Auditing

 

 

9)   Maintain an Operational Log Book

  • An Operational Log Book is simply a repository or log of ALL changes that could potentially affect a system, from the physical operating environment all the way up through end-user consumption.
  • The value of an Operational Log Book is largely constrained by the existence of Standard Operating Procedures.  When Standard Operating Procedures are absent, Change Management processes are not in place, and an operational log book of changes is typically little more than a work of fiction.
  • One of the most common scenarios encountered in support incidents is the response “Nothing was changed,” given to the question of what happened around that time.  Most commonly, sometimes after prolonged investigation, changes do come to light.  Support incidents get resolved more quickly when operational log books exist.  And most often, when they exist, there isn’t even a need to open an incident, for the problem and its resolution become readily apparent.

 

 

10) Check logs for errors

  • Inspect logs periodically to look for signs of errors.   The common logs of interest are: 
    • SAP centric: ST22, SM21, DBACockpit…Performance Overview…Alerts, DB13 job logs
    • SQL Server ERRORLOGs & CCMS CHECKDB log files (a query sketch follows this list)
    • Windows Event Logs (System, Application, and Security) & Cluster Logs (when they exist)
    • Hardware component logs
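  • As a minimal sketch, the current ERRORLOG can be searched from T-SQL, and msdb keeps a record of damaged pages.  Note that xp_readerrorlog is an undocumented but widely used procedure; its parameters here are (log number, log type where 1 = SQL Server, search string).

        -- Search the current ERRORLOG (file 0) for lines containing 'Error'
        EXEC sys.xp_readerrorlog 0, 1, N'Error';

        -- Pages that have hit 823/824 checksum or I/O errors
        SELECT * FROM msdb.dbo.suspect_pages;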

 

 

11)  Baseline Performance

  • Often overlooked by all but the best-maintained systems is the value of obtaining a performance baseline when things are running well.  It is an excellent best practice.
  • A common driver for recording baseline performance data is the need to validate Service Level Agreements (SLAs).  Usually these start at the business level with some important Key Performance Indicators (KPIs), like a maximum time for some critical batch job to complete or an average Dialog response time over some period.  These higher-level KPIs then drive the need to track lower-level, more granular SLA variables that can impact the KPIs.  And then the need to capture a performance baseline arises, to see what the normal, “good running” state looks like.
  • As much performance and load data should be obtained and retained as possible: ST03, ST02, DBACockpit, Windows Perfmon, hosting and hardware performance metrics.  These all help validate SLAs and help to identify problems in the future.  (A small counter snapshot is sketched after this list.)
  • A performance baseline should be captured again after any substantial changes are made.  Ideally the baselining should cover at least 3 different periods to establish statistical significance.  From a Statistical Process Control (SPC) or Statistical Quality Control (SQC) perspective, this baselined information provides the variance that can be used to determine when things are significantly out of control.  Analysis of trended data is always more valuable than static data.  Some customers track this information continuously (like Perfmon) and then only need to isolate the stable, good-performing periods and roll the statistical metrics up into a spreadsheet somewhere to be able to quickly compare baselines to each other.  SAP’s DBACockpit has a performance database that contains many of these metrics, but it lacks the external assessment of when things were running normally or not.  When there is no other data to work with, the performance database is usually the best starting point.
  • Having a baseline of performance avoids the generic, “The system is running slow and needs to run faster” and leads to the tangible: “The system is now running with X metric 80% slower and Y metric 50% slower.”  The latter helps clarify both the problem and the target solution state for problem resolution.
  • The experts living in a system will typically “know” this.   But the hard data of a statistically significant performance baseline is helpful to communicate that information.
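  • As a minimal sketch, a few headline counters can be snapshotted from T-SQL and persisted on a schedule.  Note that cumulative counters like Batch Requests/sec need two samples and a delta to yield a rate.

        -- Snapshot of a few headline counters; store with a timestamp to trend
        SELECT GETDATE() AS sample_time,
               RTRIM(counter_name) AS counter_name,
               cntr_value
        FROM sys.dm_os_performance_counters
        WHERE counter_name IN (N'Page life expectancy',  -- point-in-time value
                               N'Batch Requests/sec',    -- cumulative: diff two samples
                               N'User Connections');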

  

 

12)  Know your chosen High Availability (HA) and Disaster Recovery (DR) solutions

  • The Microsoft HA and DR solutions that SAP systems use to maximize uptime are:
    • Always On Availability Groups as described in Juergen’s blog series beginning HERE and in SAP Note 1772688.
    • Mirroring as mentioned in SAP Note 965908.
    • Log Shipping as mentioned in SAP Note 1101017 and Note 2021980.
    • Windows Server Failover Clustering {previously known as Microsoft Cluster Server or MSCS} in general HERE and in SAP Note 1257296.
    • Combining more than one of these solutions, for instance Mirroring and Log Shipping, as was described HERE to reduce downtime during upgrades.  Or with Always On and Log Shipping as is described HERE.
  • Know what they can do.  Know what their limits are.  Know how to respond when things break, either partially or completely.  They should be part of the DRP.  (A quick health query for Availability Groups is sketched after this list.)
  • Understand that these are solutions to provide HA.  They do NOT eliminate the need for backups, which are what provide true disaster response capability.  Regardless of what HA solution(s) are in place, continue to take backups as needed to meet Recovery Time and Recovery Point Objectives.
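  • As a minimal sketch for Always On Availability Groups, replica health can be checked with the HADR DMVs (run on the primary; the replica names are whatever your AG uses).

        -- Role and synchronization health of each replica in the AG
        SELECT ar.replica_server_name,
               ars.role_desc,                    -- PRIMARY / SECONDARY
               ars.synchronization_health_desc   -- HEALTHY expected
        FROM sys.dm_hadr_availability_replica_states AS ars
        JOIN sys.availability_replicas AS ar
          ON ar.replica_id = ars.replica_id;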

 

 

13)  Truly Mimic PRD with QA and/or Stress Testing Servers

  • Each SAP system is expected to have PRD, QA, and DEV systems associated with it; this is the normal minimum operational landscape.
  • The point here, however, is to have an actual mimic (as close as possible) of all the componentry that exists in the PRD system.  It is a wonderful but expensive option when the QA or Test system is a true duplicate of the PRD system, but sometimes the potential impact can justify the cost.
  • Stress Testing Servers are great for quantifiably assessing the impact of expected future growth.  Few customers use them, but those that do seldom run into problems where support assistance is needed.  Or if they do, it is not on production systems, so there is increased calm while investigating issues.  Perhaps they should be called “Stress-Reducing” Testing Servers.
  • Some customers also have additional systems that they explicitly use for testing immediately before moving things into PRD (e.g. PRD-1).  In such systems the Change Process surrounding the movement from QA into PRD can then be tested.  Cost is an issue.  But how critical is system uptime around planned system outages?

 

 

14)  Back up system databases after configuration changes

  • Back up the SQL Server system databases (master and msdb) after any configuration changes.
  • As mentioned in point 1, system databases are helpful to back up but not entirely critical for SAP systems: after a disaster, a fresh SQL Server instance can be installed, the operational <SID> database restored, users manually created, and msdb jobs rebuilt.  The msdb system database is important from a maintenance standpoint if you are using SQL Server’s online backup functionality, since it contains the history of the FULL and LOG backup events; that catalog can be resynthesized manually, but doing so can be a large amount of work without a systematic operational practice for backup tape management.
  • Having these backups can make restoration from disaster events go a little more quickly, but as mentioned above it isn’t all that critical, since the logins and users can be manually recreated and the jobs rebuilt as described in SAP Note 1817705.  (A minimal sketch follows this list.)
  • Whether you use backup/restore of the SQL Server system databases or not, make sure to validate your DRP to ensure complete familiarity with how to return the system to full functionality.
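  • A minimal sketch, with hypothetical target paths; run these after configuration changes such as new logins, jobs, or instance settings.

        -- Back up the system databases; INIT overwrites the previous backup file
        BACKUP DATABASE [master] TO DISK = N'R:\Backup\master.bak'
            WITH COMPRESSION, CHECKSUM, INIT;
        BACKUP DATABASE [msdb]   TO DISK = N'R:\Backup\msdb.bak'
            WITH COMPRESSION, CHECKSUM, INIT;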

 

 

15) Get End-User Experience Feedback

  • Although this item is near the end of this list, it should be first in focus, for serving users in a reliably functional, performant, and consistent manner is at the heart of why the system exists.  All of these Best Practice efforts are centered on empowering users, either directly or indirectly.
  • Intermittently, connect directly with some of the end users of the system.  It is not uncommon for them to identify significant and unexpected problems.  And frequently they are able to shed light on an item while it is still at the “nuisance” level, before it transitions into a critical situation.  Resolving such issues while still in a nuisance state is better for everyone involved.
  • When engaging with end users, it is important to approach them respectfully and with a sense of care for their experience and any input they might offer.  This is obvious to most, but unfortunately not to everyone in the role.  Insensitively or incorrectly striving to get feedback from end users can do more harm than good.  Respect, listening, empathy, understanding, caring, and follow-through are some of the helpful traits to express.

 

 

16) Use Cloud Virtualization

  • The Cloud {from Microsoft’s perspective: Azure} is here and will only expand in influence and usage.  It brings much power to the realm of system provisioning and dynamic cost reduction.  It provides the opportunity to spin systems up and down as needed, and to shift computing resources from being on-premises, IT-managed capital resources to being operational services purchased and operated externally to the business, essentially shifting the burden of IT service ownership outside the walls of the business.  More on Azure can be found on Wikipedia HERE or from Microsoft HERE.
  • If you aren’t already doing so, it is recommended to get familiar with and learn about the technology.
  • Consider moving DEV and TEST systems to the cloud as described HERE.
  • Leverage backing up SAP systems using Azure as described HERE.
  • This is a best practice because it is the most significant change to the SQL Server/Windows SAP technical arena in the past couple of decades.  The technological shift is underway and will continue to evolve and become more central to system maintenance in the future.  It is only through deep familiarity with technological advances that one can see clearly how best to leverage them.