WORST Practices for Maintaining SQL Server SAP systems (i.e. Things to AVOID)

Database and System Administrators are frequently and appropriately risk averse.  Thus the best practices are sometimes less interesting to this audience than the alternate list of things to be avoided, or “worst practices”.  There is overlap in the two lists, but the lists are not identical.  

One can think of all the tasks performed by those responsible for providing an SAP system for the business and how they go about doing those tasks.   The list varies by business and IT environment and with the skills of the individuals performing those tasks.  But if all those tasks were listed and prioritized in a spectrum from Best Practice to Worst Practice these two articles represent the two ends of that spectrum: the top and the bottom.

Inspect the Best Practices post first.  Some helpful context is offered that can add further clarity.  The rank ordering of these things to avoid comes from a combination of the significance of the impact and the frequency that the problem situation is encountered.

So, the THINGS TO AVOID when striving to maintain a SQL Server Netweaver SAP system are…

{NOTE – The SAP Note hyperlinks are only available if you have an S-User account for access.}

In Summary

1)   Deleting the Transaction Log

2)   Not Testing BACKUPS and validating the Disaster Recovery Plan

3)   Neglecting BACKUPS because of High Availability usage

4)   Using Repair Allow Data Loss (or RADL) to ‘clean up’ corruption

5)   Allowing the TLog to Fill up

6)   Not managing your data files properly

7)   Directly modifying the PRD system

8)   Not monitoring logs for significant errors

9)   Neglecting to review and be aware of the SAP Note 555223

10)   Not Maintaining an Operational Log book

11)   Operating without governance and control.

12)   Modifying the environment without being aware of impact

13)   Defragmenting or Rebuilding Indexes

14)   Manually Updating Statistics

15)   Staying on old versions of software

16)   Not engaging End-users

 

 

Details

1)   Deleting the Transaction Log (TLog)

  • Unfortunately, inexperienced individuals can mistakenly delete the transaction log of a SQL Server database. This usually happens either by incorrectly moving the transaction log file or more commonly by an incorrect response to a “Transaction Log is Full” error message.  The Transaction Log should not be deleted
    except under extreme situations from the explicit advice of a SQL Server support expert.
  • SQL Server is a Write-Log-First (aka Write Ahead Logging or WAL) Relational DataBase Management System (RDBMS).  Every committed data manipulation statement (Insert/Update/Delete) will result in a synchronous write to the transaction log before the application will receive an acknowledgement that the statement has been completed.  It is the Transaction Log (TLog) of SQL Server that provides the ACID guarantee of any committed transaction.
  • On SQL Server startup or when a database goes through recovery, the transaction log is read and any UNDO and REDO work is performed to re-establish full transactional consistency in the database.  More information on recovery and UNDO and REDO can be found HERE.
  • While SQL Server is running and the database of interest {like an SAP <SID>} is open the TLog can’t be easily truncated.   Transactional application consistency is very important to an enterprise database like SAP and without transaction log integrity one cannot be sure of the state of the data that the database is managing.
  • More information on consistency and physical inconsistency can be found HERE.

 

2)    Not Testing BACKUPS and validating the Disaster Recovery Plan (DRP)

  • All customers need to take backups and should take TLog backups on PRD too.
  • Most customers correctly do so.
  • Only a portion of the customers that perform backups, however, actually take the time to restore the backup onto other hardware and get a full CHECKDB on that restored image.  It has happened, on rare occasion, that backups are working successfully while writing data through to faulty devices such that the backup images are themselves corrupt and cannot be restored.   You don’t want to be a customer needing to restore your backups, but having no viable backup to restore from.
  • And only an even smaller portion of those, do that testing within the context of a broader Disaster Recovery Plan validation effort.
  • GET BACKUPs {FULL, optionally Differential, and TLOG}
  • TEST BACKUPs to make sure they are able to be restored
  • Check them with CHECKDB to be sure they are consistent
  • Work this up as part of a broader DRP validation effort.

 

  

3)    Neglecting BACKUPS because of High Availability (HA) usage

  • Many customers choose to implement SQL Server High Availability Features such as Always On, Mirroring, Log Shipping, or Windows Failover Cluster.   Great solutions that provide sound value. 
  • Unfortunately, some fail to realize that any of these HA solutions do not reduce the need for Disaster Recovery planning.  That is database backups, both FULL and TLog, still need to be performed.

 

  

4)    Using Repair Allow Data Loss (or RADL) to ‘clean up’ corruption

  • RADL is discussed HERE.   Occasionally cases come in where RADL has already been used and assistance is being sought when no backup exists.   This is a sad situation.
  • Also read the Consistency and Corruption blog post listed HERE.   The impact to Logical inconsistency should be thoroughly understood.
  • RADL should only be used as a last resort and only after review and from the recommendation of trained support professionals.

 

  

5)    Allowing the TLog to Fill up

This is serious in so far as the number of customers that allow it to happen.  It leads to a work outage.  It is usually sourced in 3 main reasons:

  • TLog backup job failure
  • Database maintenance work that didn’t take into account the impact to the Transaction Log
  • Long running open transaction preventing the transaction log to be truncated

 

 

6)    Not managing your data files properly

  • The database files should be laid out as planned in the SAP installation guides.
  • SQL Server Trace Flag 1117 should be set so all data files are grown at the same time as is referenced HERE.
  • Disk space should be monitored to make sure that there is always sufficient space to grow.
  • Single data files should generally not be added after initial creation since most new data will go into that empty file and the data files will cease to grow proportionately over time.   If an additional file is needed, then one option is to grow all of the existing files so that there is X amount of free space in each of the files and add a new file with that same amount of X free space.  Alternatively, if this can’t be done, an SAP homogeneous system data export followed by a re-architecting of the files as desired and then a re-import of the data will work to achieve a final data balanced situation.   This latter process can be fairly time consuming.

 

7)    Directly modifying the PRD system

  • This is typically a sign of inadequate system administration.  
  • SAP advocates, and most customers have, a QAS and DEV system aligned with each PRD for a particular system.  But a surprising number of customers move changes into PRD without testing or worse directly modify PRD itself.  Striving to maintain such a system is akin to the task of attempting to herd cats.
  • All code should be run through QAS first, before execution happens on PRD.
  • And preferably, when it exists, a STRESS Server too.  
  • And the best maintained system will even have a PRD-1 system where they can actually duplicate the whole process of the actual rolling of changes into PRD before that  actually happens.

 

8)    Not monitoring logs for significant errors

  • There are many logs available throughout the technological stack that should be reviewed.  It is a bad thing to not know where these are located and to occasionally review them:
    • ST22, SM21, RZ20, DBACockpit, SQL ERRORLOGs, and the CCMS DBCC CHECK log file
    • Windows Event logs {Syst., App., & Sec.}, and Cluster logs (when they exist)
    • hardware, and hosting system logs

 

  

9)    Neglecting to review and be aware of the SAP Note 555223

  • There is good information on:
    • Installation/Upgrade/Administration
    • Features and Maintenance
    • Troubleshooting

Especially the reference to SAP Note 2093615, to get the fastest resolution to your SQL Server incident

 

 

10)   Not Maintaining an Operational Log book

  • At least for every PRD system that is significant for the business.
  • All changes should be logged.  As many as might be impactful.

 

 

11)   Operating without governance and control.

  • Chaos and random events are inevitable.  But a great deal of human sourced unexpected events can be avoided by control and standard operating procedures.  When these are lacking, then systems have increased potential to be impacted by those with less skill or from incorrect decisions being made in any given moment.
  • This is a rather significant topic and one that belongs in the upper part of this list.   But there are, unfortunately, extenuating circumstances that lead to this being a known and tolerated situation.

 

12)   Modifying the environment without being aware of impact

  • Although this is related to point “7)” and could be covered by it, it is a little different.  These are generally external things that typically arrive by corporate mandated standards or via implementing 3rd party SW solutions.
  • For various reason, such things seem to sometimes fall outside of the normal scope of testing to understand the complete impact to DR and performance.
  • Effort should be made to inclusively test these environment influences.

 

 

13)   Defragmenting or Rebuilding Indexes

  • Many SAP application indexes can be highly fragmented.  That is fine and to be expected.   Rebuilding of indexes, by and large, doesn’t need to be done on an SAP system.
  • As a general rule, it is a practice to avoid in rebuilding indexes to eliminate the fragmentation.   It will be able to demonstrate an immediate performance benefit.  But over time the state of the table will generally just return back to its expected fragmented state.
  • Most SAP tables and indexes where fragmentation is observed do not need to be defragmented or rebuilt.  
  • The focus should be to center on performance.  In most cases an additional secondary (or search) index will correct the performance problem.
  • If in doubt, open a performance issue with SAP.

 

 

14)   Manually Updating Statistics

  • Microsoft SQL Server has an automatic update statistics feature that works quite well.   SQL Server is targeted to be easy to use and to not need much administrative attention.  Manually Updating Statistics generally doesn’t need to be done.
  • SQL Server caches plans to reduce the compilation time to generate the plan.

Sometimes a plan can exist in cache that was properly built for an initial set of parameters that is not the best solution for the same statement coming in with a different set of parameters at a later time.   In such a situation updating statistics will generally fix the problem.  But note, this is more commonly NOT because the statistics were wrong or even outdated but because the parameters were simply atypical and not fitting.  By updating the statistics, the cache for that table(object) gets invalidated and a new plan will be rebuilt.  When suspecting that old/stale/wrong statistics might be a problem:

    • Isolate the query
    • Get timing performance metrics on i
    • Empty the cache using either sp_recompile or DBCC FREEPROCCACHE, but do not update statistics.   {NOTE – sp_recompile is good in that it will invalidate the plans on only one specific object/table.  But it can sometimes get blocked on very heavily accessed tables and not lead to invalidating the problematic plan.   FREEPROCCACHE will not get blocked, but it will invalidate all of the statement plans for all objects on the server.   So the difference is between a small, precise “Rock Hammer” and a “Big Ol’ Sledge Hammer”}
    • Repeat the problematic query at least three times to and get timings again
    • Then update statistics and repeat the timings a third time

This should help you know if you really have a statistics problem or not.

  • Before considering manually updating statistics: if a large table is involved consider the trace flag 2371, which is recommended and is described HERE.  If BW is involved, make sure that you are on recent versions of SAP support packages as mentioned in SAP Note 783750.  Also check for SAP Notes for the involved table or function.  If you still see some specific need that warrants manually updating statistics, open an SAP performance issue with BC-DB-MSS.   SAP is infinitely extendable and there can be reasons where this might be of assistance, but by-and-large it shouldn’t be needed.

 

 

15)   Staying on old versions of software

  • It takes time and resources to test and go through upgrades.   The frequency these need to be performed is something that each organization determines on its own.  
  • But staying at the very tail end of supported versions is generally a bad policy to follow.  Specifically:
    • Stability enhancements
    • Performance improvements
    • Bug fixes

All these things generally advance as code moves chronologically forward. This is regardless of what additional features are offered.

  • If a fix is called for, in SQL Server, it is best to be on the recent version of the product for that is the code line where developers will have the most familiarity and a fast response will most likely be realized.
  • Within reason, strive to stay current on your software patching.

 

 

16)   Not engaging End-users

  • In other words, spend some time in dialog with end-users, if it is permitted.
  • It is a worst practice to stay isolated entirely within the realm of the technological stack that makes up the SAP-SQL Server system.   Although it is fitting and right to spend most time and effort where ones skills are needed.   The end users that are consuming the provided services need to be remembered and actively engaged.
  • Not engaging end-users can lead to unnecessary worsening of incidents, because problems weren’t identified as early as they might have been.
  • Overall perception of the system functionality is able to be affected by human interaction.  Both positively and negatively.   On engaging with end-users it is usually best to consider them with much respect and with fully engaged listening skills.  Engaging end-users with arrogance, defensiveness, and a lack of empathy and sympathy will usually do more harm than good.