More on Service Level Agreements...

My recent post on SLAs prompted some interest and comments from readers so this is a follow-up to that post.

What most people wanted was a list of some SLAs applicable to SQL Server - easier-said-than-done because a lot of SLAs depend on the application being serviced by the database. I had a poke about the web and have pulled together some examples to get you thinking about SLAs you may want to define and strive to meet, listed by category - this is definitely a non-exhaustive list!

Hours of Operation

  1. Hours that the partition/table/database is available to users. The SLAs may be different for various parts of the database, depending on what applications need access to them. Why is it important to differentiate at various granularities? For example, depending on how the unit of data is used, it may require different maintenance than other data and so knowing the availability SLA allows maintenance downtime to be planned.
  2. Hours reserved for planned downtime. Again, this may differ at various granularities of data.
  3. Amount of advance notice for extended downtime or other changes that affect users. For instance, when my bank upgraded its computer system last year, they gave a series of warnings over the preceeding few months so that people weren't surprised.

Service Availability

  1. Percentage of time SQL Server service is running and able to service connections.
  2. Percentage of time a particular partition/table/database is available for use (i.e. not exclusively locked for maintenance or restore).

System Performance

  1. Number of concurrent users the system supports.
  2. Number of transactions supported per unit of time.
  3. Acceptable level of performance, such as latency experienced by users for a variety of operations.
  4. Minimum time for an update to be replicated to various remote sites.

Disaster Recovery

  1. Time allowed for recovery from each type of failure (e.g. accidental data deletion, database corruption, SQL Server crash, OS crash, server failure, site failure).
  2. Time it takes to bring critical data online (e.g. the read/write partitions of a sales database) such that operations can continue and less critical data can be recovered later.
  3. Time taken to recover data to the point of failure.
  4. Maximum acceptable data/transaction/work loss for various kinds of failures.
  5. Maximum time for application failover to a remote server/site.

Support

  1. Methods available for application users to get help.
  2. Maximum response/resolution time from a DBA to respond to various types of problems.

Other

  1. Maximum amount of space for user tables/databases.
  2. Amount of users in specific roles.

Some other things to consider are how you define an SLA - for example, in transactions per second or in commit latency that users experience - and the interplay between SLAs - for example, the commit latency SLA may be affected if the acceptable data loss SLA is zero and a solution such as synchronous database mirroring or remote SAN mirroring is used.

Bottom line is that although it can be simple to quickly define and announce a set of SLAs for a given application, its very difficult to make sure that each is palatable to all involved, guarantee that each can be met, and allow easy diagnosis of the system to work out which component is failing when an SLA is not met. SLAs really need to be defined while a sytem is being designed as retro-fitting SLAs after-the-fact can be very time-consuming and costly.

I welcome any comments or observations on this topic - I'll post on this once I get some more feedback.