Dipping My Toes Into SQL Azure – Part 2 – Protection Mechanisms

[Image: dm_db_file_space_usage and dm_db_task_space_usage for looking at space usage]





Locks



Each lock requires a memory allocation, and SQL Azure requires a given transaction to be a good citizen. On my stand-alone test server I set the sp_configure value for locks to 1,000,000 (1 million), which is far below the current SQL Azure limit but again a good way for me to find possible lock consumers in my code.
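
If you want to try the same cap, this is roughly the configuration I mean; a minimal sketch for a stand-alone test server (the 1,000,000 value is just my test threshold, not a SQL Azure number):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'locks', 1000000;   -- test threshold only; far below the current SQL Azure limit
RECONFIGURE;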





The lock DMVs (sys.dm_tran_locks) are helpful, and the Lock:Escalation trace event is a good place to find queries that could consume a large number of locks if escalation is not achieved.
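
A quick way to see which sessions are the heavy lock consumers is a simple aggregate over sys.dm_tran_locks, something along these lines:

-- sessions holding the most locks right now
SELECT request_session_id, COUNT(*) AS lock_count
FROM sys.dm_tran_locks
GROUP BY request_session_id
ORDER BY lock_count DESC;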



I also enabled trace flag -T1211 (test system only), which disables lock escalation, to help find those queries that could consume a large number of locks.
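
On a test system the trace flag can be enabled on the fly (or added as a -T1211 startup parameter); a sketch:

-- TEST systems only: disable lock escalation instance-wide
DBCC TRACEON (1211, -1);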





Once I found these queries I revisited the batch size of the transaction and in some instances added query hints. For example, ReadTrace performs a load operation, and during the load there is no other access to the tables. I can use the TABLOCK query hint to reduce the lock memory I am using, and it does not change any concurrent access paths for my application.
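
As an illustration only (the table names here are made up, not the actual ReadTrace schema), the load looks something like this with the hint applied:

-- bulk-style load with a table lock; no concurrent readers exist during the load
INSERT INTO dbo.TraceEvents WITH (TABLOCK)
SELECT * FROM dbo.TraceEvents_Staging;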





Transaction Log Space



The databases from various tenants using SQL Azure share disk drives. It would be an inefficient system if each database got its own set of disk drives, and even if each had its own drive, allowing the log to grow unbounded would lead to out-of-space issues. The protection mechanisms in SQL Azure monitor how much log space a transaction has consumed as well as how much log can't be truncated because an open transaction is holding the truncation point of the transaction log.





Just like I did for TEMPDB, I set the log size to a fixed MAX of 512MB on my stand-alone server and set up a job to back up the log once a minute. Again, this is far lower than the current SQL Azure target but good for identifying code lines I need to evaluate. If a transaction produces more than 512MB of log per minute, my testing encounters the out-of-space errors and I can make the necessary application changes.
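
Roughly, the setup looks like this on the stand-alone server (database, file and path names are illustrative; the backup statement is what the once-a-minute SQL Agent job runs):

ALTER DATABASE MyTestDb
    MODIFY FILE (NAME = MyTestDb_log, MAXSIZE = 512MB);

-- assumes the database is in FULL recovery and a full backup already exists
BACKUP LOG MyTestDb TO DISK = N'C:\Backups\MyTestDb_log.trn';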





You can monitor the log space usage with DMVs, similar to how I described this in the TEMPDB section.
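
For the log specifically, a simple check such as the following works on a stand-alone server (on newer builds sys.dm_db_log_space_usage gives similar information):

-- log size and percent used for every database
DBCC SQLPERF (LOGSPACE);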





I also had a test run where I allowed the log to grow unbounded and monitored the size of my log backups. This accounts for situations where a transaction might not itself use a large amount of transaction log space but stays open for a long time, holding the truncation point and causing other transactions to accumulate a large amount of combined log space.





Transaction Length



Along with the transaction log space check is a transaction duration check. As described above, a long running transaction could be idle but still holding the log truncation point. I established a job that looks at sys.dm_tran_database_transactions for transactions with a database_transaction_begin_time more than 5 minutes old. When I find one I issue the appropriate KILL, triggering errors in my test run for those areas in my code that need to revisit their transaction duration and concurrency needs.
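
The core query of that job looks roughly like this; the 5 minute threshold is just my test value:

-- transactions open longer than 5 minutes and the sessions that own them
SELECT st.session_id, dt.database_transaction_begin_time
FROM sys.dm_tran_database_transactions AS dt
JOIN sys.dm_tran_session_transactions AS st
    ON st.transaction_id = dt.transaction_id
WHERE dt.database_transaction_begin_time < DATEADD(MINUTE, -5, GETDATE());
-- the job then issues KILL <session_id> for each row returned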





CPU



CPU is another resource that has to be shared on any server, and a runaway query is something that SQL Azure can prevent as well. This is yet another scenario where I can't just give you a number. If you are using CPU but it is not causing CPU contention on the system, SQL Azure may not take any action. If you are familiar with the Resource Governor behavior in SQL Server, SQL Azure behaves similarly. Resource Governor does not penalize you for using CPU; it only makes sure the contenders for CPU are treated fairly. Without getting into the details of SQL Azure's CPU activity, you can assume that a session may be terminated if the system determines it is a CPU hog that impacts the overall performance characteristics of the server.





I can't simulate this directly with Resource Governor because Resource Governor only handles fairness and does not take a KILL action on the session like SQL Azure does. Instead I created a job that uses the information in sys.dm_exec_sessions and sys.dm_exec_requests to find CPU consumers. Since SQL Azure currently sets MAXDOP = 1, you can set the same sp_configure option so you don't have to worry about parallel worker roll-ups in the queries. If the CPU consumption since the last batch start time exceeds 50% over a 5 minute period, I KILL the session. Such a query is long running, and I need to evaluate whether it can be tuned or batched in a way that prevents constant CPU usage for 5+ minutes.
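
The heart of that job is a check along these lines (the 50% and 5 minute numbers are the test thresholds I chose):

-- requests running 5+ minutes whose CPU time exceeds 50% of elapsed time
SELECT r.session_id, r.start_time, r.cpu_time, r.total_elapsed_time
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s ON s.session_id = r.session_id
WHERE s.is_user_process = 1
  AND r.total_elapsed_time > 5 * 60 * 1000          -- elapsed time is reported in milliseconds
  AND r.cpu_time > r.total_elapsed_time / 2;
-- the job issues KILL <session_id> for each offender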





Memory



A single query is not allowed to consume all the memory for the instance. Just like each database does not get its own disk drive, each database/query does not get its own memory bank. When SQL Azure determines a session is consuming detrimental amounts of memory, the session can be killed. To simulate this I set the sp_configure 'max server memory' value to 1GB. Again, this works better with functional tests, but it is a good way to find batches that need to be evaluated.
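
The setting itself is a short sketch (1GB expressed in MB; a test value only):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 1024;
RECONFIGURE;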





At a more individual, query level you can use Resource Governor to establish a maximum memory setting. I went as far as to modify the application name I was connecting with for each connection, for example Conn_1, Conn_2, …, and set up matching Resource Governor pools and workload groups so I could apply the necessary limits to help me test for SQL Azure compliance.
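
A minimal sketch of that setup for one of the connections (the pool, group and function names and the 10% limits are illustrative; the classifier function must be created in master):

CREATE RESOURCE POOL Pool_Conn_1 WITH (MAX_MEMORY_PERCENT = 10);
CREATE WORKLOAD GROUP Group_Conn_1
    WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 10)
    USING Pool_Conn_1;
GO
CREATE FUNCTION dbo.fn_ClassifyByAppName() RETURNS sysname WITH SCHEMABINDING
AS
BEGIN
    -- route each test connection to its own workload group by application name
    RETURN CASE WHEN APP_NAME() = N'Conn_1' THEN N'Group_Conn_1' ELSE N'default' END;
END;
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fn_ClassifyByAppName);
ALTER RESOURCE GOVERNOR RECONFIGURE;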



Request Limit










A request is an active command for the SQL Server (sys.dm_exec_requests). One of the mechanisms makes sure requests are progressing properly and doing meaningful work. If the number of active requests for a single database grows past a reasonable limit, requests may be killed to maintain proper capabilities. The decision is based on the number of active requests and how long transactions can remain active at the request level. The more requests there are, the shorter the transaction duration must be to maintain the same active request level. For example (let me just pick a random number), if you have fewer than 400 requests the allowable transaction duration might be 10 minutes, but if you exceed 400 active requests the allowable duration may drop to 5 minutes.





This is easy to simulate, and I added some logic to my transaction length simulation job. As the number of requests grows, I shrink the time the transaction may be active before I issue a KILL on the request. I picked some low numbers to make sure my application was processing transactions in an efficient way: 5 minutes by default, and for every 50 additional active requests I dropped the target duration of my transaction by ¼ until I reached a 30 second minimum cap for the KILL logic.
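
In T-SQL the sliding threshold looks something like this (300 seconds default, minus ¼ per additional 50 requests, 30 second floor; all of these are just the numbers I picked):

DECLARE @active_requests int, @allowed_seconds int, @steps int;

-- count the current active user requests
SELECT @active_requests = COUNT(*)
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s ON s.session_id = r.session_id
WHERE s.is_user_process = 1;

SET @allowed_seconds = 300;
SET @steps = @active_requests / 50;
WHILE @steps > 0 AND @allowed_seconds > 30
BEGIN
    SET @allowed_seconds = @allowed_seconds * 3 / 4;   -- shrink by 1/4 per 50 extra requests
    SET @steps = @steps - 1;
END;
IF @allowed_seconds < 30 SET @allowed_seconds = 30;

SELECT @allowed_seconds AS allowed_transaction_seconds;  -- the KILL job compares open transactions against this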





A Peek Inside



As I mentioned, SQL Azure is evolving and adapting to customer needs, but here are some of the messages from the common protection mechanisms currently employed.






- The session has been terminated because it has acquired too many locks. Try reading or modifying fewer rows in a single transaction.

- Session is terminated because you have a long running transaction. Try shortening your transaction.


- The session has been terminated because of excessive TEMPDB usage. Try modifying your query to reduce temporary table space usage.

- The session has been terminated because of excessive transaction log space usage. Try modifying fewer rows in a single transaction.


- The session has been terminated because of excessive memory usage. Try modifying your query to process fewer rows.

- The service is experiencing a problem that is currently under investigation.

- The database has reached its size quota. Partition or delete data, drop indexes, or consult the documentation for possible resolutions.

- The partition is in transition and transactions are being terminated.






Additional Reference: https://social.technet.microsoft.com/wiki/contents/articles/1541.aspx#Troubleshooting



Other Testing Hints and Tips





Cross Database



Since you can't issue a USE <database> statement or cross-database queries, I changed my testing system to use multiple SQL Server instances. Each instance supports only a single database, so even if my application uses DB1 and DB2 they are on separate instances and I am able to identify issues in my code.





Idle Connections



I have found that with the additional firewalls, proxies and routers involved, going external to my corporate network may also drop connections for reasons totally unrelated to the SQL Azure backend. For my testing I added a job that would KILL any session that had not issued a batch in the last 1 minute. One minute is a short window, but I found my home ISP has a rule that terminates idle connections at 60 seconds. This leads to any number of design reviews in my code and possible keep-alive activities for critical connections. (Keep-alive is expensive, so only consider this for critical connections and assume that they can be terminated by other mechanisms anyway.)
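
The idle-session check behind that job is along these lines (the 1 minute window is simply what I chose to match my ISP's behavior):

-- user sessions that have been idle for more than a minute
SELECT session_id, login_name, last_request_end_time
FROM sys.dm_exec_sessions
WHERE is_user_process = 1
  AND status = 'sleeping'
  AND last_request_end_time < DATEADD(MINUTE, -1, GETDATE());
-- the job issues KILL <session_id> for each session returned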





Random Kills



The protection mechanisms are not under your control: Microsoft updates and changes them to maintain the best system possible, and your network provider and even your network administrator can drop connections as well. A simple test is to add a random KILL job to your test suite. This simulates any number of dropped-connection issues and allows you to validate the application's recovery capabilities and handling.
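
One way to implement the random KILL, as a sketch for a test system only:

DECLARE @victim int, @cmd nvarchar(20);

SELECT TOP (1) @victim = session_id
FROM sys.dm_exec_sessions
WHERE is_user_process = 1 AND session_id <> @@SPID
ORDER BY NEWID();                       -- pick a user session at random

IF @victim IS NOT NULL
BEGIN
    SET @cmd = N'KILL ' + CAST(@victim AS nvarchar(10));
    EXEC (@cmd);                        -- simulate a dropped connection
END;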





Database Size



When you sign up for SQL Azure the contract involves a database size. Fix the database max size in your testing environment to simulate the same behavior as SQL Azure.
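
A sketch of fixing the size on a test database (the names and the 1GB value are illustrative; use the quota you signed up for):

ALTER DATABASE MyTestDb
    MODIFY FILE (NAME = MyTestDb_data, MAXSIZE = 1GB);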





Failover Simulations



SQL Azure keeps three replicas of the database to protect your data. Along with the SQL Server protection mechanisms I have described, there are additional mechanisms that check the service state, machine-level stability issues and more. These can trigger replica movement/failover actions. You can simulate this easily by restarting the SQL Server service, or an even easier way is ALTER DATABASE <database> SET RECOVERY SIMPLE WITH ROLLBACK IMMEDIATE followed by ALTER DATABASE <database> SET RECOVERY FULL WITH ROLLBACK IMMEDIATE. It is a great way to test your application's stability.
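
Spelled out, the ALTER DATABASE approach looks like this (the database name is illustrative):

-- rolls back all in-flight transactions immediately, mimicking the disruption of a failover
ALTER DATABASE MyTestDb SET RECOVERY SIMPLE WITH ROLLBACK IMMEDIATE;
ALTER DATABASE MyTestDb SET RECOVERY FULL WITH ROLLBACK IMMEDIATE;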





Max Workspace Memory



This is another place for memory consumption. Using max server memory and the memory settings of Resource Governor is a good way to find queries that may need attention. For example, if you query sys.dm_exec_query_memory_grants and a grant is greater than ~16384KB, has been held for more than, say, 20 seconds, and another worker has waited for a memory grant for more than 20 seconds, the memory-consuming query may be terminated. By setting the max server memory or workload group properties you can simulate these conditions.
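
A sketch of the check (the ~16384KB and 20 second numbers come straight from the description above):

-- large, long-held grants and workers that have waited a long time for a grant
SELECT session_id, granted_memory_kb, grant_time, wait_time_ms
FROM sys.dm_exec_query_memory_grants
WHERE (granted_memory_kb > 16384 AND DATEDIFF(SECOND, grant_time, GETDATE()) > 20)
   OR (grant_time IS NULL AND wait_time_ms > 20000);   -- still waiting for a grant after 20 seconds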



Bob Dorr - Principal SQL Server Escalation Engineer

Assistance Provided by Keith Elmore - Principal SQL Server Escalation Engineer