Takeaways from running a Windows Azure SQL Server VM (IaaS)

I have been running an extra-large Windows Azure SQL Server VM (A7) for half a year.  Here, I list a number of recommendations based on that experience.

Performance

Here I list a number of actions I took to improve performance.  My app is a 24/7 data pipeline: I collect data from many production clusters and insert it into the DW.  At the same time, we run alerting on top of it (scanning the most recently inserted rows to see whether something bad has happened), plus daily reports using SSIS.  For me, the big concern is making sure the VM is always up and that I can always insert data.  On the client side, I did some work to add retry logic, such as in-place retry three times (i.e., if a query times out, I just retry it) and offline retry (i.e., writing the failed SQL statement to a log file and retrying repeatedly for up to an hour until it succeeds).  The in-place retry idea is sketched below.
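As a rough illustration only, the in-place retry can be expressed as a T-SQL loop like the one below; the table dbo.Events and the payload variable are hypothetical.  Note that a client-side query timeout aborts the whole batch, so timeout retries ultimately have to live in the client; this pattern covers transient server-side errors such as deadlocks.

    -- Minimal sketch of in-place retry (hypothetical table dbo.Events).
    DECLARE @payload NVARCHAR(MAX) = N'...';
    DECLARE @retry INT = 3;
    WHILE @retry > 0
    BEGIN
        BEGIN TRY
            INSERT INTO dbo.Events (EventTime, Payload)
            VALUES (SYSUTCDATETIME(), @payload);
            SET @retry = 0;              -- success: leave the loop
        END TRY
        BEGIN CATCH
            SET @retry -= 1;
            IF @retry = 0 THROW;         -- out of retries: surface the error
            WAITFOR DELAY '00:00:05';    -- back off before trying again
        END CATCH
    END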

As you can see, the performance of the system is very important.  So far, I am pretty happy with the availability of the VM, but I did quite a bit of performance tuning to get the best throughput.

  • Turn on data compression on large tables.  One of the biggest issues with IaaS is that disk I/O is much slower than on a regular machine, since each disk is actually a blob in Windows Azure storage.  I enabled data compression on almost all tables (for simplicity); as a result, I get a compression ratio of 5 to 10 on average, and it greatly improves the performance of my system.  Side note: how do you know you have an I/O bottleneck?  Look at the SQL Server errorlog; if you see messages like "SQL Server has encountered XX occurrence(s) of I/O requests taking longer than 15 seconds", that is an indicator of an I/O bottleneck.  (See the sketch after this list.)
  • Create many VHDs and use multiple filegroups across these VHDs.  The principle is pretty simple: create more disks and spread requests across them to reduce the I/O load on any single disk.  Right now, Windows Azure only lets you create VHDs from one storage account, so this does not solve the problem of storage account throttling.  The downside of this approach is that you need to manage many disks, and there is no way to look up which VHD maps to which disk, so I suggest you write the mapping down when you create each VHD.
  • Turn off VHD read and write caches.  This is the default option when you create a VHD, so you don't need to change it.  If you do want to change the setting, shut down SQL Server first, since you may otherwise hit disk I/O problems that cause SQL Server to run recovery on your databases, which takes much longer.
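As a rough sketch, estimating the savings and then enabling page compression on a table looks like this (dbo.MyTable is a placeholder name):

    -- Estimate how much page compression would save before committing to it.
    EXEC sp_estimate_data_compression_savings
        @schema_name      = 'dbo',
        @object_name      = 'MyTable',
        @index_id         = NULL,
        @partition_number = NULL,
        @data_compression = 'PAGE';

    -- Rebuild the table (heap or clustered index) with page compression.
    ALTER TABLE dbo.MyTable REBUILD WITH (DATA_COMPRESSION = PAGE);

And spreading a filegroup over files on several VHDs might look like the following, where MyDW and the F: and G: drive letters are placeholders for your database and data disks:

    -- One file per VHD so the I/O load is spread across disks.
    ALTER DATABASE MyDW ADD FILEGROUP DataFG;
    ALTER DATABASE MyDW ADD FILE
        (NAME = DataFG_1, FILENAME = 'F:\Data\DataFG_1.ndf', SIZE = 50GB),
        (NAME = DataFG_2, FILENAME = 'G:\Data\DataFG_2.ndf', SIZE = 50GB)
        TO FILEGROUP DataFG;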

There are a few other optimizations I did which might not apply to you, but I list them here for your reference:

  • Change the recovery model to simple.  I hit several cases where the log file filled up and could not grow (see the reason in the Backup/Restore section), so I decided to change the recovery model to simple.  I suggest you research the drawbacks of this approach online before deciding whether to use it.
  • Enable read_committed_snapshot on the database.  See https://www.mssqltips.com/sqlservertip/2977/demonstrations-of-transaction-isolation-levels-in-sql-server/ for an explanation.  In my case, report queries that sometimes touch many rows no longer block my insert queries.
  • Create a large tempdb file up front (note that tempdb always runs in the simple recovery model).  These settings are sketched below.
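Assuming a database named MyDW (a placeholder), these settings translate roughly to:

    -- Simple recovery model: the log truncates at checkpoints, but you give up
    -- point-in-time restore, so weigh the trade-off first.
    ALTER DATABASE MyDW SET RECOVERY SIMPLE;

    -- Read-committed snapshot isolation: readers no longer block writers.
    -- WITH ROLLBACK IMMEDIATE disconnects other sessions so the change applies.
    ALTER DATABASE MyDW SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

    -- Pre-size tempdb so it doesn't pay autogrow costs under load.
    ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, SIZE = 10GB);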

 

Backup/Restore

I will write another blog post to show how you can regularly back up database files to Azure and how you can restore backup files to a VM (it is not as trivial as you might think).  For now, I list a few recommendations.

  • Back up to a different storage account.  The reason is that if you use the same storage account, a backup job writing to that account might throttle your disk I/O on the other VHDs.
  • Make sure SQL Agent is running.  When you stop SQL Server, SQL Agent is stopped as well, but it will not start automatically when SQL Server starts again later.  It is important to make sure SQL Agent is running.
  • Back up to a blob URL directly instead of backing up to a file on a VHD, because a VHD can only be used in the same VM unless you detach it.  (See the sketch at the end of this section.)
  • Use backup compression to save bandwidth.
  • Turn on instant file initialization!  This is the first thing you should do.  Why?  Here is the explanation I was given:

If restore is slow, the first thing to be sure of is that instant file initialization is turned on. See this BOL article for instructions: https://technet.microsoft.com/en-us/library/ms175935(v=SQL.105).asp

If restore is still slow with instant file initialization turned on, look at the size of the log file on the instance where the database was backed up. If it is very large (multiple GBs), it would be expected that restore would be slow. During restore the log file must be zeroed which takes a significant amount of time. If that is not the case, turn on traceflag 3051 when doing the restore to generate a detailed log from the BackupToUrl process. This log file will be put into the errorlog directory and will be named something like BackupToUrl*.log. In that log file we output information about each round trip to Windows Azure Storage including timing information that can be helpful in diagnosing the issue.

Please note that this applies to any operation that creates files on the VHD, so it is important to turn it on.  I had one case where it took 36 hours to create a database with a 200 GB log file.  It hurts file autogrow performance as well.
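As a rough sketch, backing up straight to a blob URL with compression looks like the following; the credential name, storage account, and container are placeholders:

    -- One-time setup: a credential holding the storage account name and key.
    CREATE CREDENTIAL BackupCred
    WITH IDENTITY = 'mybackupaccount',    -- storage account name
         SECRET   = '<storage access key>';

    -- Back up directly to a blob in a *different* storage account, compressed.
    BACKUP DATABASE MyDW
    TO URL = 'https://mybackupaccount.blob.core.windows.net/backups/MyDW.bak'
    WITH CREDENTIAL = 'BackupCred',
         COMPRESSION,
         STATS = 5;                       -- progress message every 5%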

Security

  • Disable the SA account.  You need to disable and rename the sa account, since attackers routinely try to access your database through it.  (See the sketch after this list.)
  • Only allow certain IPs to access the server.  If possible, try hard to limit the set of IP addresses that can access your SQL Server.  The reason is that your VM is internet-facing, and anyone can reach it.  You can also consider changing the default port of SQL Server so that port scanners will not find your SQL Server on the internet as easily.
  • Limit Remote Desktop connections.  I did this in the Windows Azure Management Portal so that only a limited IP range can log in to the machine remotely.
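Renaming and disabling sa is one statement each (the new login name is a placeholder):

    -- Rename and then disable the built-in sa login so brute-force
    -- attempts against the well-known name fail.
    ALTER LOGIN sa WITH NAME = [dw_admin_retired];
    ALTER LOGIN [dw_admin_retired] DISABLE;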

Monitoring

  • Monitor your storage account throttling in the portal.  You should enable monitoring for your storage account in the Management Portal; it lets you see trends in network I/O and request counts.  The most important part is throttling: you can define an alert (for example, at 0.1%) for when your account gets throttled.  I find it useful, and I usually receive an alert every two days.
  • Of course, you can also build telemetry to monitor the data insertion rate and insertion errors.  I record all of this in the database and in the log, so that I know what happened and how severe the issue is.  (A minimal sketch follows this list.)
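A minimal sketch of such a telemetry table, with all names hypothetical:

    -- One row per insert batch, recording success or failure.
    CREATE TABLE dbo.InsertTelemetry
    (
        LoggedAt  DATETIME2      NOT NULL DEFAULT SYSUTCDATETIME(),
        Source    SYSNAME        NOT NULL,  -- which production cluster
        RowCnt    INT            NOT NULL,  -- rows in the batch
        Succeeded BIT            NOT NULL,
        ErrorMsg  NVARCHAR(4000) NULL       -- populated on failure
    );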

Manageability

  • How to restart SQL Server safely.  There are several cases where I need to restart SQL Server while my collection agent is still inserting data into my database.  If you can stop the workload temporarily, that is the ideal case.  If not, you can issue a couple of "checkpoint" commands to make sure the changes are flushed, and then quickly run net stop mssqlserver in a command window (see the sketch at the end of this section).  Why do you need this?  Because otherwise you may get a very long recovery time (it took 30 minutes to an hour for me to recover my database).  Also remember to restart SQL Agent afterwards.
  • How to troubleshoot performance issues.

You can also use any of the widely available troubleshooting scripts.  I personally use Activity Monitor for some issues.
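For reference, the safe-restart sequence sketched above is just:

    -- Flush dirty pages so crash recovery has less work to do.
    CHECKPOINT;
    CHECKPOINT;  -- a second pass catches pages dirtied during the first

followed by net stop mssqlserver from an elevated command prompt, then net start mssqlserver and net start sqlserveragent to bring everything back (SQL Agent will not restart on its own).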