UPDATE: This was corrected in a new build of the Preview Portal that was rolled out at the end of last week.
Along with the expected challenges of the services world (rapidly changing features and quick adoption of new services), there are some unexpected ones. One of the most recent ones I have dealt with involved some unexpected ALTER DATABASE operations coming through on customer’s Windows Azure SQL databases. As it turns out, it was a side effect of having a beta portal associated with some production Windows Azure SQL databases.
It all started a couple of weeks ago when we had a customer complaining that some of their load tests were getting interrupted by failures like this:
8/2/2012 12:56:06 PM: ** The ALTER DATABASE command is in process. Please wait at least five minutes before logging into database 'myDB', in order for the command to complete. Some system catalogs may be out of date until the command completes. If you have altered the database name, use the NEW database name for future activity.
Login failed for user 'myAdmin'.
This session has been assigned a tracing ID of 'c4a5942b-7141-446b-9227-bbf65ac00df8'. Provide this tracing ID to customer support when you need assistance.
8/6/2012 1:36:22 AM: ** Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection)
at System.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)
at System.Data.SqlClient.TdsParserStateObject.ReadSni(DbAsyncResult asyncResult, TdsParserStateObject stateObj)
at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async)
at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, DbAsyncResult result)
at System.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(DbAsyncResult result, String methodName, Boolean sendToPipe)
at System.Data.Linq.SqlClient.SqlProvider.Execute(Expression query, QueryInfo queryInfo, IObjectReaderFactory factory, Object parentArgs, Object userArgs, ICompiledSubQuery subQueries, Object lastResult)
at System.Data.Linq.SqlClient.SqlProvider.ExecuteAll(Expression query, QueryInfo queryInfos, IObjectReaderFactory factory, Object userArguments, ICompiledSubQuery subQueries)
at System.Data.Linq.SqlClient.SqlProvider.System.Data.Linq.Provider.IProvider.Execute(Expression query)
The mystifying thing was that the customer was very adamant about the fact that they weren’t issuing these ALTER DATABASE commands. We dug into the service logs and verified that the ALTER commands weren’t issued from within the service as some sort of scheduled operation done by the infrastructure. This matched my overall understanding of our infrastructure, so at this point we went back to the customer to try to figure out where in their test they were scripting the ALTER DATABASE command. We suspected a script to some degree because the ALTER DATABASE command wasn’t changing any database properties. It was just applying the properties that already existed on the database.
While still working with the customer on where in their load test they had scripted the ALTER DATABASE command, I heard from some of my colleagues that they also had customers reporting unexpected ALTER DATABASE commands. Now, at this point, I began to strongly suspect something within the Azure world. I didn’t have any strong proof because there didn’t seem to be any correlation between the customers reporting the behavior, but it was just too coincidental to not be related.
Next, we went back to our counterparts in the SQL Development team, but to no avail. Nobody was aware of any scheduled activity that could trigger this. If the SQL Development team wasn’t doing it, where else could it be coming from and still hitting multiple customers?!?!
Normally, I would start suspecting the customer was telling us what they thought was the truth (even if it wasn’t) but that we would find all of these customers had an ALTER DATABASE embedded in their deployment scripts somewhere. I didn’t really believe it since we have never had this problem in all my years of supporting Windows Azure SQL Database, but I just couldn’t come up with any other explanation. At this point, I have to admit that part of me was starting to think this was all a conspiracy to make me go crazy!
Then, we had a huge breakthrough in the investigation. One of my colleagues was working with his customer and noticed that in the middle of the test the customer went out to the Windows Azure Preview Portal and increased the number of worker roles to handle the increased load. Right after that change, he saw the ALTER DATABASE hit the customer’s server! Being the suspicious type, my colleague repeated the exercise with his customer and saw the behavior again. Aha – now we were on to something!
At this point, I strongly suspected that the customer had something in his OnStart() event for the Windows Azure role because (and sadly I quote myself here) “…there is absolutely no connection between Windows Azure role properties and Windows Azure SQL databases…”. Luckily, my colleague who was working with this particular customer didn’t trust me (see, sometimes a little doubt is a good thing!) and he worked up his own Windows Azure deployment with linked resources (going on the assumption that linking them was the only way to make deployment somehow “aware” of the database). Lo and behold – updating the properties of his worker roles triggered an ALTER DATABASE! (Thanks for doubting me, John! )
At this point, we had a pretty good idea of what was happening. The key combination is having a linked database resource and issuing a Windows Azure role property change command in the new Windows Azure Preview Portal.
Here’s how you set up a linked database. First, go into your Windows Azure deployment from within the Preview Portal and then click on Linked Resources. This will give you a dialog that looks like this:
Once you select a database, you will then see the database listed as a linked resource as shown below:
The core problem here is that the Save command triggers an execution of the ALTER DATABASE even if there aren’t any changes to the database properties!!! In hindsight, I can see what someone was trying to do when they created this Save dialog. They were trying to allow you to scale up on the fly inside the Windows Azure part of the dashboard without forcing you to go the the SQL Azure part of the dashboard to do the same.
The core problem is that the ALTER DATABASE command doesn’t inherently know there are no changes. Because it takes a database level lock, the command ends up blocking any incoming connection attempts for the length of the ALTER. If for some reason the ALTER takes a non-trivial amount of time (and it shouldn’t, but unfortunately does once in a while), it then causes failures for subsequent requests.
The solution to this is to only issue the ALTER DATABASE command if the database properties are changed and this is indeed what we are going to implement. This change should be live in the new couple of weeks. In the interim, if you run into this problem you can either refrain from making role changes via the Preview Portal (just use the original Windows Azure portal) or unlink your Windows Azure SQL database from your Windows Azure deployment.