Transient Connections in Windows Azure: Useful Resources

This is a collection of resources I have encountered while researching the problem of intermittent connections in Windows Azure. I will be writing about how these affect not just Entity Framework, but Windows Azure more generally.

"Transient connections" is a polite way of saying "dropped connections". These are of course a fact of life when developing distributed applications: a wide variety of network errors can occur. However, there are additional causes of dropped connections in Windows Azure, and not all of these additional causes are bad things.

Here are the main causes of dropped connections:

  • the usual network errors that are inherent in distributed applications
  • SQL Azure throttling exceptions: because SQL Azure is a multi-tenant database service based on shared resources, it will throw exceptions at any application that uses "too many" resources. SQL applications that hog resources are generally a bad thing, and heavy resource use is often a symptom of a badly written query. Dealing with SQL Azure throttling exceptions is pretty similar to applying best practices to any distributed SQL application.
  • Windows Azure has extensive capabilities for reliability, load balancing, and scaling: these are in fact some of the major benefits of using Windows Azure, and were high-priority design goals for the service. One implication of this architecture is that there are replicated instances of your application, and that sooner or later one of these instances will be taken down (by hardware failure, database sharding, or resource re-balancing). Ideally these actions would be invisible to your application, but at this point in time your application must anticipate dropped connections and be able to reconnect.
  • SQL Azure "sharding" aka the Federations feature: contemplate what happens when your app is executing a query while a partition is being "re-sharded"; or if you are running a multi-partition query while re-sharding happens, and some of the partitions are in the old state, and some in the new state. I will be researching this scenario more in the near future: I am just calling it to your attention to think about, as a source of interesting query behavior.
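Whatever the cause, the application-side response is much the same: catch the failure, wait a little, and try again. Here is a minimal sketch of that idea in Python, chosen only for brevity. The names `TransientError` and `run_with_retry`, the attempt count, and the delays are all my own assumptions for illustration and are not taken from any of the frameworks listed in this post.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for a failure that is worth retrying."""


def run_with_retry(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on TransientError, wait and try again.

    The wait doubles after each failed attempt, plus up to 10% random
    jitter so that many clients do not all retry at the same instant.
    The final failure is re-raised so the caller still sees it.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 10))
```

In a real application `operation` would open a fresh connection and run the query, so that a retry does not reuse the connection that was just dropped.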

Here are some resources that I have found useful.

SQL Azure Connection Management: provides a great overview, as well as detailed listings and explanations of the individual exceptions you may encounter.

SQL Azure and Entity Framework Connection Fault Handling: very detailed explanation with code samples of how to handle the exceptions encountered when using the ObjectContext class in Entity Framework. Uses a retry framework described in the next resource.

Best Practices for Handling Transient Conditions in SQL Azure Client Applications: describes a framework that can be used to implement retry logic in SQL Azure; the preceding resource uses this framework.

Transient Fault Handling Framework: the source code for the retry framework described in the preceding resource.

Minimizing Connection Pool errors in SQL Azure: illustrates how to handle dropped connections.
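A common design in retry code is to keep the question "is this error transient?" separate from the retry loop itself, so the list of retryable errors can change without touching the loop. The sketch below shows one way that check might look, again in Python for brevity. The `SqlError` class and `is_transient` function are hypothetical, and the error numbers are ones commonly cited as transient for SQL Azure; treat the set as an assumption and verify it against the resources above.

```python
# Error numbers commonly cited as transient for SQL Azure.
# This set is an assumption for illustration, not an authoritative list.
TRANSIENT_ERROR_NUMBERS = {
    40501,  # service is busy (throttling)
    40613,  # database is not currently available
    40197,  # service error while processing the request
    10053,  # transport-level error: connection aborted
    10054,  # transport-level error: connection forcibly closed
}


class SqlError(Exception):
    """Hypothetical stand-in for a database exception carrying an error number."""

    def __init__(self, number, message=""):
        super().__init__(message)
        self.number = number


def is_transient(error):
    """Return True when the error carries a number from the transient set."""
    return getattr(error, "number", None) in TRANSIENT_ERROR_NUMBERS
```

A syntax error in a query, for example, would not be in the set, so it would fail immediately instead of being retried pointlessly.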

I've found the postings from members of the Windows Azure Customer Advisory Team ("CAT") particularly valuable, since they are based on creating solutions to actual customer problems. Other sources of information include Bing, the Windows Azure Platform forums, etc.