I'm very excited to see the Azure Service Fabric announcement! My blog post from 2011, "Designed for the Cloud-it is all about State!", was about the problem this technology solves. In that post I pointed out the AppFabric Container, which was where this technology was supposed to become publicly available but which never launched. I had to wait four long years to talk about this technology publicly, and now I finally can!
Many of Microsoft's cloud services run on this technology. The first that I'm aware of was either the original Azure SQL or Azure Cache (both use it, but I can't remember which launched first; probably Azure SQL). A few packaged products, such as Lync Server, use it too. The first place I saw the technology made available to public developers was ActorFx; its release notes and posts like this refer to Windows Fabric, which is the Microsoft internal API for the technology.
So why did this technology take so long to become publicly available in Azure? I think two things delayed its release:
- Distributed systems are complex and hard to debug. Many teams within Microsoft have had trouble adopting the technology over the years. Developing the code was doable, but running it in production and debugging issues was another matter. The tooling just wasn't adequate for mass adoption, and the teams who succeeded were very talented. For Azure Service Fabric, core developer scenarios were targeted and the required tooling and Visual Studio integration were developed.
- The Azure fabric's default maintenance algorithm was not very friendly to technologies that maintain state replicas, and it was very hard for teams outside of Azure to get approval to enable the alternative maintenance settings required. It was really only in the last year that this was addressed.
The Weather Ingestion Service is the largest service I developed with the technology. The service downloads weather data from several weather data providers, performs calculations to augment the data, and then publishes it to the Weather's REST Service, which is used by Bing, MSN, Cortana, Windows and Windows Phone apps, etc. We had a very tight 15-minute SLA to complete the processing and publish the data out to many Azure regions and Bing datacenters. Our initial attempt used Azure Table Storage, but we could not get adequate performance because of query and deserialization costs. We then switched to using the technology to hold the data in memory, with replicas to ensure availability. This was before the large-memory Azure roles were available, and it took 60 medium Worker Role instances to hold the data. We ran it in production for a couple of months, but management disliked the cost, so they eventually relaxed the processing SLA enough that we could use Azure Cache (the original one) instead.
The most impressive use of the technology I've seen is the service which powers Cortana. As data is pushed to Cortana, the service figures out which Cortana-enabled devices should receive it. The system is impressive not only for the sheer number of devices it pushes data to, but also for how the team used the technology to make their stateful service highly available across Azure regions. Their service simply keeps working when one of their clusters becomes unhealthy or an Azure region becomes unreachable. The remaining regions have replicas of the state and simply take on the additional load from the failed region. I doubt this initial release of Azure Service Fabric will enable this scenario, but it is nice to know that the technology is capable of it and that such a feature may become available when the tooling is ready.
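
To make the cross-region idea a little more concrete, here is a toy sketch in Python. It is not the Windows Fabric or Azure Service Fabric API, and it is not how the Cortana team built their service; the `Region` and `MultiRegionStore` names are made up for illustration. It only shows the general shape: every write is copied to each healthy region, so a read can be served by another region when the preferred one is down.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """Hypothetical stand-in for one regional cluster."""
    name: str
    healthy: bool = True
    state: dict = field(default_factory=dict)  # this region's replica of the state


class MultiRegionStore:
    """Toy model: full copy of the state in every region (an assumption)."""

    def __init__(self, region_names):
        self.regions = [Region(name) for name in region_names]

    def write(self, key, value):
        # Copy the write to every region that is currently healthy.
        for region in self.regions:
            if region.healthy:
                region.state[key] = value

    def read(self, key, preferred):
        # Try the preferred region first, then the others in their original order.
        ordered = sorted(self.regions, key=lambda r: r.name != preferred)
        for region in ordered:
            if region.healthy and key in region.state:
                return region.name, region.state[key]
        raise LookupError(f"no healthy region holds {key!r}")
```

A real system also has to deal with things this sketch ignores, such as catching a recovered region up on the writes it missed and agreeing on write ordering between regions.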