SXP - One year later

One year ago this month, we launched our first Windows Azure platform-based web service, the Social eXperience Platform (SXP). The Windows Azure platform had just launched, so we were entering uncharted territory. Looking back over the last 12 months, it has been a great experience, and the Windows Azure platform has allowed us to realize many of the promises of cloud computing.

Scale
When we launched, we were serving about 10K requests per day. We are now averaging over 450K requests per day with a peak of 687K requests in a single day. More importantly, we have the capability to deliver at least 2M requests per day without having to increase the number of instances.
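To put those daily figures on a common footing, here is a small back-of-the-envelope sketch in Python. It only uses the numbers quoted above; the helper name `per_second` is ours, and the per-second rates are simple daily averages that say nothing about bursts within a day.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(requests_per_day: float) -> float:
    """Convert a daily request count into an average requests-per-second rate."""
    return requests_per_day / SECONDS_PER_DAY


# Figures quoted in the post ("K" and "M" values are rounded).
launch_daily = 10_000
average_daily = 450_000
peak_daily = 687_000
capacity_daily = 2_000_000  # stated lower bound on capacity with current instances

# How much traffic has grown since launch, and how far the busiest day
# so far sits below the stated capacity.
growth_factor = average_daily / launch_daily
peak_headroom = capacity_daily / peak_daily

if __name__ == "__main__":
    print(f"average: {per_second(average_daily):.1f} req/s")
    print(f"peak day: {per_second(peak_daily):.1f} req/s")
    print(f"growth since launch: {growth_factor:.0f}x")
    print(f"headroom over peak day: {peak_headroom:.1f}x")
```

In other words, average traffic has grown 45-fold, and the busiest day so far used roughly a third of the capacity we believe the current instances can deliver.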

Agility (part 1)
Windows Azure has made it easy to add functionality to SXP as the business has needed it. SXP has about 4X the lines of code today compared to version 1, which shipped just under a year ago. We have deployed 10 separate times in a year and haven’t incurred any downtime (measured at the transaction level) during any of our upgrades. With the exception of some learning curve issues, all of our deployments have gone very smoothly, even more smoothly than most of our on-premises deployments. This has given us the confidence necessary to deliver quick updates for the business.

Agility (part 2)
When we designed SXP, we designed it as a multi-tenant, cloud-based service. Our goal (hope!) was that other properties across Microsoft would adopt SXP. We like to talk about moving “capabilities” to the cloud in addition to moving applications to the cloud. SXP is an example of delivering a capability via the cloud, which lets other teams build composite applications that leverage our service. We have grown from one tenant when we launched to over 35 tenants, with over a dozen more in the pipeline. Some of our tenants are using SXP in ways that we never dreamed about. The Windows Azure Platform has been a big enabler of this agility.

Elasticity
As we’ve on-boarded new partners, we’ve had three separate occasions where traffic “doubled overnight”. The Windows Azure Platform allowed us to scale up in anticipation of the increased traffic and to scale back days or even hours later. So far, we’ve had the headroom to handle the increased load, but it’s nice to know that we can very easily add and remove capacity as load changes.
With Windows Azure compute, we were able to double our web service tier capacity in a matter of minutes. Once we were confident we could handle the increased load with fewer web instances, we scaled back to the correct level. It only took minutes to scale back down, and the full retail price of the additional capacity was under $100. Compare this to having to order servers or even provision VMs, and the increased agility and decreased cost of the elasticity are dramatic.

Performance
SXP’s performance continues to be excellent, with average server response times below 100 ms. We have a separate service that monitors SXP from within the same data center. The service sends a synthetic transaction to SXP every 5 seconds and records the availability and performance of the service. Since we are within the same data center, there are no networking issues involved. Our average ping response for the year is just above 10 ms.
SXP is a global service, but there is some US business hour bias to our load. We notice a slight increase in ping response times during prime US business hours, but the average is still around 12 ms. While this is 20% longer, it is only a 2 ms delta, which seems reasonable given the increased load.
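The monitoring approach described above can be sketched in a few lines of Python. This is a minimal illustration, not our actual monitoring service: `ProbeStats` and `run_probes` are hypothetical names, and the probe itself is passed in as a callable (in practice it would issue the synthetic transaction against SXP and raise an exception on failure).

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ProbeStats:
    """Accumulates results of synthetic transactions (hypothetical helper)."""
    latencies_ms: List[float] = field(default_factory=list)  # successful probes only
    failures: int = 0

    @property
    def attempts(self) -> int:
        return len(self.latencies_ms) + self.failures

    @property
    def availability(self) -> float:
        """Fraction of probes that succeeded (0.0 if none were attempted)."""
        return len(self.latencies_ms) / self.attempts if self.attempts else 0.0

    @property
    def average_ms(self) -> float:
        """Mean latency of successful probes in milliseconds."""
        if not self.latencies_ms:
            return 0.0
        return sum(self.latencies_ms) / len(self.latencies_ms)


def run_probes(probe: Callable[[], None], count: int, interval_s: float = 5.0,
               sleep: Callable[[float], None] = time.sleep) -> ProbeStats:
    """Call `probe` `count` times, waiting `interval_s` seconds between calls.

    A probe that raises any exception is counted as a failure; otherwise
    its wall-clock duration is recorded.
    """
    stats = ProbeStats()
    for i in range(count):
        start = time.perf_counter()
        try:
            probe()
        except Exception:
            stats.failures += 1
        else:
            stats.latencies_ms.append((time.perf_counter() - start) * 1000.0)
        if i < count - 1:
            sleep(interval_s)
    return stats
```

Passing `sleep` as a parameter is a design choice that makes the loop easy to exercise in a test without actually waiting 5 seconds between probes.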

Availability
This has been one of the best surprises from our move to Azure. We measure availability at the transaction level and since going live, we have delivered over 4 9s of availability at 99.9957%. To put things in perspective, we’ve had 2,543 errors out of 59M requests. Of those errors, 1,030 occurred during a 1-hour window and over half occurred during a 2-hour window. Neither of those two hours of major downtime was caused by an Azure-specific issue.
The Azure team really deserves a lot of kudos for this as Azure is a v1 product that only shipped a few weeks before we went live. They are solving some pretty tough problems, so I was willing to cut them some slack early on. Fortunately, they over-delivered. Now we’re pushing them to help us deliver 5 and 6 9s availability.
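The availability figure can be reproduced from the error and request counts above. The following Python sketch is only illustrative: the function names are ours, "59M" is a rounded number, and the "without the worst hour" case simply drops that hour's errors without adjusting the request count, which is a simplification.

```python
def availability_pct(errors: int, requests: int) -> float:
    """Transaction-level availability as a percentage."""
    return 100.0 * (1 - errors / requests)


def error_budget(requests: int, nines: int) -> int:
    """Maximum number of failed requests allowed at the given number of 9s.

    For example, nines=5 means 99.999%, i.e. one failure per 100,000 requests.
    """
    return requests // 10 ** nines


errors = 2_543
requests = 59_000_000  # "59M" in the post, so treat results as approximate
worst_hour_errors = 1_030

overall = availability_pct(errors, requests)
without_worst_hour = availability_pct(errors - worst_hour_errors, requests)

if __name__ == "__main__":
    print(f"overall: {overall:.4f}%")
    print(f"without the worst hour: {without_worst_hour:.4f}%")
    for nines in (4, 5, 6):
        print(f"{nines} 9s allows {error_budget(requests, nines):,} errors")
```

At this request volume, 5 9s would leave room for only 590 errors in the year and 6 9s for only 59, which shows why a single bad hour matters so much at that level.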

Cost
This has been another pleasant surprise. We have been able to significantly reduce our monthly hosting costs by moving to the Windows Azure platform. To be fair, it’s not an exact apples-to-apples comparison, but, on the other hand, SXP delivers far more than our previous solution, so here are the details:
Previously, we used a 3rd party solution to provide comments and ratings for our Showcase site (www.microsoft.com/showcase). Since the solution had been customized for Showcase, our monthly hosting costs were $15K. Because of the customizations, we weren’t able to [easily] leverage the solution across the other microsoft.com sites and also weren’t able to upgrade to the latest version of the software, which drove our support costs up. Basically, we had an expensive application and we wanted a scalable platform.
SXP, our Windows Azure-based solution, provides increased functionality and we now have 35 sites (tenants) live. Because we were building a service for use across microsoft.com, we built a multi-tenant service from the ground up. This gave us a distinct operational advantage over the 3rd party solution as we are able to on-board new tenants in minutes rather than weeks. This also gives us a huge cost advantage as we’re able to spread the operations costs across multiple tenants.
Our monthly cost for the entire SXP production system using full retail pricing is $1,473, which is a 90% savings over the $15K per month we paid for our old solution. Using standard commitment-based pricing, that cost drops to $924.19 per month. January through March, Showcase generated 32% of the SXP traffic. Allocating 32% of the commitment-based cost to Showcase comes to about $296 per month, which is a 98% savings over our old solution. If we divide the commitment-based cost evenly across 35 tenants, each tenant’s cost is $26.41 per month. We’re confident we can drive the monthly per-tenant operations cost below $20 and perhaps below $10 or even $5.
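For readers who want to check the arithmetic, here is a short Python sketch that uses only the figures quoted above. The function names are ours, and `tenants_needed` rests on a loud assumption that is not a claim from our billing data: that the total monthly cost stays fixed as tenants are added.

```python
import math

OLD_MONTHLY = 15_000.00        # previous third-party solution
RETAIL_MONTHLY = 1_473.00      # SXP at full retail pricing
COMMITMENT_MONTHLY = 924.19    # SXP at commitment-based pricing
SHOWCASE_TRAFFIC_SHARE = 0.32  # Showcase share of traffic, January through March
TENANTS = 35


def savings_pct(new_cost: float, old_cost: float) -> float:
    """Percentage saved when moving from old_cost to new_cost."""
    return 100.0 * (1 - new_cost / old_cost)


def tenants_needed(total_cost: float, target_per_tenant: float) -> int:
    """Smallest tenant count that puts the even per-tenant share strictly
    below the target. ASSUMPTION: total_cost does not grow with tenants."""
    return math.floor(total_cost / target_per_tenant) + 1


showcase_cost = SHOWCASE_TRAFFIC_SHARE * COMMITMENT_MONTHLY
per_tenant = COMMITMENT_MONTHLY / TENANTS

if __name__ == "__main__":
    print(f"retail savings: {savings_pct(RETAIL_MONTHLY, OLD_MONTHLY):.0f}%")
    print(f"Showcase allocation: ${showcase_cost:.0f}")
    print(f"Showcase savings: {savings_pct(showcase_cost, OLD_MONTHLY):.0f}%")
    print(f"per tenant: ${per_tenant:.2f}")
    for target in (20, 10, 5):
        print(f"below ${target}: {tenants_needed(COMMITMENT_MONTHLY, target)} tenants")
```

Under that fixed-cost assumption, the per-tenant targets translate into tenant counts: 47 tenants for under $20, 93 for under $10, and 185 for under $5. Real costs would likely rise somewhat with load, so these counts are a lower bound rather than a forecast.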
At the same time, we increased our availability and greatly increased our agility in responding to business needs. Here is our before and after picture:

 

While not everything has been perfect over our first year of operations, we are very pleased with our Windows Azure platform experience, as evidenced by the fact that we are aggressively moving additional workloads to the platform.