We're Live in the Cloud

I've been really busy this year working on the team that shipped version 1 of the Social eXperience Platform (SXP).  SXP (previously known as Project Austin) is a multi-tenant web service that enables social media capabilities across microsoft.com.  SXP v1 delivers comments and ratings services and our first customer is the Showcase site on microsoft.com.  We're working to extend SXP and onboard additional partners.

One of our goals with SXP is to provide the ability to comment on or rate "anything" - much like the Facebook "like button" that is showing up everywhere.  There are a lot of existing solutions to this problem, but we couldn't find any that weren't tied to specific types of data.  The system we are replacing used "blogs" and "blog posts" to enable comments on "videos".  While this worked, it was hard to understand and explain.  We're also seeing tremendous cost savings by leveraging Windows Azure.

The cloud - Windows Azure and SQL Azure specifically - is a perfect fit for this type of solution.  In a couple of months, we were able to ship a set of web services and a moderation tool hosted on Windows Azure and SQL Azure. 

Technically, we're a hybrid cloud solution.  SXP runs on Windows Azure and SQL Azure, but the Showcase web servers are part of the standard microsoft.com servers running in our internal data center.  We have a really good network pipe to the Azure data center, but it's still remote and outside of our operations team's control.

Over the next few blog posts, I'll go into details about our experiences (mostly positive), but here are the high points:

Azure is Agile - In my opinion, the very best thing about Azure is how quickly I can set up and deploy a new environment.  No need to get budget approval, order hardware, rack, stack, and configure.  It's even faster than our standard VM deployment process.

Azure Just Works - Maybe this is the best thing about Azure ...  We've been live for 6 weeks and on most days our availability has been 100% (we measure availability at the transaction level).  Our worst day was 99.992% - that's over 4 9s!  Our first month came in at 99.9991%.  Five 9s - out of the box with no capital charges and a reasonable monthly fee.  We haven't had two transactions in a row fail in over a month, and we intentionally chose not to write a bunch of complex retry code - this is all Azure.  I wish I had this in my dot com days ...  I realize this is a small data sample, but still - nice job Azure teams.
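For anyone who wants to check the "how many 9s" arithmetic, here is a minimal sketch in Python of transaction-level availability.  The function names and the transaction counts are made up for illustration (the real volumes aren't in this post); only the percentages come from our numbers.

```python
import math


def availability_pct(failed: int, total: int) -> float:
    """Percentage of transactions that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * (total - failed) / total


def count_nines(failed: int, total: int) -> float:
    """Number of leading 9s in the availability figure.

    99.9% -> 3, 99.99% -> 4, and so on.  Returns math.inf when
    nothing failed, since 100% has no finite count of 9s.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if failed == 0:
        return math.inf
    # The failure rate 10^-n corresponds to n nines of availability.
    return math.floor(-math.log10(failed / total))


if __name__ == "__main__":
    # Hypothetical counts chosen to reproduce the percentages above.
    for label, failed, total in [
        ("worst day", 80, 1_000_000),
        ("first month", 9, 1_000_000),
    ]:
        print(label, availability_pct(failed, total), count_nines(failed, total))
```

Measuring at the transaction level means a day with one failed call still shows up as less than 100%, which is a stricter yardstick than "was the site reachable".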

Azure is Familiar - While there are some differences (potentially large differences if you're porting a large code base), Azure is "just .NET and SQL".  Because Azure is so prescriptive, many of the architectural decisions are made for you.  This could be good or bad depending on your situation - it was good for us.

Azure is Different - Writing the code was familiar.  Understanding the differences in SDLC, deployment, and operations was more challenging.  This is getting better every week as more and more scripts and tools become available, but don't underestimate the operations impact, and make sure to instrument your code and turn on Azure Diagnostics.  Forgetting a small change to web.config and having to recompile, re-package, and re-deploy, instead of just copying the file to the server, took some getting used to.  However, see the point about consistency below.  In the long run, I like the trade-off.

Azure Scales - This goes back to my point earlier about "just .NET and SQL", but we have been very pleased with the performance and scale of the Azure solution.  We cache our configuration data, but for everything else, we go back to the databases.  We see great response times and they degrade nicely as load increases.  We have hours that are 3x our normal traffic loads, and while the average response time increases, it is still well within our SLA and we haven't had any hung requests.
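The "cache the configuration, hit the database for everything else" approach can be sketched roughly like this.  This is Python rather than our .NET code, and the class name, the loader callback, the five minute expiry, and the sample setting are all assumptions for illustration, not a description of how SXP actually does it.

```python
import time
from typing import Callable, Dict, Optional


class ConfigCache:
    """Keeps configuration in memory and reloads it after a time limit."""

    def __init__(
        self,
        loader: Callable[[], Dict[str, str]],
        ttl_seconds: float = 300.0,  # assumed expiry, not from the post
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader  # e.g. a function that queries the database
        self._ttl = ttl_seconds
        self._clock = clock
        self._values: Optional[Dict[str, str]] = None
        self._loaded_at = 0.0

    def get(self, key: str, default: Optional[str] = None) -> Optional[str]:
        now = self._clock()
        # Reload on first use or once the cached copy is too old.
        if self._values is None or now - self._loaded_at >= self._ttl:
            self._values = self._loader()
            self._loaded_at = now
        return self._values.get(key, default)


if __name__ == "__main__":
    # Hypothetical loader standing in for a database query.
    cache = ConfigCache(lambda: {"max_comment_length": "500"})
    print(cache.get("max_comment_length"))
```

Only the slowly changing settings go through something like this; comments and ratings themselves are read from the database on every request.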

Upgrading with Zero Missed Transactions - This was actually pretty cool.  We used the "VIP Swap" feature to promote our staging environment to production and didn't miss a single transaction.  We've done this twice now with a 3rd upgrade not too far out.  I have implemented this feature several different ways over the years, but I have never seen it this smooth, especially on the first try.  Oh yeah, it only costs a few bucks compared to having to duplicate your entire environment.  Very cool feature.

Azure Enables Global Development - I'm in Austin, part of the team is in Redmond, the rest of the team is in India, and our data center is in San Antonio.  Performance is excellent from Austin and Redmond and good from Hyderabad.  In future releases, we'll leverage the Azure data centers in Asia as well as the US-based ones, so this will just get better.

Azure is Consistent - The test team was in India and the dev team was in the US, so we pretty much worked around the clock.  On projects like this in the past, test has wasted time chasing bugs that were "configuration" related.  That didn't happen this time.  Every deployment that passed the smoke test worked.  This saved us a bunch of time and allowed us to make our aggressive timeline.  As I mentioned earlier, the trade-off is that you can't just update the web.config file.

Instrumentation is Insight - Because we knew we were facing a totally unknown operations environment, we heavily instrumented our code (that's dev speak for "over-instrumented").  We leveraged the out-of-the-box Azure Diagnostics to do the heavy lifting, then wrote some small utilities.  (We're planning to be early adopters of the System Center / Azure management pack as it enters beta.)  Because of the instrumentation, we've gained some business insight that we didn't have before and it's shaping v2.  The concept of "design for operations" is crucial - get your ops team onboard early.  We have a great ops team and couldn't have shipped without them.

Performance Testing - Running perf and stress tests was a bit challenging.  We ended up writing some custom code that we deployed to Azure to do the testing for us (that was actually a lot of fun).  Since we're a web service, this wasn't difficult to accomplish.  The problem with using Visual Studio to run the tests like we normally would is the network pipe between the client and the server.  It skews all of the results and very quickly becomes a bottleneck.  It also runs up bandwidth costs and might anger your network administrators.  Be prepared to spend some extra time here.  This is a great ISV opportunity.
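To give a feel for what "custom code to do the testing" can look like, here is a rough sketch of a tiny load generator in Python.  It is not our actual test harness: the function names are invented, and the request is a do-nothing placeholder where a real run would call the web service from a machine close to it.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Dict, List


def timed_call(request: Callable[[], None]) -> float:
    """Run one request and return how long it took, in seconds."""
    start = time.perf_counter()
    request()
    return time.perf_counter() - start


def run_load(request: Callable[[], None], total_requests: int,
             concurrency: int) -> List[float]:
    """Issue total_requests calls using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda _: timed_call(request),
                             range(total_requests)))


def summarize(latencies: List[float]) -> Dict[str, float]:
    """Count, mean, 95th percentile (nearest rank), and max latency."""
    if not latencies:
        raise ValueError("no latencies recorded")
    ordered = sorted(latencies)
    n = len(ordered)
    # Integer ceiling of 0.95 * n, converted to a zero-based index.
    p95_index = (95 * n + 99) // 100 - 1
    return {
        "count": n,
        "mean": mean(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }


if __name__ == "__main__":
    # Placeholder request; swap in a real HTTP call to the service under test.
    results = run_load(lambda: None, total_requests=100, concurrency=10)
    print(summarize(results))
```

Running something like this next to the service keeps the client-to-server network pipe out of the measurements, which is the whole point of not driving the load from a desktop.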

Bleeding Edge - We were starting our dev right as Azure v1 was releasing, so there were some early adopter pains that have gotten much better lately.  There were some Azure portal performance issues early on, but we haven't seen those in months and once our deployments were running, we didn't have any performance issues due to Azure.

Leverage the Community - The cloud and Azure are fast-moving targets.  We just announced major new enhancements to Windows Azure and SQL Azure at Tech Ed this week.  The Azure portal is a great first place to start and the Windows Azure Communities are a wealth of information.