A quick self-introduction: I’m Alex Mallet, one of the development leads on the Live Mesh project. I’ve been at Microsoft since ’97, except for an abortive [but instructive] side trip to graduate school in an attempt to get a PhD in computational biology. Just about all of my time has been spent working on distributed systems, of gradually increasing scale – I started out working on IIS, moved to Application Center 2000, worked on our P2P API toolkit and finally ended up on the Live Mesh team about a year and a half ago. On Live Mesh, my team and I are responsible for making sure our datacenter services are easy to deploy and manage, and for providing common functionality needed by our cloud services. So, on the heels of the previous blog posts that have introduced the “big picture” view, I thought I’d give you a bit more insight into some of the details of the “services” part of “Software + Services”, by talking about our services that run in the cloud.
Our general philosophy when building our cloud services was to adhere to the tenets of Recovery-Oriented Computing (ROC): programs will crash, hardware will fail, and they will do so regularly, so your system should be prepared to deal with these failures. While it’s easy to espouse these principles in theory, the obvious next question is how to turn them into practice, and here we were aided by a great “best practices” survey paper written by James Hamilton, namely “On Designing and Deploying Internet-Scale Services”. I won’t claim that we managed to do everything that’s in his paper [we’re only at the Tech Preview stage, after all :)], but I think we’ve done a decent job so far, and are heading in the right direction overall.
Enough philosophy, on to some more detail.
From a functionality perspective, our cloud services can be grouped into four buckets: dealing with feed and data synchronization, providing authentication and authorization, maintaining and fanning out the system’s transient state [like the various notifications provided in the Live Mesh Bar], and providing the connectivity services that allow synchronization and remote desktop access to work across any network topology. Sliced along the “state” axis, we have stateless front-end services, back-end services that maintain in-memory state, and persistent storage layers that handle both structured and unstructured data. From a scaling perspective, our plan is to scale out, not up. Thus, we’ve invested in making sure that we have as many stateless services as possible, as well as having facilities that allow us to partition our state [both persistent and transient] across multiple machines, and reconfigure these partitions as necessary. Overall, we have close to 20 different services, with each service consisting of multiple, redundant instances of a particular bit of code, striped across several racks of machines in the datacenter – in keeping with the ROC assumptions, our goal is to be resilient to multiple hardware and software failures.
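To make the partitioning idea a bit more concrete, here’s a toy sketch [in Python, purely illustrative – this isn’t our actual code, and all the names and the hashing scheme are made up] of a reconfigurable partition map: keys hash to a fixed set of partitions, and each partition can be pointed at a different machine without changing where keys land.

```python
import hashlib

class PartitionMap:
    """Illustrative sketch: map a state key to one of N partitions,
    each currently owned by some machine. Keeping the key->partition
    mapping fixed while the partition->machine table is reconfigurable
    lets you move a partition to a new machine during repair or
    rebalancing without rehashing anything."""

    def __init__(self, owners):
        # owners[i] is the machine currently serving partition i
        self.owners = list(owners)

    def partition_of(self, key):
        # Stable hash of the key, reduced to a partition index.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % len(self.owners)

    def owner_of(self, key):
        return self.owners[self.partition_of(key)]

    def reassign(self, partition, new_owner):
        # Reconfigure: hand a partition to a different machine.
        self.owners[partition] = new_owner

pm = PartitionMap(["machine-a", "machine-b", "machine-c"])
p = pm.partition_of("user-1234")     # deterministic for a given key
pm.reassign(p, "machine-d")          # e.g. machine-a/b/c needs repair
```

In a real system the partition map itself has to be stored and updated reliably, of course, but the basic shape – stable key hashing plus a small reconfigurable ownership table – is the point of the sketch.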
Our front-end services are accessible [only] via HTTPS – all of the traffic that flows in and out of our system is encrypted. Our back-end services use a mixture of HTTPS and custom protocols layered on top of TCP. The vast majority of the services are written in C#, with the only exceptions being services that needed deep integration with Windows functionality that isn’t [easily] accessible to an application written in managed code.
All of our services sit on top of a runtime library that contains facilities commonly needed by each service: process lifetime management, HTTP and TCP listeners, a debug logging facility, a work queue facility, APIs to generate monitoring data like performance counters, etc. This common runtime also contains debugging, testing and monitoring hooks; for example, we have the ability to inject random delays and failures into our HTTP pipeline, which allows us to test our failure monitors and the overall response of the system to slow and failing services.
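The fault-injection hooks are conceptually very simple. Here’s a rough sketch of the idea [again in Python rather than C#, with invented names – our actual runtime hooks look different]: wrap a request handler so that, in test configurations, it sometimes stalls and sometimes throws, which exercises the failure monitors and the callers’ timeout and retry behavior.

```python
import random
import time

def with_faults(handler, delay_prob=0.1, fail_prob=0.05,
                max_delay=2.0, rng=random):
    """Illustrative fault-injection wrapper for a request handler.
    With probability delay_prob the request is delayed (a slow
    service); with probability fail_prob it raises (a dying one)."""
    def wrapped(request):
        if rng.random() < delay_prob:
            time.sleep(rng.uniform(0, max_delay))   # simulate slowness
        if rng.random() < fail_prob:
            raise RuntimeError("injected failure")  # simulate a crash
        return handler(request)
    return wrapped

# In a test deployment you might dial the probabilities way up:
flaky = with_faults(lambda req: "ok", delay_prob=0.5, fail_prob=0.2)
```

The useful property is that the injection sits in the pipeline, not in the handlers, so every service built on the runtime can be stressed the same way without code changes.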
Building a full-scale datacenter deployment and management system is a huge undertaking, so we took the lazy [read: smart] route and went with an existing, battle-tested system, namely the Autopilot framework, which was developed, and is being used, by the Windows Live Search team to manage their tens of thousands of datacenter machines [we’re not at that scale yet, but we hope to be, with your help :)]. We use Autopilot to manage our code and data deployments, for [some of our] failure monitoring and self-healing, and to give us insight into the current state of our datacenter machines and services.
On the monitoring front, we actually monitor the system at several levels – via simple Autopilot-style watchdogs, with more extensive tests called “runners”, by hitting our service from various points outside our datacenter, and also using a variety of tools that scan our logs for error messages, highlight machines that appear to be having problems, look for crashing services, etc. Of course, all these monitors are still somewhat untested – I’m sure we’ll be making lots of tweaks, and adding new tools over the coming weeks and months as we start having to troubleshoot and keep a real live system up and running. :)
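To give a flavor of the watchdog level [a deliberately tiny sketch with made-up names – real Autopilot watchdogs are external programs with their own reporting protocol]: the essential logic is a periodic probe plus a consecutive-failure threshold, so that one transient blip doesn’t get a machine flagged for repair.

```python
class Watchdog:
    """Illustrative watchdog: run a health probe each tick and report
    'unhealthy' only after `threshold` consecutive failures, so the
    repair machinery reacts to sustained problems, not blips."""

    def __init__(self, probe, threshold=3):
        self.probe = probe          # callable returning True if healthy
        self.threshold = threshold
        self.failures = 0

    def tick(self):
        try:
            ok = self.probe()
        except Exception:
            ok = False              # a crashing probe counts as failure
        self.failures = 0 if ok else self.failures + 1
        return "healthy" if self.failures < self.threshold else "unhealthy"
```

The runners, external probes, and log scanners then catch the classes of problems a per-machine probe can’t see, which is why monitoring at several levels matters.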
Ok, I think that’s enough for one post. If this is a topic of interest to you, and you’d like more detail on some of the stuff I’ve talked about, please leave suggestions and questions in the comments, and I’ll address them in follow-up posts.
And, of course, don’t forget to sign up for Live Mesh and give us feedback!