Behind Live Mesh: How we run cloud services

A quick self-introduction: I’m Alex Mallet, one of the development leads on the Live Mesh project. I’ve been at Microsoft since ’97, except for an abortive [but instructive] side trip to graduate school in an attempt to get a PhD in computational biology. Just about all of my time has been spent working on distributed systems, of gradually increasing scale – I started out working on IIS, moved to Application Center 2000, worked on our P2P API toolkit and finally ended up on the Live Mesh team about a year and a half ago.  On Live Mesh, my team and I are responsible for making sure our datacenter services are easy to deploy and manage, and for providing common functionality needed by our cloud services. So, on the heels of the previous blog posts that have introduced the “big picture” view, I thought I’d give you a bit more insight into some of the details of the “services” part of “Software + Services”, by talking about our services that run in the cloud.

Our general philosophy when building our cloud services was to adhere to the tenets of Recovery-Oriented Computing (ROC): programs will crash, hardware will fail, and they will do so regularly, so your system should be prepared to deal with these failures. While it’s easy to espouse these principles in theory, the obvious next question is how to turn them into practice, and here we were aided by a great “best practices” survey paper written by James Hamilton, namely “On Designing and Deploying Internet-Scale Services”. I won’t claim that we managed to do everything that’s in his paper [we’re only at the Tech Preview stage, after all 🙂], but I think we’ve done a decent job so far, and are heading in the right direction overall.

Enough philosophy, on to some more detail.

From a functionality perspective, our cloud services can be grouped into four buckets: dealing with feed and data synchronization, providing authentication and authorization, maintaining and fanning out the system’s transient state [like the various notifications provided in the Live Mesh Bar], and providing the connectivity services that let synchronization and remote desktop access work across any network topology. Sliced along the “state” axis, we have stateless front-end services, back-end services that maintain in-memory state, and persistent storage layers that handle both structured and unstructured data. From a scaling perspective, our plan is to scale out, not up. Thus, we’ve invested in making sure that we have as many stateless services as possible, as well as having facilities that allow us to partition our state [both persistent and transient] across multiple machines, and reconfigure these partitions as necessary. Overall, we have close to 20 different services, with each service consisting of multiple, redundant instances of a particular bit of code, striped across several racks of machines in the datacenter. In keeping with the ROC assumptions, our goal is to be resilient to multiple hardware and software failures.
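
To make the idea of partitioning state across machines and reconfiguring those partitions a bit more concrete, here is a minimal Python sketch. It is an illustration only, not the Live Mesh implementation: the `PartitionMap` class, its method names, the fixed number of logical partitions and the round-robin assignment are all assumptions made for the example.

```python
import hashlib


class PartitionMap:
    """Hypothetical map from keys to logical partitions, and partitions to machines."""

    def __init__(self, machines, num_partitions=64):
        if not machines:
            raise ValueError("need at least one machine")
        self.num_partitions = num_partitions
        # Assumption: logical partitions are spread round-robin over the machines.
        self.owner = {p: machines[p % len(machines)] for p in range(num_partitions)}

    def partition_for(self, key):
        # A stable hash, so every caller computes the same partition for a key.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_partitions

    def machine_for(self, key):
        return self.owner[self.partition_for(key)]

    def reassign(self, failed_machine, replacements):
        """Move every partition owned by failed_machine onto the replacements."""
        moved = [p for p, m in self.owner.items() if m == failed_machine]
        for i, p in enumerate(moved):
            self.owner[p] = replacements[i % len(replacements)]
        return moved
```

Because keys map to logical partitions rather than directly to machines, reconfiguring after a failure only changes the partition-to-machine table; the key-to-partition hash stays the same.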

Our front-end services are accessible [only] via HTTPS – all of the traffic that flows in and out of our system is encrypted. Our back-end services use a mixture of HTTPS and custom protocols layered on top of TCP. The vast majority of the services are written in C#, with the only exceptions being services that needed deep integration with Windows functionality that isn’t [easily] accessible to an application written in managed code.

All of our services sit on top of a runtime library that contains facilities commonly needed by each service: process lifetime management, HTTP and TCP listeners, a debug logging facility, a work queue facility, APIs to generate monitoring data like performance counters, etc. This common runtime also contains debugging, testing and monitoring hooks; for example, we have the ability to inject random delays and failures into our HTTP pipeline, which allows us to test our failure monitors and the overall response of the system to slow and failing services.
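
As a rough illustration of what injecting random delays and failures into a request pipeline can look like, here is a small Python sketch. It is not our runtime library [which is written in C#]: the `FaultInjector` name, the `(status, body)` handler convention and the `failure_rate` and `max_delay_s` parameters are all hypothetical.

```python
import random
import time


class FaultInjector:
    """Hypothetical wrapper that adds random delays and failures to a handler."""

    def __init__(self, handler, failure_rate=0.0, max_delay_s=0.0,
                 rng=None, sleep=time.sleep):
        self.handler = handler            # callable: request -> (status, body)
        self.failure_rate = failure_rate  # probability of an injected failure
        self.max_delay_s = max_delay_s    # upper bound on the injected delay
        self.rng = rng or random.Random()
        self.sleep = sleep                # injectable, so tests need not wait

    def __call__(self, request):
        if self.max_delay_s > 0:
            # Simulate a slow service with a uniformly random delay.
            self.sleep(self.rng.uniform(0, self.max_delay_s))
        if self.rng.random() < self.failure_rate:
            # Simulate a failing service without calling the real handler.
            return 503, "injected failure"
        return self.handler(request)
```

Wrapping a handler as `FaultInjector(handler, failure_rate=0.05, max_delay_s=2.0)` in a test environment would then let failure monitors be exercised against a service that is occasionally slow or unavailable.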

Building a full-scale datacenter deployment and management system is a huge undertaking, so we chose the lazy, or rather smart, route and went with an existing, battle-tested system, namely the Autopilot framework, which was developed, and is being used, by the Windows Live Search team to manage their tens of thousands of datacenter machines [we’re not at that scale yet, but we hope to be, with your help 🙂]. We use Autopilot to manage our code and data deployments, for [some of our] failure monitoring and self-healing, and to give us insight into the current state of our datacenter machines and services.

On the monitoring front, we actually monitor the system at several levels – via simple Autopilot-style watchdogs, with more extensive tests called “runners”, by hitting our service from various points outside our datacenter, and also using a variety of tools that scan our logs for error messages, highlight machines that appear to be having problems, look for crashing services, etc. Of course, all these monitors are still somewhat untested – I’m sure we’ll be making lots of tweaks, and adding new tools over the coming weeks and months as we start having to troubleshoot and keep a real live system up and running. 🙂
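
For readers who haven’t come across watchdogs before, the general idea can be sketched in a few lines of Python. This is a generic illustration, not Autopilot’s actual watchdog interface: the `Watchdog` class, the `probe` callable, the `"ok"`/`"warning"`/`"error"` status strings and the consecutive-failure threshold are all assumptions made for the example.

```python
class Watchdog:
    """Hypothetical watchdog: probes a service and reports a coarse health status."""

    def __init__(self, probe, failures_before_error=3):
        self.probe = probe  # callable returning True when the service is healthy
        self.failures_before_error = failures_before_error
        self.consecutive_failures = 0

    def check(self):
        try:
            healthy = bool(self.probe())
        except Exception:
            # A probe that throws is treated the same as an unhealthy result.
            healthy = False
        if healthy:
            self.consecutive_failures = 0
            return "ok"
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failures_before_error:
            return "error"
        # Assumption: a single blip is only a warning, so that one dropped
        # request does not trigger any repair action.
        return "warning"
```

Requiring several consecutive failures before reporting an error is one simple way to keep a monitor from overreacting to transient glitches.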

Ok, I think that’s enough for one post. If this is a topic of interest to you, and you’d like more detail on some of the stuff I’ve talked about, please leave suggestions and questions in the comments, and I’ll address them in follow-up posts.

And, of course, don’t forget to sign up for Live Mesh and give us feedback!


Comments (10)
  1. Nektar says:

    Why don’t you use any Soap web services to communicate on the back-end? Why did you have to build everything from scratch?

  2. Alex Mallet says:

    Thanks for the questions.

    I’ll answer the second question first: we didn’t build everything from scratch. As I mentioned already, we reused a huge chunk of work in the form of the Autopilot system. We’ve also reused existing code libraries from Windows Live teams like Messenger and Search. In general, we tried really hard to write as little new code as possible.

    As far as the use of SOAP goes, there are really two answers. For the HTTP-based communication on the backend, we’re  simply reusing the existing REST-based interfaces that are already exposed by the services – building a SOAP layer would have been extra work. TCP is used between services that don’t already have an HTTP interface, are relatively tightly-coupled, and have communication flows that are more suited to the stream-based semantics provided by TCP than the SOAP request-response model.  

  3. John says:

    Probably for the same reason the rest of the web uses REST, not SOAP.

  4. Alex Mallet posted @ the Live Mesh blog about how the Live Mesh cloud services run, how we think about

  5. Jason says:

    It looks like a wonderful service. I hope that I will be invited to test it out sooner, rather than later.

  6. myforesight says:

    When are we going to see a Cloud Service from Microsoft, comparable to Amazon EC2 or Google AppEngine?

    As a suggestion, the system would have the following features, easily available to non-professionals:

    – No technical Administration due to tight standards and limited ports

    – Automatically scaling on demand

    – DNS services for external Domain Names (Domain Pointers to folders, instead of httpd.conf settings)

    – LiveMail for Domains integration

    – Ability to run PHP (or Zend PHP) incl. 100% mod_rewrite compatibility and MySQL additional to M$ standards

    – Billing based on storage/m., CPUhours/m.,traffic/m

    – Interface comparable to Parallels Plesk.. why reinvent the wheel?

  7. Alan Isherwood says:

    So, who do I have to bribe in order to jump the beta queue on getting a Live Mesh account?

  8. Alex Mallet says:

    myforesight: You’ve described an interesting service, but it’s not really the focus of Live Mesh — the features you list belong in a utility computing service, whereas Live Mesh is a synchronization platform.

    And of course we can’t comment on anything other than what our team is building 🙂  

  9. Marc Vallribera i Ros says:

    I think that this is a great service! I’ve always dreamed of a time where my information is simply “there”, not “in my desktop”, “in my laptop”,… But my dream goes a bit further: having the ability not only to synchronise the same file across all the devices but also to choose how it will be synchronised. And I don’t mean automatic or manual sync. I mean this: you have 2GB of pictures in your desktop and want to have all of them “everywhere”. But your mobile phone doesn’t have that much capacity. Wouldn’t it be great if you could choose to resize pictures to fit in the device?

    Maybe this can be done as an extension or third-party application inside the Live Mesh service?

    In the past I made a lot of flowcharts and notes about this… maybe you are interested?

    Anyway, you are doing great work! Keep it up! I hope I’ll soon be invited to try it!

Comments are closed.
