Scaling Git (and some back story)

A couple of years ago, Microsoft made the decision to begin a multi-year investment in revitalizing our engineering system across the company.  We are a big company with tons of teams – each with their own products, priorities, processes and tools.  There are some “common” tools but also a lot of diversity – with VERY MANY internally developed one-off tools (by team I kind of mean division – thousands of engineers).

There are a lot of downsides to this:

  1. Lots of redundant investments in teams building similar tooling
  2. Inability to fund any of the tooling to “critical mass”
  3. Difficulty for employees to move around the company due to different tools and process
  4. Difficulty in sharing code across organizations
  5. Friction for new hires getting started due to an overabundance of “MS-only” tools
  6. And more…

We set out on an effort we call the “One Engineering System” or “1ES”.  Just yesterday we had a 1ES day where thousands of engineers gathered to celebrate the progress we’ve made, to learn about the current state and to discuss the path forward.  It was a surprisingly good event.

Aside… You might be asking yourself – hey, you’ve been telling us for years Microsoft uses TFS, have you been lying to us?  No, I haven’t.  Over 50K people have regularly used TFS but they don’t always use it for everything.  Some use it for everything.  Some use only work item tracking.  Some only version control.  Some build …  We had internal versions (and in many cases more than one) of virtually everything TFS does and someone somewhere used them all.  It was a bit of chaos, quite honestly.  But, I think I can safely say, when aggregated and weighed – TFS had more adoption than any other set of tools.

I also want to point out that, when I say engineering system here, I am using the term VERY broadly.  It includes but is not limited to:

  1. Source control
  2. Work management
  3. Builds
  4. Release
  5. Testing
  6. Package management
  7. Telemetry
  8. Flighting
  9. Incident management
  10. Localization
  11. Security scanning
  12. Accessibility
  13. Compliance management
  14. Code signing
  15. Static analysis
  16. and much, much more

So, back to the story.  When we embarked on this journey, we had some heated debates about where we were going, what to prioritize, etc.  You know, developers never have opinions. 🙂  There’s no way to address everything at once without failing miserably, so we agreed to start by tackling 3 problems:

  • Work planning
  • Source control
  • Build

I won’t go into detailed reasons other than to say those are foundational and so much else integrates with them, builds on them etc. that they made sense.  I’ll also observe that we had a HUGE amount of pain around build times and reliability due to the size of our products – some hundreds of millions of lines of code.

Over the intervening time those initial 3 investments have grown and, to varying degrees, the 1ES effort touches almost every aspect of our engineering process.

We put some interesting stakes in the ground.  Some included:

The cloud is the future – Much of our infrastructure and tools were hosted internally (including TFS).  We agreed that the cloud is the future – mobility, management, evolution, elasticity, all the reasons you can think of.  A few years ago, that was very controversial.  How could Microsoft put all our IP in the cloud?  What about performance?  What about security?  What about reliability?  What about compliance and control?  What about…  It took time but we eventually got a critical mass OK with the idea and as the years have passed, that decision has only made more and more sense and everyone is excited about moving to cloud.

1st party == 3rd party – This is an expression we use internally that means, as much as possible, we want to use what we ship and ship what we use.  It’s not 100% and it’s not always concurrent but it’s the direction – the default assumption, unless there’s a good reason to do something else.

Visual Studio Team Services is the foundation – We made a bet on Team Services as the backbone.  We need a fabric that ties our engineering system together – a hub from which you learn about and reach everything.  That hub needs to be modern, rich, extensible, etc.  Every team needs to be able to contribute and share their distinctive contributions to the engineering system.  Team Services fits the bill perfectly.  Over the past year usage of Team Services within Microsoft has grown from a couple of thousand to over 50,000 committed users.  As with TFS, not every team uses it for everything yet, but momentum in that direction is strong.

Team Services work planning – Having chosen Team Services, it was pretty natural to choose the associated work management capabilities.  We’ve on-boarded teams like the Windows group, with many thousands of users and many millions of work items, into a single Team Services account.  We had to do a fair amount of performance and scale work to make that viable, BTW.  At this point virtually every team at Microsoft has made this transition and all of our engineering work is being managed in Team Services.

Team Services Build orchestration & CloudBuild – I’m not going to drill on this topic too much because it’s a mammoth post in and of itself.  I’ll summarize it to say we’ve chosen the Team Services Build service as our build orchestration system and the Team Services Build management experience as our UI.  We have also built a new “make engine” (that we don’t yet ship) for some of our largest code bases that does extremely high scale and fine grained caching, parallelization and incrementality.  We’ve seen multi-hour builds drop sometimes to minutes.  More on this in a future post at some point.

After much backstory, on to the meat 🙂

Git for source control

Maybe the most controversial decision was what to use for source control.  We had an internal source control system called Source Depot that virtually everyone used in the early 2000s.  Over time, TFS and its Team Foundation Version Control solution won over much of the company but never made progress with the biggest teams – like Windows and Office.  Lots of reasons, I think – some of it was just that the cost for such large teams to migrate was extremely high and the two systems (Source Depot and TFS) weren’t different enough to justify it.

But source control systems generate intense loyalty – more so than just about any other developer tool.  So the argument between TFVC, Source Depot, Git, Mercurial, and more was ferocious and, quite honestly, we made a decision without ever getting consensus – it just wasn’t going to happen.  We chose to standardize on Git for many reasons.  Over time, that decision has gotten more and more adherents.

There were many arguments against choosing Git but the most concrete one was scale.  There aren’t many companies with code bases the size of some of ours.  Windows and Office, in particular (but there are others), are massive.  Thousands of engineers, millions of files, thousands of build machines constantly building it – quite honestly, it’s mind boggling.  To be clear, when I refer to Windows in this post, I’m actually painting with a very broad brush – it’s Windows for PC, Mobile, Server, HoloLens, Xbox, IOT, and more.  And Git is a distributed version control system (DVCS).  It copies the entire repo and all its history to your local machine.  Doing that with Windows is laughable (and we got laughed at plenty).  TFVC and Source Depot had both been carefully optimized for huge code bases and teams.  Git had *never* been applied to a problem like this (or probably even within an order of magnitude of this) and many asserted it would *never* work.

The first big debate was – how many repos do you have – one for the whole company at one extreme or one for each small component?  A big spectrum.  Git is proven to work extremely well for a very large number of modest repos so we spent a bunch of time exploring what it would take to factor our large codebases into lots of tenable repos.  Hmm.  Ever worked in a huge code base for 20 years?  Ever tried to go back afterwards and decompose it into small repos?  You can guess what we discovered.  The code is very hard to decompose.  The cost would be very high.  The risk from that level of churn would be enormous.  And, we really do have scenarios where a single engineer needs to make sweeping changes across a very large swath of code.  Trying to coordinate that across hundreds of repos would be very problematic.

After much hand wringing we decided our strategy needed to be “the right number of repos based on the character of the code”.  Some code is separable (like microservices) and is ideal for isolated repos.  Some code is not (like Windows core) and needs to be treated like a single repo.  And, I want to emphasize, it’s not just about the difficulty of decomposing the code.  Sometimes, in big highly related code bases, it really is better to treat the codebase as a whole.  Maybe someday I’ll tell the story of Bing’s effort to componentize the core Bing platform into packages and the versioning problems that caused for them.  They are currently backing away from that strategy.

That meant we had to embark upon scaling Git to work on codebases that are millions of files, hundreds of gigabytes and used by thousands of developers.  As a contextual side note, even Source Depot did not scale to the entire Windows codebase.  It had been split across 40+ depots so that we could scale it out but a layer was built over it so that, for most use cases, you could treat it like one.  That abstraction wasn’t perfect and definitely created some friction.

We started down at least 2 failed paths to scale Git.  Probably the most extensive one was to use Git submodules to stitch together lots of repos into a single “super” repo.  I won’t go into details but after 6 months of working on that we realized it wasn’t going to work – too many edge cases, too much complexity and fragility.  We needed a bulletproof solution that would be well supported by almost all Git tooling.

Close to a year ago we reset and focused on how we would actually get Git to scale to a single repo that could hold the entire Windows codebase (including estimates of growth and history) and support all the developers and build machines.

We tried an approach of “virtualizing” Git.  Normally Git downloads *everything* when you clone.  But what if it didn’t?  What if we virtualized the storage under it so that it only downloaded the things you need?  Then a clone of a massive 300GB repo becomes very fast.  As I perform Git commands or read/write files in my enlistment, the system seamlessly fetches the content from the cloud (and then stores it locally so future accesses to that data are all local).  The one downside to this is that you lose offline support.  If you want that, you have to “touch” everything to manifest it locally, but you don’t lose anything else – you still get the 100% fidelity Git experience.  And for our huge code bases, that was OK.
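To make the "fetch on first access, then keep it local" idea concrete, here is a tiny Python sketch.  It is not how GVFS is implemented; the class name LazyObjectStore, the fetch_remote callback and the SERVER dictionary are all made up for illustration, and the "server" is just an in-memory dict standing in for a network call.

```python
from typing import Callable, Dict


class LazyObjectStore:
    """Toy model of on-demand object download with a local cache."""

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        self._fetch_remote = fetch_remote   # hypothetical network call
        self._local: Dict[str, bytes] = {}  # objects already stored locally
        self.remote_fetches = 0             # counts round trips to the server

    def read(self, object_id: str) -> bytes:
        # Download only on the first access; afterwards serve the local copy.
        if object_id not in self._local:
            self._local[object_id] = self._fetch_remote(object_id)
            self.remote_fetches += 1
        return self._local[object_id]

    def is_local(self, object_id: str) -> bool:
        return object_id in self._local


# Stand-in for the server: object id -> content (illustrative data only).
SERVER = {"a1": b"kernel source", "b2": b"shell source", "c3": b"huge binary"}
store = LazyObjectStore(lambda object_id: SERVER[object_id])

store.read("a1")  # first access goes to the "server"
store.read("a1")  # second access is served locally
print(store.remote_fetches, store.is_local("c3"))
```

The object that was never read ("c3") never leaves the server, which is the whole point.  It is also why offline work requires touching everything up front.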

It was a promising approach and we began to prototype it.  We called the effort Git Virtual File System or GVFS.  We set out with the goal of making as few changes to git.exe as possible.  For sure we didn’t want to fork Git – that would be a disaster.  And we didn’t want to change it in a way that the community would never take our contributions back either.  So we walked a fine line doing as much “under” Git with a virtual file system driver as we could.

The file system driver basically virtualizes 2 things:

  1. The .git folder – This is where all your pack files, history, etc. are stored.  It’s the “whole thing” by default.  We virtualized this to pull down only the files we needed when we needed them.
  2. The “working directory” – the place you go to actually edit your source, build it, etc.  GVFS monitors the working directory and automatically “checks out” any file that you touch making it feel like all the files are there but not paying the cost unless you actually access them.
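The working directory half can be pictured the same way.  The sketch below is a rough Python model under my own assumptions, not the filter driver's actual design: VirtualWorkingDirectory, the manifest of path to blob id, and fetch_blob are hypothetical names.  Every path shows up in a listing, but content is only pulled down the first time a file is read.

```python
from typing import Callable, Dict, List


class VirtualWorkingDirectory:
    """Toy model: every path is listed, content arrives on first read."""

    def __init__(self, manifest: Dict[str, str],
                 fetch_blob: Callable[[str], bytes]) -> None:
        self._manifest = manifest      # path -> blob id (assumed metadata)
        self._fetch_blob = fetch_blob  # hypothetical download function
        self._hydrated: Dict[str, bytes] = {}

    def listdir(self) -> List[str]:
        # Listing needs only metadata, so no file content is downloaded.
        return sorted(self._manifest)

    def read_file(self, path: str) -> bytes:
        # The first touch "checks out" the file; later touches are local.
        if path not in self._hydrated:
            self._hydrated[path] = self._fetch_blob(self._manifest[path])
        return self._hydrated[path]

    def hydrated_paths(self) -> List[str]:
        return sorted(self._hydrated)


# Illustrative data only: three files, one of them a large binary.
BLOBS = {"id1": b"int main() {}", "id2": b"# readme", "id3": b"\x00" * 1024}
MANIFEST = {"src/main.c": "id1", "README.md": "id2", "assets/big.bin": "id3"}

wd = VirtualWorkingDirectory(MANIFEST, lambda blob_id: BLOBS[blob_id])
print(wd.listdir())          # all three paths appear
wd.read_file("src/main.c")   # only this file is materialized
print(wd.hydrated_paths())
```

The large binary is visible in the listing but costs nothing until someone actually reads it.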

As we progressed, as you’d imagine, we learned a lot.  Among other things, we learned the Git server has to be smart.  It has to pack the Git files in an optimal fashion so that it doesn’t have to send more to the client than absolutely necessary – think of it as optimizing locality of reference.  So we made lots of enhancements to the Team Services/TFS Git server.  We also discovered that Git has lots of scenarios where it touches stuff it really doesn’t need to.  This never really mattered before because it was all local and used for modestly sized repos so it was fast – but when touching it means downloading it from the server or scanning 6,000,000 files, uh oh.  So we’ve been investing heavily in performance optimizations to Git.  Many of them also benefit “normal” repos to some degree but they are critical for mega repos.  We’ve been submitting many of these improvements to the Git OSS project and have enjoyed a good working relationship with them.

So, fast forward to today.  It works!  We have all the code from 40+ Windows Source Depot servers in a single Git repo hosted on VS Team Services – and it’s very usable.  You can enlist in a few minutes and do all your normal Git operations in seconds.  And, for all intents and purposes, it’s transparent.  It’s just Git.  Your devs keep working the way they work, using the tools they use.  Your builds just work.  Etc.  It’s pretty frick’n amazing.  Magic!

As a side effect, this approach also has some very nice characteristics for large binary files.  It doesn’t extend Git with a new mechanism like LFS does, no turds, etc.  It allows you to treat large binary files like any other file but it only downloads the blobs you actually ever touch.

Git Merge

Today, at the Git Merge conference in Brussels, Saeed Noursalehi shared the work we’ve been doing – going into excruciating detail on what we’ve done and what we’ve learned.  At the same time, we open sourced all our work.  We’ve also included some additional server protocols we needed to introduce.  You can find the GVFS project and the changes we’ve made to git.exe in the Microsoft GitHub organization.  GVFS relies on a new Windows filter driver (the moral equivalent of the FUSE driver in Linux) and we’ve worked with the Windows team to make an early drop of that available so you can try GVFS.  You can read more and get more resources on Saeed’s blog post.  I encourage you to check it out.  You can even install it and give it a try.

While I’ll celebrate that it works, I also want to emphasize that it is still very much a work in progress.  We aren’t done with any aspect of it.  We think we have proven the concept but there’s much work to be done to make it a reality.  The point of announcing this now and open sourcing it is to engage with the community to work together to help scale Git to the largest code bases.

Sorry for the long post but I hope it was interesting.  I’m very excited about the work – both on 1ES at Microsoft and on scaling Git.

Brian