What are the odds that two pull requests get completed at the exact same time?

The so-called Fall Creators Update of Windows 10 encompassed four million commits in half a million pull requests. A project of this size exposes issues in infrastructure by stressing it to levels previously considered unimaginable.

One of the issues uncovered by hosting the Windows repo on Visual Studio Team Services (VSTS) is that of pull requests being completed simultaneously. As noted in the linked article, the odds of this happening are low on average-sized teams, but for a project as large as Windows, it is a persistent problem. What happens is that multiple pull requests attempt to complete, one of them successfully merges, and the others fail. The people who get the failure push the Complete button again, and the cycle repeats: One succeeds, and the others fail. And all during this time, still more pull requests are joining the party.

You get into a situation where there are continuous race conditions, and while it's true that forward progress is made (one of the pull requests will complete successfully), you unfortunately also create the risk of starvation (some poor guy loses the race ten times in a row), and you burn a lot of CPU cycles calculating merges that end up being thrown away.

The article explains that to address this problem, VSTS added support for queueing, so that pull request completions are serialized rather than using a lock-free algorithm.
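To make the difference concrete, here is a toy simulation. It is only a sketch and not how VSTS works internally: the function names, the event model, and the five-second merge times are all assumptions made for illustration. Each pull request computes its merge against the branch tip it saw when it started, and the result lands only if the tip has not moved in the meantime.

```python
import heapq

def complete_optimistically(merge_seconds):
    """Toy model of unqueued completion (an assumption, not the VSTS code).

    Every pull request starts merging at time 0. A finished merge lands
    only if the branch tip is the same one the merge started from;
    otherwise the work is thrown away and the request retries at once.
    Returns (order in which requests landed, number of discarded merges).
    """
    tip = 0
    wasted = 0
    landed = []
    # Each event is (finish_time, request_id, tip_seen_when_merge_started).
    events = [(secs, pr, tip) for pr, secs in enumerate(merge_seconds)]
    heapq.heapify(events)
    while events:
        now, pr, tip_seen = heapq.heappop(events)
        if tip_seen == tip:
            tip += 1                # the merge lands and moves the tip
            landed.append(pr)
        else:
            wasted += 1             # somebody else landed first
            heapq.heappush(events, (now + merge_seconds[pr], pr, tip))
    return landed, wasted

def complete_queued(merge_seconds):
    """Toy model of queued completion: one merge at a time, in arrival
    order, so every merge is computed against the current tip."""
    return list(range(len(merge_seconds))), 0

landed, wasted = complete_optimistically([5, 5, 5, 5])
print(landed, wasted)                       # [0, 1, 2, 3] 6
print(complete_queued([5, 5, 5, 5]))        # ([0, 1, 2, 3], 0)
```

With only four equal-sized requests, the unqueued model already throws away six merge calculations to land four. In this model, the waste grows quadratically with the number of simultaneous requests.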

I know that one of the ground rules of the Internet is Don't read the comments, but I read the comments on the Ars Technica article anyway. There was some disbelief that two pull requests could complete at exactly the same time. One will certainly be a fraction of a second sooner than the other. So what's this issue with simultaneous pull request completions?

The deal is that completing a pull request takes a nonzero amount of time. For the Windows repo, calculating the merge result can take from five seconds for a small change to significantly longer for larger changes. The index file itself is over a quarter of a gigabyte; you're spending a good amount of time just for the I/O of reading and writing that monster. And then you have to read and compare a ton of trees, and then merge the differences. All this work takes time, and if the calculations for the completions of multiple pull requests overlap, then the one who finishes first wins, and everybody else loses. It's not random who loses the race; it's biased against people who have changes that affect a large number of files. Without queueing, somebody trying to complete a large commit will consistently lose to people with little one-line fixes.¹
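The bias against large changes can be shown with another toy simulation. Everything here is hypothetical: the request names, the arrival times, and the merge times (5 seconds for a small fix, 30 seconds for a large change) are made up for illustration. A merge lands only if the branch tip did not move while it was being computed.

```python
import heapq

FINISH, START = 0, 1   # at equal times, finishes are handled before starts

def count_losses(requests):
    """requests is a list of (name, start_time, merge_seconds).

    Returns {name: number of discarded merge attempts} when completions
    are not queued. This is an illustrative model, not the VSTS code.
    """
    seconds = {name: secs for name, _, secs in requests}
    losses = {name: 0 for name in seconds}
    events = [(start, START, name, -1) for name, start, _ in requests]
    heapq.heapify(events)
    tip = 0
    while events:
        now, kind, name, tip_seen = heapq.heappop(events)
        if kind == START:
            # Remember which tip this merge calculation is based on.
            heapq.heappush(events, (now + seconds[name], FINISH, name, tip))
        elif tip_seen == tip:
            tip += 1                # landed, which moves the tip
        else:
            losses[name] += 1       # lost the race, so start over
            heapq.heappush(events, (now, START, name, -1))
    return losses

# Ten one-line fixes arriving ten seconds apart, plus one large change.
requests = [(f"small-{i}", 10 * i, 5) for i in range(10)]
requests.append(("large", 0, 30))
losses = count_losses(requests)
print(losses["large"])                                          # 4
print(sum(n for name, n in losses.items() if name != "large"))  # 0
```

The small fixes never overlap each other, so none of them ever loses. The large change loses every time a small fix lands during its 30-second window, and it gets through only after the small fixes stop arriving.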

Bonus chatter: In a discussion of Windows source control, one person argued that git works great on large projects and gave an example of a large git repo: "For example, the linux kernel repository at roughly eight years old is 800–900MB in size, has about 45,000 files, and is considered to have heavy churn, with 400,000 commits."

I found that adorable. You have 45,000 files. Yeah, call me when your repo starts to get big. The Windows repo has over three million files.

Four hundred thousand commits over eight years averages to around thirteen commits per day. This is heavy churn? That's so cute.

You know what we call a day with thirteen commits? "Catastrophic network outage."

(Commenter Andrew noted that it actually averages to 130 commits per day, not 13. But 130 commits in one day would still count as a catastrophic network outage.)
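For the record, here is the arithmetic, using the figures from the quoted claim and assuming 365-day years:

```python
commits = 400_000            # from the quoted claim
years = 8                    # "roughly eight years old"
commits_per_day = commits / (years * 365)
print(round(commits_per_day))   # 137
```

That works out to roughly 137 commits per day, the same order of magnitude as the corrected figure of 130.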

¹ Note that this race applies only to pull requests targeting the same branch, but there is enough activity even within a single branch that these collisions are frequent.

Clarification: I'm not scoffing at the linux kernel. I'm scoffing at the person who told the Windows team "I don't understand why you aren't using git. Git can totally handle large projects. Why, it can even handle a project as large as the linux kernel!"

Comments (40)
  1. Andrew says:

    Nitpick: 130 commits per day, not 13.

  2. Mantas says:

    Of course, not many people expected that the Windows repository would migrate to Git like, ever.

    1. This was in the context of a discussion of Windows source control. One person argued that git works great on large projects and gave the Linux kernel as an example.

      1. Erkin Alp Güney says:

        It is normal, as git-scm was originally developed by Linus Torvalds to keep versions of the Linux kernel.

      2. chris says:

        isn’t the linux repo managed with a handful of people allowed to approve each commit, each with authority over a section of the code? also, yes, the kernel is the moral equivalent of the windows core bit, more or less.

  3. creaothceann says:

    inb4 “yeah, but they’re thirteen high quality commits”

  4. torrinj says:

    Makes me wonder what source control was used before. I’m guessing whatever it was didn’t handle three million files and X number of commits a day.

    1. Smithers says:

      I believe I read they used a custom version of Perforce. Interestingly, where I work recently made the same migration, only we split one Perforce server into dozens of individual git repos.

    2. https://blogs.msdn.microsoft.com/bharry/2017/02/03/scaling-git-and-some-back-story/

      “Source Depot did not scale to the entire Windows codebase. It had been split across 40+ depots so that we could scale it out but a layer was built over it so that, for most use cases, you could treat it like one.”

      I think it’s interesting that Git wasn’t able to handle the Windows sources before Microsoft implemented GVFS.

    3. My rough calculations suggest that the previous system averaged around 12,000 commits and a gigabyte of content and metadata per day.

  5. Smithers says:

    Have you considered using more than one repository? Linux is only an OS kernel, so if that makes for a “large git repo,” an entire OS seems rather excessive. I wonder how many different source repositories go into even a minimal install of, say, Ubuntu.

    Update to bonus chatter: The master branch of https://github.com/torvalds/linux is currently nearly 13 years old with almost 740,000 commits, now putting it north of 155 commits/day. When I download the zip, it extracts to 67,317 items at 748 MB. I’ll bet Windows isn’t going to catch up on one statistic though — GitHub claims the repository has “∞ contributors!”

    1. See this blog post from Brian Harry which discusses the rationale for a monorepo.

      1. Ravi says:

        I don’t buy Brian’s argument on mono repo. There isn’t a good reason why the thousands of components within Windows have to be built from the top all the time, all at once. The mono repo only exists since NT build (or its new monolithic equivalent) demands it. If all components are built and managed at component level, you’d have lots of small repos, each with its own release cadence. The shipping will just be a packaging issue.

        1. Max says:

          Because of Windows’ monolithic history, the components tend to be tied together in ways that mean they can’t be cleanly separated without significant changes.

        2. voo says:

          I’m assuming you actually know what the code looks like and spent several months thinking about this problem, before simply claiming that this would be perfectly simple, despite the empirical evidence suggesting that it’s anything but?

          Just to get started with the really simple problems: How are you going to approach atomic commits across multiple independent repositories? Because you know, in practice that’s rather important on large, grown projects.

          Pretty much every company that has large code bases (if you have less than a million code files, that’s not you) has chosen a mono repo. So either all those people working at Microsoft, Google, Facebook and co are just idiots or it might just be that in practice monorepos work a great deal better than having dozens of smaller repos.

      2. Brian_EE says:

        Can’t the GVFS be combined with submodule approach for even better performance? And how is shallow cloning tied in? Is that a part of your implementation?

        1. I am not the person to ask about why the monorepo design was chosen. I wasn’t part of the decision. But I do know that merging submodules is next to impossible.

  6. max630 says:

    I remember I was quite surprised to learn that they are not queued. I mean, it’s git, it costs nothing to implement.

    Another thing to do is to implement CI build/testing during merge: build from the merged commit, and if the tests fail, reset the target branch to its parent, to avoid issues caused by merging.

  7. Martin Bonner says:

    So the Fall Creators Update has 10 times as many commits as the *entire* Linux kernel.

    It’s with mature large projects like Windows, that I don’t see how git is going to work – doing a “git clone” is going to be a *major* undertaking.

  8. Scott says:

    It sounds like Windows is using one giant monolithic central remote that all the merge requests are being sent to. If that is the case, then that is the root of this problem. Linux has a more hierarchical tree of forks. One developer’s commit might go to a lieutenant’s fork, or even somewhere lower on the tree. Then it will get bundled into another merge request and make its way up until eventually it hits the mainline.

    No doubt Windows is a much larger and more active codebase than Linux. It could still probably go a long way towards solving this problem by adopting a tree of forks model rather than having every MR go right to the top.

    Granted, this creates some other issues when conflicts arise between merge requests at the high levels far away from the original developers. That is why git makes sure all developers names and emails are stamped on their commits. When a problem arises, you know exactly who to call.

    Of course, the queue isn’t really a bad solution either. It would be cool for such a queuing system to be made freely available to anyone else who might need it, few that they may be.

    1. Indeed, Windows uses a hierarchical branch system as well. But even within a branch there can be a lot of contention.

  9. Arek says:

    The number of commits is not a measure of good software quality…

    1. The topic wasn’t quality. It was scalability.

  10. Prashanth says:

    With so many merges and files, what’s the workflow like for running tests/CI?

  11. Pelle says:

    “‘For example, the linux kernel repository at roughly eight years old is 800–900MB in size, has about 45,000 files, and is considered to have heavy churn, with 400,000 commits.’ I found that adorable.” The vast size of the Windows repo is nothing to be proud of, to be honest… I’m more impressed by how small the Linux kernel is.

    1. But if you’re going to tell the Windows team that they should switch to git because it can handle even repos as large as the linux kernel, they’re going to laugh. (This is not a discussion about which kernel is better. It’s a discussion about whether git can handle large repos.)

    2. Thomas Harte says:

      I’m no expert, not by a great distance, but it sounds to me like the Ars comment doesn’t compare like for like. How would you expect the size of Windows 10, an operating system, to compare to that of the Linux kernel, which is just the kernel for an operating system?

    3. Brian_EE says:

      Pelle – apples to apples. Linux kernel is one thing. An entire distribution is another – hundreds of separate packages bundled together. Windows OS is like the latter, not the former.

  12. Qb says:

    I wonder why the entire windows source is handled in a single repository?! Wasn’t there a way to modularize it?

    1. See this blog post from Brian Harry which discusses the rationale for a monorepo.

  13. Mike Diack says:

    Normally, I’m a fan of your posts Raymond, and have learnt a huge amount from you. But the smugness about Linux Git commits is rather unpleasant. Don’t forget that the Linux community could equally have a very easy time pointing out the instability and lack of performance of the Linux kernel compared to Windows – and don’t get me wrong, I’m a huge fan of both OSes – but the smugness leaves a sour taste in the mouth……

  14. Calvin Spealman says:

    Calling the Linux kernel “adorable” is childish and unprofessional, especially today when Microsoft as an organization is so wonderfully open-armed to the wider community. Please learn some basic decorum.

    1. Doug says:

      He’s not calling the Linux kernel adorable.

      He’s saying that it is adorable to use the Linux kernel as an example of a large repo with huge churn that is evidence that Git can scale to Windows-sized repositories.

    2. Mantas says:

      It is also childish for said wider community to call it “Micro$haft Windoze”, but they do it anyway.

  15. Torkell says:

    It’s always amusing when you break someone’s idea of what they think is “crazy big” by an order or two of magnitude. At a previous job, we once had a consultant come in to show off a shiny high-performance database engine… which spectacularly failed to keep up with 12 hours worth of simulated customer traffic.

  16. cheong00 says:

    I got the impression that anything networking related happening within Microsoft can be more challenging than what we imagine.

  17. James Sutherland says:

    Back in 2005 I was hired by a startup of civil engineers doing some financial modelling for the construction industry. They were talking about “big data”, wondering if they should use Oracle for the “big” data files they were working with.

    Spoiler: the correct tool for that job was actually a Perl script, since the “big” data was well under a megabyte. The biggest challenge was that the figures had all been manually typed (so a lot of names didn’t match exactly, and tended to meander haphazardly from column to column within a spreadsheet: when you do all the addition manually, who cares if the price is in column K or column J, or switches between the two part way through?!) – and a few were even supplied in hand-written form. (A decade later, doing very different stuff with related data of a few hundred Mb at a time, we finally had justification to move to MySQL.)

    A decade before this, we had a ticket open with (large OS vendor) about authentication db replication breaking if you created/deleted more than a thousand users in a minute. Which, of course, is pretty much an annual event in a university with an efficient automated admissions system feeding the IT department. (After I moved somewhere larger, we broke Solaris in “fun” ways which I think started with mounting the same file system in more than 256 places, then got weirder – obscure use case involving lots of chroot jails for departmental hosting with containment, back before VMWare had any server options.)

    To return fire on the MS-Unix side: around the same time, someone from MS was getting a machine room tour and suggested Exchange (c Windows 2000 era) for hosting the email. Given the usage figures, as I recall they came up with a figure of 120 x86 servers in a cluster to host that lot – at which point, they were led towards the two small Sun boxes which were doing the job instead. (Later upgraded to three, then replaced with about a dozen Linux machines as the workload increased.) To be fair, Exchange actually seems to scale competently these days thanks to all the bloodshed migrating Hotmail to it…

    Of course, in 20 years time, this post will probably be laughed at by undergrads hitting limits when their project’s AI generates more than 500 commits a second in a metaprogramming project. (Teaching Intro C++ lab last year, one undergrad came to my drop-in clinic for help with his project’s slow compilation times: his several hundred source files were taking about an hour to compile in Visual Studio. Needless to say, his game engine wasn’t actually Intro C++ coursework, but it was a quiet lab and I like interesting challenges… Funny to think how big a job that compilation would have been back when I was an undergrad!)

  18. Osexpert says:

    The switch to git feels forced to me. Since you needed to do several modifications to make git usable, maybe similar investment in other vcs would make it usable too (including the one made by microsoft). Or maybe git is just that much better.

    1. Brian says:

      I’m pretty sure there was a lot of pressure for Windows to use TFS source control. That they didn’t, would tend to indicate to me that Windows was too large a project for TFS (source control). I know that other large MSFT products do use it (or did use it), but nothing is quite as complicated as Windows.

  19. Marvy says:

    Just read the linked article. The claim is that the system can handle 400 pull requests per hour for a single branch.
    That’s one every 9 seconds. This is madness. My personal idea of “moderate churn” is several pull requests per week.
    EVERY 9 SECONDS. I have no words.

