Git perf and scale

Matt Cooper

November 2nd, 2016

New features and UI changes naturally get a lot of attention. Today, I want to spotlight the less visible work that we do on Team Services: ensuring our performance and scale meet our customers’ needs now and in the future. We are constantly working behind the scenes profiling, benchmarking, measuring, and iterating to make every action faster. In this post, I’ll share three of the dozens of improvements we’ve made recently.

First up, we’ve sped up pull request merges significantly. We have an enormous “torture test repo” (tens of GBs across millions of files and 100K+ folders) we use for perf and scale testing. Median merge time for this repo went from 92 seconds to 33 seconds, a 64% reduction. We also saw improvements for normal-sized repos, but it’s harder to generalize their numbers in a meaningful way.

Several changes contributed to this gain. One was adopting a newer version of LibGit2. Another was altering LibGit2’s caching strategy – its default wasn’t ideal for the way we run merges. As a customer, you’ll notice the faster merges when completing PRs. For our service, it means we can serve more users with fewer resources.

Second, an engineer on a sister team noticed that one of our ref lookups exhibited O(N) behavior. Refs are the named pointers behind branches and tags in Git, and we have to look them up to display branch names on the web. If you’re familiar with the time complexity of algorithms, you’ll recall that O(N) behavior means the work a program does scales linearly with the size of its input.

The work done in this particular lookup scaled linearly with the number of branches in a repository. Up to several hundred refs, this lookup was “fast enough” from a human’s point of view. Humans are quite slow compared to computers 😉

Every millisecond counts in web performance, and there’s no reason to do excess work. We rewrote that lookup to run in constant time, O(1), with respect to the number of branches.
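To make the difference concrete, here is a minimal sketch in Python. It is not our service code: the ref names, commit IDs, and function names are made up for illustration, and it assumes the fix amounts to replacing a scan with a keyed lookup. The first function walks the list of refs, so its cost grows with the number of branches. The second builds an index once and then answers each lookup in constant time on average.

```python
from typing import Optional

# Hypothetical ref data: (ref name, commit id) pairs for one repository.
REFS = [
    ("refs/heads/main", "a1b2c3"),
    ("refs/heads/feature/login", "d4e5f6"),
    ("refs/heads/release/1.0", "0718ab"),
]


def find_ref_linear(refs: list[tuple[str, str]], name: str) -> Optional[str]:
    """O(N): compare against every ref until the name matches."""
    for ref_name, commit_id in refs:
        if ref_name == name:
            return commit_id
    return None


def build_ref_index(refs: list[tuple[str, str]]) -> dict[str, str]:
    """One-time O(N) pass that maps each ref name to its commit id."""
    return dict(refs)


def find_ref_indexed(index: dict[str, str], name: str) -> Optional[str]:
    """Average-case O(1): a single hash lookup, however many refs exist."""
    return index.get(name)
```

With three refs the two approaches are indistinguishable. With tens of thousands of branches, the scan does tens of thousands of comparisons per lookup while the indexed version still does one.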

The last improvement requires a bit more explanation. At various points in our system, we need to track the history of a file: which commits touched this file? Our initial implementation (which served us well for several years) was to track each commit in a SQL table which we could query by file path or by commit.
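As a rough picture of that kind of table, here is a small sketch using Python's built-in sqlite3 module. The table name, column names, and sample rows are all hypothetical, and it assumes one row per (commit, changed path) pair. The real schema lived in SQL Server and was more involved.

```python
import sqlite3

# In-memory database standing in for the real SQL tier.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row for every path a commit changed.
conn.execute(
    """
    CREATE TABLE commit_changes (
        commit_id TEXT NOT NULL,
        file_path TEXT NOT NULL,
        PRIMARY KEY (commit_id, file_path)
    )
    """
)
# The primary key serves lookups by commit; this index serves lookups by path.
conn.execute("CREATE INDEX ix_changes_by_path ON commit_changes (file_path)")

conn.executemany(
    "INSERT INTO commit_changes (commit_id, file_path) VALUES (?, ?)",
    [
        ("c1", "src/app.cs"),
        ("c1", "README.md"),
        ("c2", "src/app.cs"),
        ("c3", "docs/intro.md"),
    ],
)


def commits_touching(path: str) -> list[str]:
    """Which commits touched this file?"""
    cur = conn.execute(
        "SELECT commit_id FROM commit_changes WHERE file_path = ? ORDER BY commit_id",
        (path,),
    )
    return [row[0] for row in cur]


def files_changed_in(commit_id: str) -> list[str]:
    """Which files did this commit touch?"""
    cur = conn.execute(
        "SELECT file_path FROM commit_changes WHERE commit_id = ? ORDER BY file_path",
        (commit_id,),
    )
    return [row[0] for row in cur]
```

Under that assumption, the row count is roughly the number of commits times the number of files each one changes, which is how a table like this can end up far larger than the repository it describes.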

Fast forward several years. One of the oldest repos on our service is the one which holds the code for VSTS itself. The SQL table tracking its commits had grown to 90GB (many, many times the size of the repo itself). Even after the usual tricks like schema changes and SQL page compression, we weren’t able to get the table size down to an acceptable level. We needed to rethink the problem.

The team spent 3+ months designing and implementing a fast, compact representation of the Git graph. This representation is small enough to keep in memory on the application tier machines, which themselves are cheaper to operate than SQL machines. The change was carefully designed and implemented to be 100% transparent to end customers. Across a variety of measurements, we found no noticeable performance regressions and in many cases saw improvements.
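This post doesn't describe the actual representation, so the following is only a toy sketch of the general idea, with every name and design choice an assumption of the example: give each commit and path a small integer, keep parent links in flat arrays, and answer "which commits touched this file?" by walking the graph in memory.

```python
from array import array


class CompactGraph:
    """Toy in-memory commit graph (illustrative only, not the VSTS format)."""

    def __init__(self) -> None:
        self._index: dict[str, int] = {}       # commit id -> small integer
        self._ids: list[str] = []              # small integer -> commit id
        self._parents = array("I")             # all parent links, flattened
        self._parent_start = array("I", [0])   # offsets into _parents
        self._path_ids: dict[str, int] = {}    # path -> small integer
        self._changed: list[frozenset] = []    # per commit: changed path ids

    def add_commit(self, commit_id: str, parents: list[str], changed: list[str]) -> None:
        """Register a commit. Parents must be added before their children."""
        self._index[commit_id] = len(self._ids)
        self._ids.append(commit_id)
        for parent in parents:
            self._parents.append(self._index[parent])
        self._parent_start.append(len(self._parents))
        self._changed.append(
            frozenset(self._path_ids.setdefault(p, len(self._path_ids)) for p in changed)
        )

    def history(self, tip: str, path: str) -> list[str]:
        """Commits reachable from `tip` that changed `path`, newest first."""
        path_id = self._path_ids.get(path)
        if path_id is None:
            return []
        seen: set[int] = set()
        hits: list[int] = []
        stack = [self._index[tip]]
        while stack:
            i = stack.pop()
            if i in seen:
                continue
            seen.add(i)
            if path_id in self._changed[i]:
                hits.append(i)
            start, end = self._parent_start[i], self._parent_start[i + 1]
            stack.extend(self._parents[start:end])
        # Parents are always added first, so a larger index means a newer commit.
        return [self._ids[i] for i in sorted(hits, reverse=True)]
```

For example, after adding a root commit `c1` that changes `a.txt` and a child `c2` that changes it again, `history("c2", "a.txt")` returns `["c2", "c1"]`. Storing integers instead of full hashes and paths is one common way to shrink a structure like this; a production design needs far more care around persistence, updates, and renames.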

We were able to completely drop the commit change tracking table, freeing up dozens of gigabytes on every scale unit’s database tier. We finished migrating to the new system over 2 months ago. Besides a handful of incidents during early dogfooding, we have not received complaints about either its performance or correctness. (I’m flirting with chaos making such claims, of course. If you have a scenario where performance regressed since the beginning of September, email me so we can investigate.)

This explanation leaves out a lot of details in favor of brevity. If there’s interest, we’re thinking of doing a series of blog articles on how our Git service works under the hood. Let me know in the comments what you want to hear more about.

Thanks to the VC First Party team [Wil, Jiange, Congyi, Stolee, Garima, Saeed, and others] for their insights on this topic. All remaining errors are mine alone.