Why does cloning from VSTS return old unreferenced objects?


Note: “core Git” refers to the official base Git implementation, as opposed to Visual Studio, GitHub, or VSTS, which may involve non-standard implementations or behavior.

 

A customer asked:

We removed some unwanted binaries from our repo on visualstudio.com by following the instructions at https://help.github.com/articles/remove-sensitive-data/. We force-pushed to master and deleted all our other branches.

After running git gc locally, our local repo is now 5 MB, but git clone from visualstudio.com still returns 100MB. The old unreferenced blobs are still being sent down by the server.

How do we git gc (or some equivalent) on the server as well?

There are two issues here:

  1. There is no equivalent to git gc on VSTS yet.

    Our server preserves the history of every ref/branch update to Git repos, including deleted branches. This is analogous to the “reflog” in core Git. On VSTS, we expose the reflog via the REST API and the Branch Updates (i.e. pushes) tab in Web Access. As in core Git, objects referenced from the reflog are still considered referenced and will not be deleted by git gc. Core Git eventually expires old reflog entries (via git reflog expire, which git gc runs by default), after which the objects can be pruned, but VSTS does not have that functionality yet.

  2. Large fetches are expensive for the server to calculate, so we cheat a little.

    Large fetches (and clones) have historically been very expensive in both core Git and VSTS due to the “counting objects” phase. http://githubengineering.com/counting-objects/ has a nice explanation of the problem, as well as how core Git and GitHub have (cleverly) improved performance with bitmap indexes.

    Unfortunately, VSTS does not have that perf fix yet. Instead, it cheats a bit and blindly streams back every object that exists on the server if the client has nothing and asks for all branches and tags (e.g. for git clone). This is generally reasonable, until a user decides to dereference most of the objects in their repo to save space!

I suspect that the customer would not have minded the lack of gc in his scenario if we only sent reachable objects during clone.
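
The reflog behavior described in the first issue can be reproduced locally with core Git. The following is a minimal sketch, not a VSTS procedure: it builds a throwaway repository (the file names, branch name, and identity are made up for illustration), deletes a branch, and checks whether the branch's blob survives git gc before and after the reflog is expired.

```shell
#!/bin/sh
set -e

# Throwaway repository; all names below are illustrative assumptions.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

echo "keep" > keep.txt
git add keep.txt
git commit -q -m "keep"

# Commit an unwanted file on a side branch, then delete that branch.
git checkout -q -b big
echo "unwanted" > big.bin
git add big.bin
git commit -q -m "add unwanted file"
blob=$(git rev-parse HEAD:big.bin)
git checkout -q -
git branch -q -D big

# gc alone: HEAD's reflog still mentions the deleted branch's commit,
# so the blob counts as referenced and is kept.
git gc -q --prune=now
git cat-file -e "$blob" 2>/dev/null && before=present || before=missing

# Expire every reflog entry, then gc again: nothing references the blob now.
git reflog expire --expire=now --all
git gc -q --prune=now
git cat-file -e "$blob" 2>/dev/null && after=present || after=missing

echo "before=$before after=$after"
```

On a stock Git install this should report the blob as present after the first gc and missing after the second. VSTS has no counterpart to the second half of the script, which is the gap the first issue describes.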

Until these issues are fixed for VSTS, what workarounds are there?

  • Delete the repo from the server (EDIT: or rename it) and re-push it.

    This works, but is sub-optimal. In the new repo, you won’t be able to see old pull request details, branch update history, or any links from other areas like builds or work items.

  • Trick the server by not cloning everything at once:

    mkdir newRepo
    cd newRepo
    git init
    # <repo-url> is a placeholder; substitute your repository's clone URL
    git remote add origin <repo-url>
    # fetch one branch first
    git fetch origin master
    # fetch everything else
    git fetch origin
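
To check whether the two-step fetch actually produced a smaller repository than a plain git clone, the on-disk object size of each can be compared. A minimal sketch, assuming a helper named repo_object_kib (a hypothetical name, not a Git command) that sums the loose and packed sizes reported by git count-objects:

```shell
# Print the on-disk object size (loose + packed) of a repository, in KiB.
# repo_object_kib is a hypothetical helper name chosen for this example.
repo_object_kib() {
    git -C "$1" count-objects -v |
        awk '/^size:|^size-pack:/ { total += $2 } END { print total + 0 }'
}
```

For example, running `repo_object_kib newRepo` and then the same command against a directory created by a plain git clone gives two numbers to compare.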
    

Comments (2)

  1. I’ve just spent a better part of the day trying to shrink the repo on TFS 2015, now I know why everything failed. Thanks for this post, it is really helpful.

    What are the plans for fixing this?

    1. We’re actively working on the perf improvements that would allow us to remove the perf hack so clones only send reachable objects (so your clones will be smaller). Actual git gc on the server is in our backlog, but isn’t scheduled yet.

      Sorry for not seeing this earlier! The comment notification/moderation settings got reset when we switched our blogging platform to WordPress.
