More on GVFS

After watching a couple of days of GVFS conversation, I want to add a few things.

What problems are we solving?

GVFS (and the related Git optimizations) really solves 4 distinct problems:

  1. A large number of files – Git doesn’t naturally work well with hundreds of thousands or millions of files in your working set.  We’ve optimized it so that operations like git status are reasonable, commit is fast, push and pull are comfortable, etc.
  2. A large number of users – Lots of users create 2 pretty direct challenges.
    1. Lots of branches – Users of Git create branches pretty prolifically.  It’s not uncommon for an engineer to build up ~20 branches over time.  Multiply 20 by, say, 5000 engineers and that’s 100,000 branches.  Git just won’t be usable at that scale.  To solve this, we built a feature we call “limited refs” into our Git service (Team Services and TFS) that causes the service to project only the branches “you care about” to your Git client.  You can favorite the branches you want and Git will be happy.
    2. Lots of pushes – Lots of people means lots of code flowing into the server.  Git has critical serialization points that will cause a queue to back up badly.  Again, we did a bunch of work on our servers to handle the serialized index file updates in a way that causes very little contention.
  3. Big files – Big binary files are a problem in Git because Git copies all the versions to your local Git repo, which makes for very slow operations.  GVFS’s virtualized .git directory means it only pulls down the files you need when you need them.
  4. Big .git folder – This one isn’t exactly distinct.  It is related to a large number of files and big files but, more generally, the combination of lots of files, lots of history and lots of binary files creates a huge and unmanageable .git directory that gobbles up your local storage and slows everything down.  Again, GVFS’s virtualization only pulls down the content you need, when you need it, making it much smaller and faster.

There are other partial solutions to some of these problems – like LFS, sparse checkouts, etc.  We’ve tackled all of these problems in an elegant and seamless way.  It turns out #2 is solved purely on the server – it doesn’t require GVFS and will work with any Git client.  #1, #3 and #4 are addressed by GVFS.
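To make the “limited refs” idea from #2 a bit more concrete, here’s a rough sketch in Python.  This is not how Team Services implements it; it only illustrates the effect of projecting a user’s favorites instead of every branch.  The function name visible_refs, the ref naming scheme and the commit ids are all made up for the example.

```python
def visible_refs(all_refs, favorites):
    """Return only the refs a given user has marked as favorites.

    all_refs:  dict mapping ref name -> commit id (everything on the server)
    favorites: set of ref names this user cares about
    Both the function and the data layout are hypothetical.
    """
    return {name: sha for name, sha in all_refs.items() if name in favorites}


# Scale from the text: ~20 branches per engineer times 5000 engineers.
ENGINEERS = 5000
BRANCHES_EACH = 20

# Hypothetical ref names and commit ids, just to get the numbers right.
all_refs = {
    f"refs/heads/users/eng{e}/topic{b}": f"commit-{e}-{b}"
    for e in range(ENGINEERS)
    for b in range(BRANCHES_EACH)
}
all_refs["refs/heads/master"] = "commit-master"

my_favorites = {"refs/heads/master", "refs/heads/users/eng42/topic7"}
advertised = visible_refs(all_refs, my_favorites)

print(len(all_refs))    # 100001 (100,000 topic branches plus master)
print(len(advertised))  # 2
```

The client only ever sees the short list, so it behaves as if the repo had a handful of branches.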

GVFS really is just Git

One of the other things I’ve seen in the discussions is how we are turning Git into a centralized version control system (and hence removing all the goodness).  I want to be clear that I really don’t believe we are doing that and would appreciate the opportunity to convince you.

Looking at the server from the client, it’s just Git.  All TFS and Team Services hosted repos are *just* Git repos.  Same protocols.  Every Git client that I know of in the world works against them.  You can choose to use the GVFS client or not.  It’s your choice.  It’s just Git.  If you are happy with your repo performance, don’t use GVFS.  If your repo is big and feeling slow, GVFS can save you.

Looking at the GVFS client, it’s also “just Git” with a few exceptions.  It preserves all of the semantics of Git: the version graph is a Git version graph and the branching model is the Git branching model.  All the normal Git commands work.  For all intents and purposes you can’t tell it’s not Git.  There are three exceptions.

  1. GVFS only works against TFS and Team Services hosted repos.  The server must have some additional protocol support to work with GVFS.  Also, the server must be optimized for large repos or you aren’t likely to be happy.  We hope this won’t remain the case indefinitely.  We’ve published everything a Git server provider would need to implement GVFS support.
  2. GVFS doesn’t support Git filters.  Git filters transform file content on the fly during a retrieval (like end of line translations).  Because GVFS is projecting files into the file system, we can’t transform the file on “file open”.
  3. GVFS has limits on going offline.  In short, you can’t do an offline operation if you don’t have the content it needs.  However, if you do have the content, you can go offline and everything will work fine (commits, branches, everything).  In the extreme case, you could pre-fetch everything and then every operation would just work – but that would kind of defeat virtualization.  In a more practical case, you could just pre-fetch the content of the folders you generally use and leave off the stuff you don’t.  We haven’t built tools yet to manage your locally cached state but there’s no reason we (or you) can’t.  With proper management of pre-fetching, GVFS can even give a great, full-featured offline experience.
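The offline behavior in #3 is easier to see with a toy model.  The sketch below is plain Python, not GVFS code: a dictionary stands in for the server, and the names VirtualizedRepo, read, prefetch and OfflineError are all invented for illustration.  It shows content being pulled down on first use, and an offline read working only when the content is already cached.

```python
class OfflineError(Exception):
    """Raised when content is needed but is neither cached nor reachable."""


class VirtualizedRepo:
    """Toy model of on-demand content; every name here is hypothetical."""

    def __init__(self, server_files):
        self._server = server_files   # path -> content, stands in for the remote
        self._cache = {}              # content already pulled down locally
        self.online = True

    def read(self, path):
        # Serve from the local cache when possible; otherwise fetch on demand.
        if path not in self._cache:
            if not self.online:
                raise OfflineError(f"{path} is not cached and we are offline")
            self._cache[path] = self._server[path]
        return self._cache[path]

    def prefetch(self, folder):
        # Pull down everything under one folder ahead of time.
        prefix = folder.rstrip("/") + "/"
        for path in self._server:
            if path.startswith(prefix):
                self.read(path)

    def cached_paths(self):
        return sorted(self._cache)


repo = VirtualizedRepo({
    "src/app/main.c": "int main(void) { return 0; }",
    "src/app/util.c": "/* helpers */",
    "assets/big.bin": "...lots of bytes...",
})

repo.prefetch("src/app")   # cache the folder I usually work in
repo.online = False        # now go offline

print(repo.read("src/app/main.c"))   # works: the content is cached
try:
    repo.read("assets/big.bin")      # fails: it was never fetched
except OfflineError as err:
    print("offline:", err)
```

Pre-fetching every folder would make every read succeed offline, at the cost of holding the whole repo locally, which is the trade-off described above.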

That’s all I know of.  Hopefully, if GVFS takes off, #1 will go away.  But remember, if you have a repo in GVFS and you want to push to another Git server, that’s fine.  Clone it again without the GVFS client, add a remote to the alternate Git server and push.  That will work fine (ignoring the fact that it might be slow because it’s big).  My point is, you are never locked in.  And #3 can be improved with fairly straightforward tooling.  It’s just Git.
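For the “you are never locked in” point, the clone, add-a-remote and push steps look roughly like this.  It’s a small Python wrapper that builds ordinary git commands and, by default, only prints them.  The URLs, the directory name, the remote name “other” and the branch name “master” are placeholders I picked for the example; substitute your own.

```python
import subprocess


def migration_commands(source_url, target_url, workdir,
                       remote="other", branch="master"):
    """Build the git commands for: plain clone, add a second remote, push.

    All argument values are supplied by the caller; the defaults for
    `remote` and `branch` are illustrative assumptions.
    """
    return [
        ["git", "clone", source_url, workdir],
        ["git", "-C", workdir, "remote", "add", remote, target_url],
        # Pushes just the named branch; repeat for other branches as needed.
        ["git", "-C", workdir, "push", remote, branch],
    ]


def run(commands, dry_run=True):
    # Print each command; execute it only when dry_run is False.
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


cmds = migration_commands(
    "https://example.invalid/source/repo.git",   # placeholder URL
    "https://example.invalid/target/repo.git",   # placeholder URL
    "repo-plain",
)
run(cmds)  # dry run: prints the commands without executing them
```

Nothing here is specific to GVFS; it’s a regular Git client talking to two regular Git servers.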

Hopefully this sheds a little more light on the details of what we’ve done.  Of course, all the client code is in our GitHub project so feel free to validate my assertions.