How TFS stores files and calculated deltas on versioned files

 

Recently I got a request from a customer to help them estimate the expected size of a proposed TFS 2010 environment, and to help them estimate a ten-year growth prediction. This customer was focusing narrowly on the overhead required to create a new branch. This customer was asking "how much database space is needed when creating a new branch?” With this information, this customer believed they could “know how many branch they can cut, and how much space they need to add on storage”

Some of their specific questions include:

  • When creating a new branch (for example branch MAIN -> DEV), what actually happens to the data?
  • Are all files from the source branch (MAIN) copied to the target branch (DEV)?
  • Is just metadata added to the target branch (DEV) when it is created (such that no file copies occur)
  • What is saved when a file is edited in this new branch (DEV) and changes are checked-in to the branch?
  • Is a new copy of the entire file containing the changes saved in the branch (DEV)?
  • Is only a delta of the changes made saved when the file is checked in to the branch (DEV)?
  • What happens to database size when a merge operation (reverse integration from DEV to MAIN) takes place?
  • If some changes are merged from a source branch (DEV) back to a target branch (MAIN), will the size of the target branch (MAIN) only increase to reflect the delta, or will a new copy of the changed file(s) be saved to the target branch (MAIN)?
  • Is there guidance for estimating database sizing when creating new branches?
  • What do we tell customers about planning branches regarding to server capacity (disk space, database size, and so on).
  • If TFS always saves deltas (diffs) when changes are checked-in (regardless if this is the result of creating a new branch, or editing file and checking in changes to the same-branch), isn’t it pretty time-consuming to get the latest version of a file that has been edited frequently and has lots of changesets.

To answer some of these questions it helps to have an understanding of how TFS 2010 stores changes to files. First I want to thank the TFS Product Group, in particular Justin Pinnix for contributing these technical details and giving me permission to blog on this.

TFS uses a “reverse delta process” to store changes. The best way to understand this process is to start by considering a single file. When a file is added or changed in TFS Source Control, it will be stored in one of three ways:

  1. As a full, uncompressed file
  2. As a zipped file (using Gzip)
  3. As a delta (storing just the diffs between two versions of a file)

When a file is initially added to TFS Source Control, it will be stored using method 1 (full, uncompressed file) or 2 (compressed or zipped), whichever is appropriate based on the type and size of the file. 

Next, a user checks out this file, makes changes (edits), and checks-in a new version of the file with these changes.  When the check in happens, a new (second) full version of the file is saved to the database using the same compression rules as the first version.  If both the current version and previous version of the file are less than a certain size (16MB), the previous version will be considered for “deltafication”, which is done by a background job.  This background job computes a delta (between the current and previous version).  If size of the resulting delta is smaller than size of the previous version of the file, the delta will replace the previous version of the file.

When you create a new branch (for example from MAIN -> DEV), a shallow copy of each of the files on the source branch (MAIN) is made. Following this, both branches refer to the same copy of each file (the file content), and only metadata is copied to the new branch (DEV). Thus, when a new branch is created, only metadata is copied. Only when the first real change is made in this new branch, is a new deltafication chain is started for changed file(s).

Following the branch operation, a user checks-out a file from the source branch (MAIN), makes changes (edits), and checks-in a new version of this file with these changes, The previous version of this file (which is now used by both the MAIN and DEV branches in this example) will be considered for deltafication. 

However, a user checks-out a file from the target branch (DEV), makes changes, and checks-in a new version of the file into this branch (DEV), a new chain is started and no deltafication happens.

Next, if the user once again checks-out this same file from the target branch, makes changes, and checks-in a new version to this branch (DEV), then the previous version of the file in the target branch (DEV) will be considered a candidate for deltafication.

TFS will also compute deltas for binary files. It is difficult to predict database growth based on the sizes of these deltas. The results from deltafication vary greatly depending on what type of file it is, and the nature of the changes.

Another consideration is to understand that adding lines to a file are far cheaper than deleting lines from a file, because these adding lines to a file introduce no new content in a reverse delta scenario.  As an example, start with the current version (version 1) of a file with the following content:

    LineA
    LineB
    LineC

Next, check-out this file, update it to look like this, and check-in a new version (version 2):

    LineA
    LineD
    LineB
    LineC

A traditional forward diff algorithm would produce a forward delta (from version 1 to version 2) of:

       At line 2, insert the text “LineD”

However, TFS uses reverse deltas (from version 2 to version 1), so the delta would look like:

       Delete line 2

The reverse delta is smaller, because it is just a line number and does not have to include any new content (text).  The algorithm TFS uses is more complex, but it has the same basic property of adds to a file being cheaper than deletes from a file with reverse deltas.

Adding and deleting entire files doesn’t significantly affect the space used, since it’s only a metadata operation.

The TFS deltafication (diff) algorithm makes heavy use of compression, so compressibility of the data makes a big difference.  This is the same algorithm used for Windows Update, so Win32 EXEs and DLLs compress quite well with this algorithm.

When storing a full version of a file, TFS will attempt to GZip it if it is less than 2GB. If the file is larger than 2GB, or if the compressed version doesn’t make an improvement, then TFS stores the file uncompressed.

Shelvesets also use the same mechanism for storing their content.  However, with shelvesets, no deltafication happens. Each shelveset gets a new copy of its file(s). (except in the case of a merge).  When a shelveset is checked-in, a shallow copy occurs and the committed version of a file uses the same content as the shelveset copy of the file referenced.  Deltafication will be started on the previous version of the file(s).

This deltafication mechanism applies to the database storage layer.  Both the Application Tier (AT) and proxy (cache) servers cache files that go between the clients and the database.  This cache saves full versions only, no deltas.  If a previous version of a file becomes popular (frequently retrieved from the server), each request will not have to pay the cost of reconstituting it from deltas.  Once the cache is populated with a full copy of a specific version of file, it can be retrieved as quickly as the current version is.

The diff algorithm itself is based on a compression algorithm.  So, you can think of the deltas as compressed.

Note: The #1 space complaint the TFS team gets for TFS 2010 is not related to version control files at all, but rather is related to the size of test result attachments.  Test result attachments are best cleaned up using the Attachment Cleaner Power Tool.