ContentSync – a content-based file copy tool


Suppose you need to sync the contents of two large folders, Source and Destination. Normally robocopy Source Destination /MIR does the job. However, even if the bytes of a file didn’t change, robocopy will still copy the file when its timestamp differs, and will change the timestamp of the destination to match the source.

This is the safe and expected behavior. However, there are whole classes of tools that track file modification based on the timestamp alone. For instance, MSBuild will trigger a cascading rebuild of everything that depends on a file if its timestamp has changed (even if the actual file contents are exactly the same). This is called overbuilding. A scalable build system should detect that the file contents didn’t change and avoid doing any work in this case.

Or take another example: suppose you need to upload hundreds of thousands of files to Azure using MSDeploy. If the timestamp on those files has changed, MSDeploy will upload those files even if the actual content is the same.

In general, if you’re deciding whether a file was modified based on the timestamp alone, you’re bound to schedule unnecessary work that could have been avoided by checking whether the actual file bytes have changed.
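
To make the idea concrete, here is a minimal Python sketch of a content-based check. It is only an illustration of the technique, not the code ContentSync uses; the function name same_content and the chunk size are my own choices for this example.

```python
import os


def same_content(path_a, path_b, chunk_size=1 << 16):
    """Return True if both files hold identical bytes, ignoring timestamps."""
    # Files of different sizes cannot be equal, so skip reading them.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # first differing chunk settles it
            if not chunk_a:
                return True  # both files ended without a difference
```

A tool that calls something like same_content(source_file, destination_file) before copying can skip the file whenever the result is True, no matter what the two timestamps say.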

Long story short, I wrote an open-source tool to sync/mirror directories based on the file content, not the timestamp:

https://github.com/KirillOsenkov/ContentSync

Usage: ContentSync.exe <Source> <Destination>

I needed this for the exact Azure website deployment scenario: I maintain http://source.roslyn.io, and we’ve recently moved from on-premises Microsoft hosting to Azure. I use https://github.com/KirillOsenkov/SourceBrowser to regenerate the website every night, and that’s 160,000 modified files totaling over 1 GB. You don’t want to upload that to Azure every night.

So I inserted an intermediate step into my deployment script – I ContentSync from the freshly built Index folder (it’s not incremental, all the files are generated from scratch every time) into a Staging folder left over from the previous run. ContentSync doesn’t touch the files that haven’t actually changed (and that’s 99.9…%), so their timestamps stay the same and MSDeploy doesn’t upload them.
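
The intermediate step can be sketched roughly as follows in Python. This is an assumption-laden illustration rather than ContentSync's actual implementation: the function name content_sync is hypothetical, symbolic links and file/folder name clashes are ignored, and the byte comparison is delegated to the standard library's filecmp.cmp with shallow=False.

```python
import filecmp
import os
import shutil


def content_sync(source, destination):
    """Mirror source into destination, touching only files whose bytes differ.

    Returns (copied, deleted): lists of paths relative to the folder roots.
    """
    copied, deleted = [], []
    os.makedirs(destination, exist_ok=True)

    # Pass 1: copy files that are new or whose content has changed.
    for root, _dirs, files in os.walk(source):
        rel_root = os.path.relpath(root, source)
        dest_root = os.path.normpath(os.path.join(destination, rel_root))
        os.makedirs(dest_root, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(dest_root, name)
            # shallow=False compares the bytes instead of trusting timestamps.
            if os.path.isfile(dst) and filecmp.cmp(src, dst, shallow=False):
                continue  # identical content: leave the file and its timestamp alone
            shutil.copy2(src, dst)
            copied.append(os.path.normpath(os.path.join(rel_root, name)))

    # Pass 2: delete whatever no longer exists in source (children before parents).
    for root, dirs, files in os.walk(destination, topdown=False):
        rel_root = os.path.relpath(root, destination)
        src_root = os.path.join(source, rel_root)
        for name in files:
            if not os.path.isfile(os.path.join(src_root, name)):
                os.remove(os.path.join(root, name))
                deleted.append(os.path.normpath(os.path.join(rel_root, name)))
        for name in dirs:
            if not os.path.isdir(os.path.join(src_root, name)):
                os.rmdir(os.path.join(root, name))  # already emptied above

    return copied, deleted
```

Calling content_sync("Index", "Staging") before the upload leaves every unchanged file in Staging with its old timestamp, which is what lets a timestamp-based uploader skip it.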

While I’m here, kudos to David Ebbo and his excellent https://github.com/davidebbo/WAWSDeploy that I use to simply deploy a folder to Azure (read more here: http://blog.davidebbo.com/2014/03/WAWSDeploy.html). It uses MSDeploy internally and hides all the details I don’t want to know about behind a super simple UX.


Comments (8)

  1. Betty says:

    Was the usechecksum option not suitable?

  2. Betty – I don't know about the usechecksum option. Is this in robocopy? Do you have more details?

  3. Now this is embarrassing 🙂 Betty, thanks for the pointer 🙂 Turns out usechecksum is an MSDeploy option. I should try that. Don't know how I missed that.

    Now the whole tool wasn't really needed in my case!

  4. Well, no wonder I couldn't find the option previously. It is not listed in msdeploy /? help.

  5. Daryl says:

    Should this blog post be updated into a how-to for "UseCheckSum"?

  6. Since I'm not an expert on msdeploy I'd rather not write about something I don't know about and have never used. After a quick search it seems that that functionality has bugs and maybe other things I'm not aware of. I'd rather leave this as an exercise to the reader 🙂

  7. Sergio says:

    Thanks for publishing awesome stuff here!

    I use github.com/…/Octodiff (a la rdiff – used by the Octopus Deploy team to deploy a release on the destination machine) to deploy only the changed part of a package.

    Octodiff is supposed to work on zip or any other large binary file. It may not be so effective in your case – it's just one more solution. ^)

  8. Sergio – thanks! Octodiff seems to be complementary to this tool – it knows how to efficiently copy a single file (whereas I just use File.Copy). Also my tool is extra bad in that respect: it reads the remote file twice (first to compare, then to actually copy).
