WSS Rant – Linking to the latest version of a SharePoint document considered harmful. (Lessons 6 and 7)

SharePoint does not have persistent hyperlinks for all document revisions.

More specifically – SharePoint has two methods for linking to documents: the canonical path and the revision paths.

The canonical path is the latest copy of the document at the time of the download request. Every time a new revision is created the contents of the canonical path change. Canonical paths look like this:

https://<servername>/Shared%20Documents/document.doc

Revision paths are paths to historical revisions of the document. More specifically, they are paths to every version except the tip (latest) one. Revision paths exist under the _vti_history location and have the version number in the path. They look like this:

https://<servername>/_vti_history/7/Shared%20Documents/document.doc

That link is for revision 7 of the document.
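
As a minimal sketch of the two URL shapes, here is how they could be built in Python. The helper names canonical_url and revision_url are mine, not part of any SharePoint API, and the sketch assumes the version number appears in the path exactly as in the example above.

```python
from urllib.parse import quote


def canonical_url(server: str, doc_path: str) -> str:
    """Build the canonical (latest version) URL for a document."""
    # quote() percent-encodes spaces but leaves "/" alone by default.
    return f"https://{server}/{quote(doc_path)}"


def revision_url(server: str, version: int, doc_path: str) -> str:
    """Build the _vti_history URL for one historical revision."""
    return f"https://{server}/_vti_history/{version}/{quote(doc_path)}"


print(canonical_url("server.example", "Shared Documents/document.doc"))
# https://server.example/Shared%20Documents/document.doc
print(revision_url("server.example", 7, "Shared Documents/document.doc"))
# https://server.example/_vti_history/7/Shared%20Documents/document.doc
```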

When I first saw this I assumed that all documents were accessible via revision paths, and that the canonical path was just an abstraction over the latest revision. I was incorrect.

You can prove this to yourself by taking any document in a WSS repository and trying to request the latest version by the revision path that would contain its content. For example, if you add a new file, it has only one revision, so its revision path (if it could have one) would be:

https://<servername>/_vti_history/1/Shared%20Documents/document.doc

Requesting this fails with an HTTP 400 (Bad Request) response. Requesting the canonical path works as expected. If a new revision of the document is created, the revision path for version 1 becomes valid and the canonical path now refers to version 2 (and, as before, the revision path for version 2 is not valid).
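
To make that behavior concrete, here is a toy in-memory model of a single document, written in Python. It is not SharePoint code and makes no network calls; the class and method names are invented for illustration. It only mimics the rule described above: the tip is reachable through the canonical path alone, and older versions through revision paths alone. Returning 400 for version numbers that do not exist at all is an assumption of the toy.

```python
from typing import List, Tuple


class ToyDocument:
    """In-memory stand-in for one document's paths (illustrative only)."""

    def __init__(self, first_content: bytes) -> None:
        self._revisions: List[bytes] = [first_content]  # index 0 is version 1

    @property
    def latest_version(self) -> int:
        return len(self._revisions)

    def add_revision(self, content: bytes) -> None:
        """Create a new tip; the old tip becomes a historical revision."""
        self._revisions.append(content)

    def get_canonical(self) -> Tuple[int, bytes]:
        """The canonical path always serves whatever the tip is right now."""
        return 200, self._revisions[-1]

    def get_revision(self, version: int) -> Tuple[int, bytes]:
        """Revision paths serve only non-tip versions; the tip yields 400."""
        if 1 <= version < self.latest_version:
            return 200, self._revisions[version - 1]
        return 400, b""  # out-of-range versions also get 400 (toy assumption)


doc = ToyDocument(b"first draft")
print(doc.get_revision(1)[0])   # 400: version 1 is the tip
doc.add_revision(b"second draft")
print(doc.get_revision(1)[0])   # 200: version 1 is now historical
print(doc.get_revision(2)[0])   # 400: version 2 is the new tip
```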

Why is this a problem?

Enumeration of the document library is slow. The point where we analyze the changes to migrate and the point where we actually perform the migration could be seconds, minutes or hours apart. During that time the document may have been updated. When that happens, the version we cared about is demoted from the canonical path down to a revision path. If we downloaded the canonical path we would end up getting the wrong content.

At first my plan was to store the canonical path and then, during migration, check that the canonical path still refers to the version we want before downloading. But that just narrows the window on the race condition: the document can still change between the check and the download. It does not eliminate it.

I ended up doing the following:

When analyzing an item we know whether we are looking at the latest version (canonical path) or a historical revision (revision path). We store this fact as a boolean along with the path necessary to download the document (the canonical or revision path) and the version number of the item. During migration we download the item from the stored download path. Next, if the document was the latest version (i.e. a canonical path), we check after the download that the canonical path still refers to the document version we needed. If it does, we are done. If it does not, we enumerate the revision history to find the revision we want. If it exists, we download that specific revision and continue. If it does not, the revision was deleted and we handle that condition.
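
The steps above can be sketched roughly as follows in Python. This is an illustration under stated assumptions, not the actual migration code: the Server interface and its three methods (download, current_version, revision_paths) are hypothetical stand-ins for whatever calls the real tool makes, and AnalyzedItem and RevisionDeleted are names I made up for the sketch.

```python
from dataclasses import dataclass
from typing import Dict, Protocol


class RevisionDeleted(Exception):
    """The revision we wanted no longer exists on the server."""


@dataclass(frozen=True)
class AnalyzedItem:
    download_path: str  # canonical or revision path stored during analysis
    version: int        # the version number we decided to migrate
    was_latest: bool    # True if download_path is a canonical path


class Server(Protocol):
    """Hypothetical interface; none of these are real WSS calls."""

    def download(self, path: str) -> bytes: ...
    def current_version(self, canonical_path: str) -> int: ...
    def revision_paths(self, canonical_path: str) -> Dict[int, str]: ...


def fetch(server: Server, item: AnalyzedItem) -> bytes:
    """Return the content of exactly item.version, or raise RevisionDeleted."""
    # Optimistic download first: in the common case nothing has changed.
    content = server.download(item.download_path)
    if not item.was_latest:
        return content  # a revision path always names one fixed version

    # Check AFTER downloading. If the tip is still our version, the bytes
    # we already hold are the right ones.
    if server.current_version(item.download_path) == item.version:
        return content

    # The document moved on, so our version is now a historical revision.
    revisions = server.revision_paths(item.download_path)
    if item.version not in revisions:
        raise RevisionDeleted(f"version {item.version} of {item.download_path}")
    return server.download(revisions[item.version])
```

Checking the version after the download, rather than before it, is what closes the race: a check that passes after the fact vouches for bytes we already hold, and the unchanged case costs only one extra request.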

I believe this approach guarantees that the content we end up with is always the proper content, and that it does so in the minimum number of round-trips to the server.

All because canonical paths are not just an abstraction over revision history.

Lesson 6: The conceptually simplest operation can end up being incredibly complex to perform correctly.

Lesson 7: When representing data (such as document revisions) make all the data available through a uniform mechanism. Shortcuts to the latest revision of the data should be an abstraction over that mechanism.