WSS Rant - finding out what changed is much more difficult than it should be. (Lessons 4 and 5)

In my previous rant I made this statement (quoted out of context):

Migrating WSS content to TFS? Piece of cake. Just enumerate each WSS change and replay the action in TFS. If you are doing an incremental migration then just enumerate those changes that have occurred since the last time you checked for changes.

There are two base actions we need to support:

1) Enumerating all revision history for a document library

2) Enumerating all revision history for a document library after a specific point in time

At the end of the day these boil down to the same requirement (#2) with a default start time that precedes the first revision in the document library.
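A minimal sketch of that collapse, in Python and against made-up in-memory data (the `REVISIONS` list and the `revisions_since` helper are hypothetical, not part of any WSS or TFS API):

```python
from datetime import datetime

# Hypothetical stand-in for a document library's revision history:
# (document path, version label, created timestamp).
REVISIONS = [
    ("Shared Documents/spec.doc", "1.0", datetime(2007, 3, 1, 9, 15)),
    ("Shared Documents/spec.doc", "2.0", datetime(2007, 3, 2, 14, 30)),
    ("Shared Documents/plan.xls", "1.0", datetime(2007, 3, 5, 11, 0)),
]


def revisions_since(revisions, since=datetime.min):
    """Return revisions created strictly after `since`, oldest first.

    The default of datetime.min precedes any real revision, so calling
    this without `since` yields the full history (action 1).
    """
    return sorted((r for r in revisions if r[2] > since), key=lambda r: r[2])


full_history = revisions_since(REVISIONS)
incremental = revisions_since(REVISIONS, since=datetime(2007, 3, 2, 0, 0))
```

With this shape there is only one code path to get right; the full migration is just the incremental one with the earliest possible start time.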

This should be easy, right?

 

Now in my mind I think that this should not be a big deal. All of this data is stored in SQL. There is a single SQL query that can be crafted to find this information very efficiently. Even if the document library had millions of items and was located on a slow and unreliable connection … this should be able to run quickly and without a lot of fuss.

Let’s find out!

 

So let’s go back to the WSS Web Service documentation and start looking for something that looks promising. “Document Workspace” (DWS) is the obvious starting point. When you visit the DWS front page it is sparse, to say the least. It includes 5 words: the title “Document Workspace Web Service” and a link “Methods”. No summary of usage or sample code. Not an auspicious start.

So I click on “Methods” and this page has one-fifth the content of the previous page. It simply reads “Methods”.

So here is the first rant.

What the $*#@??!!?? “Methods”? That’s it? Someone decided to publish this? Is this a joke? I’m sure the answer has something to do with automatically generated content and blah-blah-blah.

So I used the search box. I entered “Document Workspace Web Service” into the MSDN search box and the first result in the search engine was this. This looks more promising. We’re up to nearly 20 words now. And that hyperlink looks good.

Lesson 4: MSDN has multiple active pages of documentation for the same logical thing. Some pages are horrible. Some are reasonable. Some are great. When in doubt, search.

Anyway – long (even by my standards) story short … the DWS web service was not what I wanted. It includes methods for creating, deleting and modifying document workspaces. I want to enumerate the contents of one.

Eventually I discovered that what I want to do is use the Site Data web service, which has the SiteData class that has an EnumerateFolder method. With the results of all that I could use the Versions web service, which exposes a Versions class whose GetVersions method returns what I wanted.

It didn’t take a lot of time to whip up a sample that enumerated the contents of a WSS service. The tedious part was parsing the XML responses (it turns out that parsing responses is a recurring theme in this project).
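To give a feel for the parsing chore, here is a small Python sketch. The sample XML is hand-written for illustration, not captured from a server; the element and attribute names, the `@` prefix marking the current version, and the date format are all assumptions about what a GetVersions response looks like, so check them against a real response before relying on any of it.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Assumed namespace and date format of the response (not verified here).
NS = "{http://schemas.microsoft.com/sharepoint/soap/}"
CREATED_FORMAT = "%m/%d/%Y %I:%M %p"

# Hand-written sample; server name, users and sizes are made up.
SAMPLE_RESPONSE = """
<results xmlns="http://schemas.microsoft.com/sharepoint/soap/">
  <result version="@2.0" url="http://server/Shared Documents/spec.doc"
          created="3/2/2007 2:30 PM" createdBy="alice" size="20480" comments="" />
  <result version="1.0" url="http://server/_vti_history/1/Shared Documents/spec.doc"
          created="3/1/2007 9:15 AM" createdBy="bob" size="19456" comments="" />
</results>
"""


def parse_versions(xml_text):
    """Turn a GetVersions-style response into dicts, oldest first."""
    root = ET.fromstring(xml_text.strip())
    versions = []
    for node in root.iter(NS + "result"):
        label = node.get("version")
        versions.append({
            "version": label.lstrip("@"),
            "is_current": label.startswith("@"),
            "url": node.get("url"),
            "created": datetime.strptime(node.get("created"), CREATED_FORMAT),
        })
    return sorted(versions, key=lambda v: v["created"])


versions = parse_versions(SAMPLE_RESPONSE)
```

Note that the assumed timestamp format carries no seconds, which matters later on.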

So now I’m sitting here with a code snippet that allows me to enumerate each revision of each item in a WSS document library.  But this isn’t really what I wanted. I wanted to know every revision after a specific point in time. I spent a few hours searching the web, reading documentation and talking with folks more knowledgeable than I about WSS, and what I discovered was this…

There is no way to do this in WSS 2.0. WSS 2.0 does not include the search web service (which apparently can be jury-rigged into doing this using query constraints – but I never researched it to prove that). So my options were limited.

 

"I still haven't found what I'm looking for."

                              - U2 (The Joshua Tree)

 

Every time I want to identify changes to migrate I have to enumerate the entire tree structure and prune those items that are older than the range start time.
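Roughly, the enumerate-then-prune approach looks like the Python sketch below. `FakeLibrary` is a hypothetical in-memory stand-in for the two web service calls (it only mimics the idea of EnumerateFolder and GetVersions, not their real signatures), and it counts calls so the cost is visible.

```python
from datetime import datetime


class FakeLibrary:
    """Hypothetical in-memory stand-in for the folder and version calls."""

    def __init__(self, folders, versions):
        self.folders = folders      # folder path -> list of (name, is_folder)
        self.versions = versions    # file path -> list of (label, created)
        self.round_trips = 0        # one per simulated server call

    def enumerate_folder(self, path):
        self.round_trips += 1
        return self.folders.get(path, [])

    def get_versions(self, path):
        self.round_trips += 1
        return self.versions.get(path, [])


def changes_since(library, root, since):
    """Walk the whole tree, then keep only revisions newer than `since`."""
    found = []
    pending = [root]
    while pending:
        folder = pending.pop()
        for name, is_folder in library.enumerate_folder(folder):
            path = folder + "/" + name
            if is_folder:
                pending.append(path)
                continue
            for label, created in library.get_versions(path):
                if created > since:
                    found.append((path, label, created))
    return sorted(found, key=lambda c: c[2])


library = FakeLibrary(
    folders={
        "Shared Documents": [("spec.doc", False), ("Archive", True)],
        "Shared Documents/Archive": [("old.doc", False)],
    },
    versions={
        "Shared Documents/spec.doc": [
            ("1.0", datetime(2007, 3, 1, 9, 15)),
            ("2.0", datetime(2007, 3, 2, 14, 30)),
        ],
        "Shared Documents/Archive/old.doc": [
            ("1.0", datetime(2006, 1, 10, 8, 0)),
        ],
    },
)
changes = changes_since(library, "Shared Documents", datetime(2007, 3, 2, 0, 0))
```

In this toy library a single matching revision costs four simulated calls, and that number depends on the size of the library, not on how much has changed.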

So while we’re enumerating things let’s enumerate a few problems with this approach:

1) Enumerating is a somewhat expensive operation that will result in something like 2*N round-trips to the server each time the migration process begins (where N is the number of versioned items in the WSS repository).

2) Enumerating is a slow operation (at least we're doing the expensive thing slowly ... right? Or are we just doing the expensive thing for a long time? You decide.).

3) The contents of the server are changing during our enumeration. Remember that we want to identify everything that has occurred since the last migration time. What if the enumeration process takes 8 minutes – should we consider documents added during those 8 minutes? It turns out we should not. If we do, we run the risk of missing or repeating revisions. If you don’t get why – leave me a comment and I’ll explain it in more detail. So really we need to filter based on items that occurred after the previous run and no later than the start of this run.

4) WSS uses one-minute timestamp granularity. Why is this important? If three documents were added during the same one-minute window we need to make sure we either record none of them or all of them. We won’t know how many to capture until the entire minute is over. This means that when we enumerate documents we don’t filter them based on the start time through to the current time – we filter based on the start time through to one minute prior to the current time.
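Points 3 and 4 together define a window for each run. A minimal Python sketch of one way to compute it follows; the helper names are hypothetical, and the choices of truncating to the whole minute and treating the start as exclusive and the end as inclusive are assumptions made for the sketch. It also assumes the client and server clocks agree.

```python
from datetime import datetime, timedelta


def window_end(run_started_at):
    """Last whole minute that is certainly over when the run starts."""
    whole_minute = run_started_at.replace(second=0, microsecond=0)
    return whole_minute - timedelta(minutes=1)


def in_window(created, previous_end, end):
    """True if a revision belongs to this run: previous_end < created <= end."""
    return previous_end < created <= end


# Capture the clock once, before enumerating, so a slow walk cannot move the end.
run_started_at = datetime(2007, 3, 5, 10, 7, 30)
previous_end = datetime(2007, 3, 5, 9, 0)   # end of the previous run's window
end = window_end(run_started_at)

stamps = [
    datetime(2007, 3, 5, 9, 0),    # already migrated by the previous run
    datetime(2007, 3, 5, 10, 6),   # finished minute: migrate now
    datetime(2007, 3, 5, 10, 7),   # minute still in progress: leave for next run
]
picked = [s for s in stamps if in_window(s, previous_end, end)]
```

The next run would pass this run's `end` as its `previous_end`, so consecutive windows neither overlap nor leave a gap.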

Let 3 and 4 sink in for a minute.

They are a pretty big deal.

We can never know when we are done. The act of looking for a change takes so long that another change may have occurred during that time and we could have missed it. So we need to look again. Which takes a long time. Lather well. Rinse and repeat as necessary.

In summary…

 

So let’s review our original goal:

“Enumerating all revision history for a document library after a specific point in time”

Which has now become:

“Enumerating all revision history for a document library after a specific point in time but no later than one minute prior to the start of the enumeration period.”

Lesson 5: Just because something can be done efficiently in the native storage system does not mean that an efficient means of doing it will be exposed to the user.

This is a good lesson for the TFS APIs too. We have several areas where it is more difficult than it should be to do things that seem like they should be trivial (searching changeset comments and annotations are two that come to mind quickly).

Is it just me?

When I’m writing code that feels wrong I tend to believe that I am the problem. That I haven’t invested enough time into learning how to use the API effectively. That may well be the case. I’m flying by the seat of my pants here. Am I missing the boat on this? Is there an “EnumerateAllChangeSince” method I didn’t see in the WSDL? Is there a better way to figure out the contents that are interesting for migration? And most importantly – are the criteria I’m using to figure out the interesting items correct?