What’s up with all those “rsids”?

As many folks who worked with the 2003 wordprocessingML format have probably noticed by now, there is a new set of attributes and elements in the Open XML wordprocessingML format that shows up all over the place. I'm talking about RSIDs.

RSIDs are used to allow applications to more effectively merge two documents that have forked. It's easiest to explain with an example, so let's imagine I have a document that has the following text (we'll call this document "Brian1"):

Clearly this is a great thing for the industry. I personally feel like it's really cool. We now have an official standard that provides all the details necessary to read and write office documents.

I then send this document out to my coworker Steve to review and make changes. Steve decides that he wants to add in a bit of a sarcastic remark for the first sentence so when he sends back the document it looks like this (we'll call it "Steve1"):

Clearly this is a great thing for the industry (unless you happen to be one of those folks who had investments in growing this myth that there was some kind of "file format war" underway). I personally feel like it's really cool. We now have an official standard that provides all the details necessary to read and write office documents.

While Steve was reviewing his copy of the document, I also made some changes. I removed that second sentence, so now my document looks like this (we'll call it "Brian2"):

Clearly this is a great thing for the industry. We now have an official standard that provides all the details necessary to read and write office documents.

Now, when Steve sends me his copy back, I'd like to have my word processor merge my document and his so that I get the most up-to-date version with both of our edits. Ultimately, the merged document would look like this (we'll call it "Final"):

Clearly this is a great thing for the industry (unless you happen to be one of those folks who had investments in growing this myth that there was some kind of "file format war" underway). I personally feel like it's really cool. We now have an official standard that provides all the details necessary to read and write office documents.

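Steve's parenthetical remark (the blue text) is tracked as an insertion, and the sentence "I personally feel like it's really cool." (the red text) is tracked as a deletion.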

Now, why is this example interesting at all? Well, if we only stored the basic text of this document, it would be very difficult to merge. In looking at the difference between "Brian2" and "Steve1", how would the application know what was an insertion and what was a deletion? If I still had my original file ("Brian1"), it would be easy to track this, but that's most likely not the case. I only have my edited document "Brian2", and Steve's document "Steve1". How do you know that the text "I personally feel like it's really cool" wasn't something that Steve added, as opposed to something that I deleted?

One way you can do this is via "track changes" functionality, where the application tracks the insertions and deletions as they happen and stores that in the format, but this often isn't desired. Often, for privacy reasons, people don't want to have the revisions tracked in their documents. Instead, they just want to be able to merge the documents later, and have the application figure out what was inserted, and what was deleted.

Well, the way we deal with this is through revision identifiers (RSIDs). Every time a document is opened and edited, a unique ID is generated, and any edits that are made get labeled with that ID. This doesn't track who made the edits, or what date they were made, but it does allow you to see what was done in a single editing session. The list of RSIDs is stored in the document settings, and then every piece of text is labeled with the RSID from the session in which that text was entered.

This approach is what allows us to properly merge the two documents. When we merge documents, we can see which RSIDs the two documents share. Any shared RSIDs represent text that was entered before the document was forked. Any RSIDs that are unique to one of the documents represent edits that were made after it was forked.

This means that if we see text in one document, but not in the other, all we need to do is look at the RSID applied to that text. If it's one of the shared RSIDs, that means the text existed before the documents were forked. That also means that when we merge the documents, we can assume that the text was deleted from one of the documents, rather than added to the other.
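The rule above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not how any particular word processor implements its merge; the function name `classify_difference` and the RSID values in the usage lines are hypothetical.

```python
def classify_difference(text_rsid, rsids_with_text, rsids_without_text):
    """Classify text that appears in only one of two forked documents.

    text_rsid          -- RSID attached to the text in the document that has it
    rsids_with_text    -- every RSID listed by the document containing the text
    rsids_without_text -- every RSID listed by the document lacking the text
    """
    # RSIDs present in both documents come from sessions before the fork.
    shared = set(rsids_with_text) & set(rsids_without_text)
    if text_rsid in shared:
        # The text predates the fork, so the other document must have removed it.
        return "deletion"
    # The text was typed in a session only one document knows about.
    return "insertion"


# Hypothetical RSIDs: "0000AAAA" is the pre-fork session.
doc_a = ["0000AAAA", "0000BBBB"]
doc_b = ["0000AAAA", "0000CCCC"]
print(classify_difference("0000AAAA", doc_a, doc_b))  # deletion
print(classify_difference("0000BBBB", doc_a, doc_b))  # insertion
```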

Let's go back to our example. In the original file, the XML would look something like this:

<w:body>
  <w:p w:rsidRDefault="00544FOB">
    <w:r>
      <w:t>Clearly this is a great thing for the industry. I personally feel like it's really cool. We now have an official standard that provides all the details necessary to read and write office documents.</w:t>
    </w:r>
  </w:p>
</w:body>

This is saying that all runs (<w:r>) in the paragraph have the RSID "00544FOB" by default. In the document settings, we would also have "00544FOB" listed as one of the RSIDs for the document. (Note that there are a number of other places that RSIDs show up, but we're only focusing on the text for this case.)

Now, after the document went to Steve, and he made his edits, the document "Steve1" would look like this:

<w:body>
  <w:p w:rsidRDefault="00544FOB">
    <w:r>
      <w:t>Clearly this is a great thing for the industry</w:t>
    </w:r>
    <w:r w:rsidR="00FF1F58">
      <w:t>(unless you happen to be one of those folks who had investments in growing this myth that there was some kind of "file format war" underway)</w:t>
    </w:r>
    <w:r>
      <w:t>. I personally feel like it's really cool. We now have an official standard that provides all the details necessary to read and write office documents.</w:t>
    </w:r>
  </w:p>
</w:body>

Notice that while the formatting properties on all three runs are the same, the middle run has a different RSID value from the other two. This happens because Steve added that additional text, so it was assigned a new RSID value, "00FF1F58". The first and third runs carry no RSID of their own, so they fall back to the paragraph default. If you look in the document settings for this document, there will be two RSIDs: "00544FOB" and "00FF1F58".
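To make the "default unless overridden" behavior concrete, here is a small Python sketch that reads a paragraph like the one in "Steve1" and works out which RSID applies to each run. It assumes the standard WordprocessingML namespace is declared on the root element (the snippets in this post leave the declaration out), the run text is abbreviated, and the helper name `runs_with_rsids` is made up for this example.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"  # ElementTree spells qualified names as {namespace}local


def runs_with_rsids(body_xml):
    """Return (rsid, text) pairs for each run, applying the paragraph default."""
    body = ET.fromstring(body_xml)
    result = []
    for para in body.iter(f"{W}p"):
        default_rsid = para.get(f"{W}rsidRDefault")
        for run in para.findall(f"{W}r"):
            # A run's own w:rsidR wins; otherwise use the paragraph default.
            rsid = run.get(f"{W}rsidR", default_rsid)
            text = "".join(t.text or "" for t in run.findall(f"{W}t"))
            result.append((rsid, text))
    return result


# Abbreviated version of "Steve1" with the namespace declaration added.
STEVE1 = f"""<w:body xmlns:w="{W_NS}">
  <w:p w:rsidRDefault="00544FOB">
    <w:r><w:t>Clearly this is a great thing for the industry</w:t></w:r>
    <w:r w:rsidR="00FF1F58"><w:t>(unless you happen to be ...)</w:t></w:r>
    <w:r><w:t>. I personally feel like it's really cool.</w:t></w:r>
  </w:p>
</w:body>"""

for rsid, text in runs_with_rsids(STEVE1):
    print(rsid, "|", text)
```

Only the middle run reports "00FF1F58"; the other two resolve to the paragraph default.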

Now, separately I opened my copy and deleted some text. So the document "Brian2" is going to look like this:

<w:body>
  <w:p w:rsidRDefault="00544FOB">
    <w:r>
      <w:t>Clearly this is a great thing for the industry. We now have an official standard that provides all the details necessary to read and write office documents.</w:t>
    </w:r>
  </w:p>
</w:body>

Notice that the runs in the paragraph all have the same RSIDs still. There aren't any new RSIDs in the body because I didn't add any text. I did however edit the document, so if you look in the document settings, there will be a new RSID. So in "Brian2", we have the following two RSIDs: "00544FOB" and "00A95BA5".
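For anyone who wants to check the settings list programmatically, here is a rough Python sketch. It assumes the settings part stores the list as <w:rsid w:val="..."/> entries inside a <w:rsids> element; the fragment below is hand-written for illustration rather than copied from a real file, and `list_rsids` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def list_rsids(settings_xml):
    """Return the w:val of every w:rsid entry, in document order."""
    root = ET.fromstring(settings_xml)
    return [entry.get(f"{W}val") for entry in root.iter(f"{W}rsid")]


# Hand-written settings fragment matching the two RSIDs described for "Brian2".
BRIAN2_SETTINGS = f"""<w:settings xmlns:w="{W_NS}">
  <w:rsids>
    <w:rsid w:val="00544FOB"/>
    <w:rsid w:val="00A95BA5"/>
  </w:rsids>
</w:settings>"""

print(list_rsids(BRIAN2_SETTINGS))
```

The second RSID shows up here even though no run in the body uses it, because the session only removed text.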

So, when we go to generate the "Final" document, we merge "Brian2" with "Steve1". As we merge the two documents, we see that they share the RSID "00544FOB", but that all other RSIDs are unique to one copy or the other. This means that any text with the RSID "00544FOB" existed in the original file, and any other text was added after the fork. There are two pieces of text in Steve's document that aren't in mine. The first, which reads "(unless you happen to be one of those folks who had investments in growing this myth that there was some kind of "file format war" underway)", has an RSID unique to Steve's document, so it was an addition made by Steve rather than something I deleted. The other, which reads "I personally feel like it's really cool.", has an RSID that is shared between the two documents. That tells us that it was deleted from my copy, rather than added to Steve's.

So, next time you're looking at a wordprocessingML document and you're wondering why it's broken out into so many runs, you'll know the answer. This is another example of how the simplicity of the flat schema wordprocessingML uses makes it easy to add properties to the various runs of text. The RSID isn't a container, but rather just a property of the text. If we had the ability to nest runs within other runs (similar to the HTML <span> model), then it would be a bit more complicated (not impossible, just more complicated). The architecture of a wordprocessingML file is much simpler. Since runs can't be nested in other runs, you have a more predictable ancestor list to walk through when finding the properties of a particular run.

If you would rather not have these RSIDs in your files, they're easy enough to turn off. Just go to the trust center and turn off the setting "Store random number to improve combine accuracy".

Two other important things to note. The first is that the RSID tells us nothing about the time or order in which things were done. The values are completely random, and are only used for seeing where things match, so they aren't of much use unless you are merging with another document that also has RSIDs. The second is that RSIDs are not just used for content, but for other things as well, like styles, layout, etc.

-Brian