Why do Office ".xml" files behave differently from other ".xml" files?


Some of you who have worked with Office 2003 XML files may have noticed that while we use the “.xml” extension, the files still show unique icons and the original application is launched when you double-click them. The files are totally valid XML files following the W3C XML 1.0 spec. The reason they behave differently is that we put a PI (Processing Instruction) at the top of the XML file that identifies which application created the XML. Open any of the Word XML files with a text editor, and you’ll see the following:


<?mso-application progid="Word.Document"?>


That declaration is what lets us know that it’s a Word XML file. We do the same thing with InfoPath and Excel XML files. There is a component that we call the msoxev that sniffs files with the .xml extension and looks for that PI. When it sees the PI, it then does a lookup in the registry to see if there is an application associated with the progid attribute. If so, it will use that application for opening and editing the file.
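
To make the sniffing step concrete, here is a rough Python sketch of the idea. It is not the actual msoxev code, just an illustration: the function name find_mso_progid is made up for this example, and it only covers reading the PI, not the registry lookup that follows.

```python
import re
from xml.dom import minidom


def find_mso_progid(xml_text):
    """Return the progid from an mso-application PI, or None if there is none.

    Illustrative only: this is an assumption about how such a sniffer
    could work, not the real Office component.
    """
    doc = minidom.parseString(xml_text)
    # Processing instructions that precede the root element show up as
    # direct children of the document node.
    for node in doc.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "mso-application"):
            # The PI body is free text, so pull out progid="..." by hand.
            match = re.search(r'progid\s*=\s*(["\'])(.*?)\1', node.data)
            if match:
                return match.group(2)
    return None


sample = (
    '<?xml version="1.0"?>\n'
    '<?mso-application progid="Word.Document"?>\n'
    '<doc/>'
)
print(find_mso_progid(sample))  # prints Word.Document
```

A file without the PI simply yields None, which corresponds to the "treat it like any other XML file" case.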


We also run this in IE, so if you open one of the XML files in IE, it will automatically get handed off to the proper application. This is great if you are just following a hyperlink and want to view the file with the application that generated it. If you are debugging the files or want to view the XML directly in IE, though, it’s a bit of a pain. If you want to open the file in IE and not get redirected, you have a couple of options.


One-time adjustment: If you want to change the behavior just for that specific document, you can open the file in a text editor and delete the PI. Then it will behave just like a regular XML file.
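
If you have more than a handful of files, the same edit can be scripted. This is a minimal sketch that assumes the PI appears exactly as shown earlier; strip_mso_pi is a hypothetical helper name, and you would still need to read and write the files yourself.

```python
import re

# Matches the mso-application PI plus the line break that follows it.
PI_PATTERN = re.compile(r"<\?mso-application\b.*?\?>[ \t]*\r?\n?", re.DOTALL)


def strip_mso_pi(xml_text):
    """Return xml_text with the first mso-application PI removed."""
    return PI_PATTERN.sub("", xml_text, count=1)


before = (
    '<?xml version="1.0"?>\n'
    '<?mso-application progid="Word.Document"?>\n'
    '<doc/>'
)
after = strip_mso_pi(before)
```

Text that has no such PI comes back unchanged, so running it twice on the same file is harmless.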


Permanent adjustment: This is a behavior you can easily modify if you want. The XEV mechanism just sniffs the registry to see what the content type for that file is. Go to the following: “HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office\11.0\Common\Filter\text/xml” and you’ll see a collection of entries. The name of the string matches the progid attribute in the PI, and the value of that string is the content type for the file. If you don’t want the current behavior, you can just delete the string or rename it, and it will now behave like any other XML file.
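
Before deleting or renaming anything, it can help to see what is registered. The sketch below uses Python's standard winreg module to list the string values under that key; it only reads, and it returns an empty dict when the key is missing or when it is not running on Windows. The function name is made up for this example, and note that on 64-bit Windows with 32-bit Office the key may live under the WOW6432Node branch instead, which this sketch does not handle.

```python
import sys

# Key path taken from the text above, relative to HKEY_LOCAL_MACHINE.
FILTER_KEY = r"SOFTWARE\Microsoft\Office\11.0\Common\Filter\text/xml"


def list_filter_entries():
    """Return {progid: content_type} for the registered XML filters."""
    if sys.platform != "win32":
        return {}  # winreg only exists on Windows
    import winreg

    entries = {}
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, FILTER_KEY) as key:
            index = 0
            while True:
                try:
                    name, value, _value_type = winreg.EnumValue(key, index)
                except OSError:
                    break  # no more values under this key
                entries[name] = value
                index += 1
    except OSError:
        return {}  # key not present, e.g. Office 2003 is not installed
    return entries


for progid, content_type in list_filter_entries().items():
    print(progid, "->", content_type)
```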


You can also customize this regkey to register your own applications that want to use the .xml extension.


We won’t have this issue with the Office 12 XML files, because we actually use unique extensions. It was something we had discussed doing with the Office 2003 XML files but eventually decided against it. The new default formats will still be XML, but they will actually be wrapped in a ZIP container, and we decided using unique extensions (.docx, .pptx, .xlsx) was the best way to go.


-Brian

Comments (15)

  1. Ali says:

    If you have the Word 2003: XML Viewer (from http://www.microsoft.com/downloads/details.aspx?FamilyID=19676b18-1bcd-4852-93ba-0b5a203ea731&displaylang=en), the content type application/msoxmlviewer is also available to be used to open your own .xml files in the viewer directly in IE. This would work well if you have one or more transforms to view your own XML files.

  2. y says:

    I’m glad that you’ve chosen unique extensions for the new formats, but I suppose this might be the place to reiterate my plug for the OpenDocument mimetype convention for zip formats, which could easily be adopted in addition to the new extensions.

    The convention is to place a "mimetype" file first in the zip archive, uncompressed, whose contents are a MIME type describing the entire document.

    See section 17.4 of the OpenDocument v1.0 standard, http://www.oasis-open.org/committees/download.php/12572/OpenDocument-v1.0-os.pdf.

  3. BrianJones says:

    Y, that’s an interesting suggestion. Does that mean that the file is not valid though if that "mimetype" file isn’t the first file in the archive?

    Many ZIP tools out there don’t give you much control over the order of your parts without added complexity. We really want people to be able to use existing tools to generate these files.

    -Brian

  4. y says:

    The OD standard says "should". My interpretation would be that the positioning of the mimetype is intended for the convenience of unsophisticated tools that do not understand the file format and wish merely to identify the type of the file from its first few bytes–something that is quite convenient, for example, if the extension is somehow lost.

    Tools that understand the zip file format presumably would be able to interpret the contents of the mimetype no matter how it might be placed within the zip file. Tools that understand the precise format of the file (odt, ods, docx, xlsx, etc.) presumably would be able to deal with it properly even if the mimetype were missing altogether.

    For your formats, you could certainly choose to make the mimetype an optional recommendation, and arrange for your code to generate it.

  5. BrianJones says:

    That’s something I’ll think about a bit, but I’m not really a big fan of optional things like that when they have such a significant meaning. No tool could really rely on it if it’s only optional. If it’s not optional, then it makes it that much more difficult to create the files.

    The content type of the start part is fairly easy to determine and is done in a ZIP agnostic way. I think that most likely that’s what we’ll stick with.

    Thanks for the suggestion though. Like I said I’ll think about it a bit more.

    -Brian

  6. y says:

    What specifies which part is the start part?

  7. BrianJones says:

    Check out the example I linked to in this post: http://blogs.msdn.com/brian_jones/archive/2005/06/20/430892.aspx

    There is a package relationship file that is always here: "/_rels/.rels"

    It’s an XML file that describes all the root level relationships. The relationship of type officeDocument points to the start part.

    There is a content type file that is always here: "[Content_Types].xml"

    That describes the content types for each part in the file. You look up the content type for the start part and you’ll know what kind of file it is.

    -Brian

  8. y says:

    But how do you know a priori that the document is an Office document (as opposed to something else using Metro conventions) so that you would know the relationship you want for the start part is the one of type officeDocument? Or do you simply have to know that any document that has a relationship of type officeDocument is an Office document?

    The advantage of the mimetype convention is that anyone can identify the type of the file, and route it to the appropriate application, without having to know everything about the details of the format.

  9. Slashdot: MS Office XML Format Now in TextEdit

    I saw this the other day on slashdot. I have to admit…

  10. Wes Jackson says:

    I tried the above, linking an MS-WORD XML file from my portal (run in CMS). The <?mso-application progid="Word.Document"?> line is there, yet the file still opens up in IE. I need the users to be able to click the file link and have the macro-enabled Word document open up in Word. Is there something else I need to do to the file to ensure it opens properly?

  11. Phil M says:

    Anyone have an answer for Wes’ question? I too have the right progid in my XML file, but it still opens up in IE and then INSIDE IE opens Word. I want it to open up directly in Word, which I swear it did yesterday…..

  12. BrianJones says:

    Phil, it could be that you’ve installed the Word XML viewer (http://www.microsoft.com/downloads/details.aspx?familyid=19676b18-1bcd-4852-93ba-0b5a203ea731&displaylang=en)

    That would cause the files to open in IE instead of Word, but you should be able to choose to edit them in Word (from the shell or even within IE). When it opens in IE is it in the XML view or is it rendered with the rich formatting?

    -Brian

  13. Bill Zuck says:

    What would the MIME type be for MS Wordviewer 2003?

    When we use the Word XML Viewer, the HTML output leaves the doc unformatted and all styles are lost. The Wordviewer maintains the structure. Our goal is that selecting a link to an XML document opens the file in Wordviewer 2003.

    Any suggestions?

    Your articles are great and very informative. I hope there is a solution to this problem.

    bzuck@adelphia.net

    Thanks

    Bill Zuck

  14. Meena Bhashyam says:

    We have a situation where we’re using the XML Viewer – Content Type = "application/msoxmlviewer". On a Windows 2003 server, I see raw XML when I try to download a file from the browser. However, when I save it to a folder and double-click, the formatting I had applied in Word comes back. I am not sure why we’re seeing the raw XML when downloading. Any help would be appreciated. The goal is to get the formatted Word document directly on download.

    Thanks,

    Meena
