Transforming Word XML to XSL-FO

We put an article up back in the winter on transforming from WordprocessingML into XSL-FO. From there, you can go into other formats like PDF. Not sure if you guys have already seen this, but if not you should check it out:

Moving into fixed formats is pretty difficult, because if you really want full fidelity you also need to understand Word’s layout functionality. Fixed formats are formats that describe how the text and information are laid out on a page. PDF and XPS are both examples of fixed formats. The Word format is a flow-based format. If you add a paragraph somewhere in the WordprocessingML, then when you open the file back up, the page layout will of course be different (everything after that paragraph just got shoved down). This means that we don’t store information like page breaks in the format. If we did, and we enforced it, then it would be significantly more difficult to work with the files, as you’d have to recalculate those things any time you modified the text.
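
To make the flow idea concrete, here is a rough sketch (not the stylesheet from the article) that maps each Word paragraph onto an XSL-FO block. It assumes the Word 2003 WordprocessingML namespace and a tiny hand-written sample document; the function name wordml_to_fo, the page master name "page", and the Letter page size are made up for illustration. Notice that the output says nothing about where pages break: an FO processor works that out when it lays the flow onto pages.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces: Word 2003 WordprocessingML and XSL-FO.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
FO = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO)

# Hand-written sample input; real Word output carries far more markup.
SAMPLE = f"""<w:wordDocument xmlns:w="{W}">
  <w:body>
    <w:p><w:r><w:t>First paragraph.</w:t></w:r></w:p>
    <w:p><w:r><w:t>Second </w:t></w:r><w:r><w:t>paragraph.</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""


def wordml_to_fo(wordml: str) -> ET.Element:
    """Hypothetical helper: turn each w:p into one fo:block, text only."""
    doc = ET.fromstring(wordml)

    root = ET.Element(f"{{{FO}}}root")

    # One page master describing the page geometry (values are arbitrary).
    masters = ET.SubElement(root, f"{{{FO}}}layout-master-set")
    page = ET.SubElement(
        masters,
        f"{{{FO}}}simple-page-master",
        {"master-name": "page", "page-height": "11in", "page-width": "8.5in"},
    )
    ET.SubElement(page, f"{{{FO}}}region-body")

    # A single flow holds all the content; no page breaks are recorded.
    seq = ET.SubElement(root, f"{{{FO}}}page-sequence", {"master-reference": "page"})
    flow = ET.SubElement(seq, f"{{{FO}}}flow", {"flow-name": "xsl-region-body"})

    for para in doc.iter(f"{{{W}}}p"):
        # Join the text of every run in the paragraph; formatting is ignored.
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        block = ET.SubElement(flow, f"{{{FO}}}block")
        block.text = text

    return root


if __name__ == "__main__":
    print(ET.tostring(wordml_to_fo(SAMPLE), encoding="unicode"))
```

Adding a third w:p to the sample just adds a third fo:block; nothing else in the output has to be recalculated, which is the point about flow formats. A real conversion also has to carry over styles, tables, sections, and so on, which this sketch skips entirely.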

I love examples like this that show some of the stuff you can do once you get Office documents in XML. I’m trying to gather a list of similar articles we should provide as we go through the Betas and people start working with the new formats. Are these kinds of articles useful? Are there other similar articles you’d like to see? At one point we had started to build a similar one showing how to go into DocBook, but I’m not sure what happened with it. I’ll see if I can dig it up.


Comments (15)

  1. Tobek says:

    Hey Brian,

    My $0.02.

    IMHO, the *only* reason folks convert Word docs to pdf is to ensure that the pagination stays static, and you have a fixed WYSIWYG document.

    If there was a "freeze document" option in Word that removed the flow and fixed the layout, we wouldn’t need Acrobat OR Metro.

  2. Evans says:

    Tobek, I disagree. I work a lot with law firms, and the primary reason we convert Word documents to PDF is to reduce/remove metadata and ensure it’s a universally "read only" document. There are metadata removal tools that we use, too, but PDF is seen as the ultimate metadata eliminator (yes, I know it’s not strictly true).

  3. BrianJones says:

    Those are both interesting comments. Tobek, when you say "freeze document", do you mean just essentially making it read-only? Or is there more to it than that?

    Evans, why do you think it is that PDF is seen as the ultimate metadata eliminator? I’ve seen plenty of PDF files with additional information. Is the issue that it’s harder for people to see that hidden data? Or is there just a misperception of how locked down and clean a PDF file is?

    Have you seen this add-in for Office that lets you remove all hidden data from an Office file:

    Do you think tools like this would give people more confidence or is there something else to it as well?


  4. Ali says:

    I’ve certainly seen the confidence that people have in PDF, as Evans mentions. Part of that, from what I could gather, has to do with the UI of the reader, of all things. With a really simple and minimal UI, there is a feeling that there’s no room to hide the metadata. I recall one person saying to me, of Word, "with all those menus and dialogs, you never know where they have my salary written in there."

  5. Tobek says:

    OK, so I should have been clearer. Removing metadata for us is important, and I get folks I work with to use Word’s metadata removal tools to do this.

    What we’re after is:

    1. Documents with edit and/or view password protection (and content encryption), that can be rendered permanently read-only if and when needed – though still allowing content extraction or generation of an editable copy while maintaining integrity of the original.

    2. Certainty that the document will paginate and print the same regardless of what printer it is sent to, what paper it is printed on (eg A4 or US Letter), etc.

    This is essentially though not explicitly demanded of us by regulators (I work in pharmaceutical R&D, so that means FDA here in the US). And their requirement is pdf format – not because the individuals at FDA care whether it’s in pdf or whatever the hell it is, but because pdf meets the above stipulations. This ends up requiring us to maintain at least 2 versions of a given document: a Word version for future use (e.g. content extraction) and a pdf version (the facsimile archive copy of what we gave the regulators). Total PIA.

    We’d much rather be able to do the whole bunch in Word (or whatever other Office program is applicable – we have to maintain archival copies of certain emails for example) and not be having to deal with multiple electronic files for a given "document".

    Note that we’re still using a system of paper-based originals, not trying to be 21CFR Part11 compliant. If you don’t know what that is, count yourself lucky!

  6. Keith says:

    > At one point we had started to build a similar one showing how to go into DocBook, but I’m not sure what happened with it. I’ll see if I can dig it up.

    I would _love_ to see that.



  7. Evans says:

    Brian, I, too, have seen PDFs with additional information. The vast majority of legal documents are repurposed to make new legal documents. Attorneys sometimes make a lateral transfer to a different firm, or a client will switch counsel. In both cases, the documents will frequently travel with them. More than track changes, the document stats and document properties, last editor, previous file locations, etc., can divulge information that is embarrassing at the least. It’s not uncommon for us to use documents created years ago. Clients don’t really like it when you’re billing lots of hours for a document created in 1993.

    I’ve used the Remove Hidden Data Tool, but it doesn’t adequately meet the needs of the legal community. I’ve found people freak out when they receive a Word document that doesn’t behave like a Word document. Another application, iScrub from Esquire Innovations, has a "Metasealant" feature that produces a protected Word document. I’m not especially fond of that, either. I need to re-read the XPS specs, but part of that made me a little uncomfortable.

    There is definitely a false sense of security in the PDF format. Users rarely apply security to their PDF documents. A major factor in the PDF vs. protected Word document is the user experience. The Acrobat reader is pretty universal, and people are very comfortable with it. As in Tobek’s example, most courts require documents filed electronically to be PDF. Unlike Tobek, we rarely have to keep copies of every single document that goes out the door (thank goodness!), although the document management systems frequently used log that event.

    I’m really excited about the new format and what it will allow us to do. I’m especially thrilled about the document stability / document corruption features built in. That’s one of our biggest problems. I’ll have to look at the article you referenced, Brian.


  8. AC says:

    Brian, that tool to clean up Word documents hardly works at all. Try getting a document that’s been edited a lot, run it through the tool, and then do a "strings" on it (view all the strings in a hex editor if your OS doesn’t have an equivalent command). You’ll find yourself horribly surprised. I’ve lost all confidence in Microsoft coming up with a useful security or privacy tool.

  9. Evans says:

    I agree with AC in large part. While the RHDT does a better job than nothing, it’s far from a "solution." The legacy binary is a horror story. You really do need a 3rd party app to do the job. However, the XML format shows great promise. Without seeing the schemas, it’s hard to really tell. From what I’ve read, it might be as easy as programmatically deleting two XML files in the new docx "package."

    That would be most cool.


  10. BrianJones says:

    I’ll dig into the tool a bit more and see what the issues are.

    You guys are right that this will be much easier with the new formats. It won’t be as simple as just deleting a couple parts, but it will be relatively easy. We call this kind of data PII (Personally Identifiable Information). The definition of PII is actually different depending on who you talk to, and it even changes over time. The new formats will give that flexibility so that we can publish how all the information is stored, and people can build tools that allow you to remove whatever you want. It’s pretty exciting.


  11. BrianJones says:

    AC, do you have an example file you could send me that is failing when you run the tool? What is the data you’re seeing that you’re concerned about?

    Here’s a link to get in touch with me:

    I really appreciate your feedback. I’d like to figure out what’s going on here.


  12. AC says:

    No I can’t because they’re docs from work (I don’t use Word at home). Some things I remember include user names (besides last one to save), previous file paths, footnotes that do not show up on the doc, and other strings that don’t mean much to me, but look interesting for a snooper. Just get some old document at Microsoft, that’s been edited by many people, and try it.

  13. Craig Ringer says:

    There certainly is a false sense of security with PDF – much more than most people are aware of.

    First, the PDF encryption is often not hard to crack. Additionally, you don’t even need to crack it if it permits viewing without a password – you can simply use a tool such as GhostScript that’s not aware of PDF encryption to convert it to PostScript, TIFF, etc. (then from PostScript back to PDF – sans encryption – if you like).

    To make matters worse, PDF is hardly read only in any sense. It’s difficult to modify by hand, but not impossible. Tools like Enfocus PitStop Professional can make editing PDF powerful and easy. I regularly make corrections on PDF documents sent in by clients, including fixing colours, replacing fonts, changing text, replacing or extracting images, and more.

    Relying on PDF’s read-only nature (including write-protect encryption) is effective only in the same sense that software copy protection is effective. It stops the laziest, most casual of attackers, and might slow the rest down a little bit.

  14. Josh Mahowald says:

    I can’t be the only one who wants to go in the reverse direction, can I? I have reports that typically need to go out as PDF, and occasionally be read into Word. I was hoping to be able to put out XSL-FO, and run a FO processor for PDF, and XSLT from XSL-FO to WordprocessingML, but it looks like to do so I’d need to write the XSLT myself. I hope someone here tells me I’m an idiot and points me in another direction . . .

  15. Kevin Brown says:

    Well, we did the original conversion styles and many enhancements since. If this type of technology interests you, you may wish to try out the beta from our new partner.

    They enhanced the technology to make Word into an XSL Style Designer, both for designing WordML and for streaming input to the RenderX XSL-FO engine.