How to Dynamically Generate Word Documents from a Web Site

On my most recent project, an ASP.NET 2.0 web site, we thought it would be useful to let users click a button and get the page they were currently looking at as a nicely formatted Word document, which they could email to friends or use as the basis of a longer document of their own. We had to support Word 2003 and XP, we wanted any images to be embedded in the document, we didn't want to buy any third-party components, and we didn't want to have to run a copy of Word on the web server. I came up with a solution and I thought I'd share it here in case anybody had a better one!

The Word Document format is binary, so pretty much only Word can generate it. We didn't want to have to put Word on the server, as I'm pretty sure that's not a supported scenario, so making .doc files was out of the question. (I did it once; it was really painful <shudder>.)

I next looked at creating the document in HTML, as Word understands it and can format it pretty nicely. The big problem with HTML is that it references images rather than embedding them the way a native Word document or a PDF does. The document would only appear correctly if the user could reach the images over the network, and we didn't like that restriction, so that was out.

So what other formats does Word support? After some searching I discovered a solution - Multipart MIME Encoding. This is a common and very standard format and has the advantage that multiple files can be embedded, even images: they are simply encoded in Base64; each file is demarcated with a multipart MIME header. I discovered that you could take a standard MIME-encoded file, rename it to .doc, and Word would open it and treat the file as if it were a native document!
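To make the format concrete, here is a minimal sketch that builds such a file by hand: one HTML part and one image part, each Base64-encoded and separated by a boundary line. The class name, the boundary string and the example URLs are all made up for illustration, and the image type is assumed to be PNG. It shows the layout of the file only; I'm not claiming this exact set of headers is what every version of Word expects.

```csharp
using System;
using System.Text;

public static class MimeDocumentSketch
{
    // Builds a multipart/related document with two parts: the HTML page and
    // one image that the HTML refers to by URL (matched via Content-Location).
    public static string Build(string html, string pageUrl, byte[] imageBytes, string imageUrl)
    {
        // Arbitrary boundary; it must not occur anywhere in the encoded parts.
        const string boundary = "----=_Sketch_Boundary_0001";
        StringBuilder sb = new StringBuilder();

        // Top-level headers declare the container type and the boundary.
        sb.Append("MIME-Version: 1.0\r\n");
        sb.Append("Content-Type: multipart/related; type=\"text/html\"; boundary=\"" + boundary + "\"\r\n\r\n");

        AppendPart(sb, boundary, "text/html; charset=\"utf-8\"", pageUrl, Encoding.UTF8.GetBytes(html));
        AppendPart(sb, boundary, "image/png", imageUrl, imageBytes); // assumes a PNG image

        // The closing delimiter is the boundary followed by two extra dashes.
        sb.Append("--" + boundary + "--\r\n");
        return sb.ToString();
    }

    // Writes one part: delimiter line, part headers, blank line, Base64 body.
    private static void AppendPart(StringBuilder sb, string boundary, string contentType, string location, byte[] content)
    {
        sb.Append("--" + boundary + "\r\n");
        sb.Append("Content-Type: " + contentType + "\r\n");
        sb.Append("Content-Transfer-Encoding: base64\r\n");
        sb.Append("Content-Location: " + location + "\r\n\r\n");
        // InsertLineBreaks wraps the Base64 text at 76 characters, as MIME requires.
        sb.Append(Convert.ToBase64String(content, Base64FormattingOptions.InsertLineBreaks));
        sb.Append("\r\n");
    }
}
```

Calling MimeDocumentSketch.Build with the page HTML, its URL, the image bytes and the URL used in the img tag returns the whole file as a string, which could then be saved with a .doc extension.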

So how to create this document?

One option I found is a Windows COM library called "Collaboration Data Objects" (CDO), which knows a fair bit about Multipart MIME Encoding. Among many other things, Outlook Express uses it to create and send emails, which are all in this format.

Assume then that we are in an ASP.NET page, and our user has clicked a link or button and wants that page (or one that looks just like it, perhaps without the navigation or other extraneous details) as a Word Document.

To do this, first we have to add a reference to this library via COM Interop. Then make sure you set the AspCompat="true" attribute in the Page directive.

Next, there is a method on the Message class we can make use of called "CreateMHTMLBody". It is ordinarily used to create the body of an email message, or a web page when you do a "Save As..." in Internet Explorer and select "Web archive, single file (*.mht)" as the file type; the result is a Multipart MIME Encoded file thanks to this component. The method is not ideal, as its main input is a URI to the page to be encoded. We can't hand it a number of files to encode or anything like that; the content has to be a web page we have previously prepared to our liking, so that we can pass the component a URL to it. The component then fetches the page from the web server, along with all the images and stylesheets it references, and bundles it all up in a single file. We get a Stream back with the encoded content, which we can then pass back to the user who clicked the link, after setting the Content-Type header to tell the browser that it's a Word document, like so:

Response.ContentType="application/msword";
Response.AddHeader( "Content-Disposition", "attachment;filename=AFileName.doc");
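Putting the pieces together, the click handler might look roughly like the sketch below. It assumes COM references to the "Microsoft CDO for Windows 2000 Library" and "Microsoft ActiveX Data Objects" interop assemblies and AspCompat="true" on the page. The page class, the handler name and the URL are hypothetical, and I've left out error handling.

```csharp
using System;
using System.Web.UI;

// Hypothetical code-behind class for the page that hosts the export button.
public partial class ExportToWord : Page
{
    protected void ExportButton_Click(object sender, EventArgs e)
    {
        // Hypothetical URL of the Word-friendly version of the page.
        string url = "http://localhost/MySite/PrintableReport.aspx";

        // CDO fetches the page plus the images and stylesheets it references
        // and wraps them all in a single multipart MIME message.
        CDO.Message message = new CDO.MessageClass();
        message.CreateMHTMLBody(url, CDO.CdoMHTMLFlags.cdoSuppressNone, "", "");

        // Read the whole encoded message back as text.
        ADODB.Stream stream = message.GetStream();
        string mime = stream.ReadText(stream.Size);
        stream.Close();

        // Send it to the browser as a Word document download.
        Response.Clear();
        Response.ContentType = "application/msword";
        Response.AddHeader("Content-Disposition", "attachment;filename=AFileName.doc");
        Response.Write(mime);
        Response.End();
    }
}
```

The two empty strings are the user name and password arguments of CreateMHTMLBody, which only matter if the page being fetched requires authentication.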

This solution should work for all browser types, but because of the Interop it will not be a great performer, so users will have to make judicious use of this feature! There is also the effort involved in making a version of your web page designed to look pretty as a Word document. By default, most web sites don't look good if converted without the narrow page format and multi-page nature of Word in mind, and users won't want meaningless navigation and links to nowhere in their document. You can use regular expressions to whack all the links like this:

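// requires: using System.Text.RegularExpressions;
// Match opening and closing anchor tags only (not <abbr>, <address>, ...),
// so the link text is kept and no stray </a> is left behind.
Regex href = new Regex(@"</?a(\s[^>]*)?>", RegexOptions.IgnoreCase | RegexOptions.Compiled);
string document = AMethodToGetTheMIMEEncodedFile();
document = href.Replace(document, string.Empty);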

Also, thanks to the wondrous power of stylesheets, you can get your web site's pages looking pretty good without too much effort, and provide your users with great-looking, editable versions of the data on your web site.

In the next version of Office, which has an XML native document format, we have additional options for achieving this and I look forward to exploring those! In the meantime if anybody knows of a better way of doing this, I would love to hear about it :-)