How to Dynamically Generate Word Documents from a Web Site


On my most recent project, an ASP.NET 2.0 web site, we thought it would be useful to let users click a button and get the page they were currently viewing as a nicely formatted Word document, which they could email to friends or use as the basis of a longer document of their own. We had to support Word 2003 and Word XP, wanted any images embedded in the document, didn't want to buy any third-party components, and didn't want to run a copy of Word on the web server. I came up with a solution and thought I'd share it here in case anybody has a better one!


The Word document format is binary, so pretty much only Word can generate it. We didn't want to put Word on the server, as I'm pretty sure that's not a supported scenario, so making .doc files was out of the question. (I did it once and it was really painful <shudder>.)


I next looked at creating the document in HTML, as Word understands it and can format it pretty nicely. The big problem is that HTML references images rather than embedding them, as a native Word document or a PDF does. The document would only display correctly while the user could reach the server hosting the images, and we didn't like that restriction, so that was out.


So what other formats does Word support? After some searching I discovered a solution - Multipart MIME Encoding. This is a common, standardized format with the advantage that multiple files, even images, can be embedded in one document: binary files are simply encoded in Base64, and each file sits in its own part, separated from the others by a MIME boundary and introduced by its own headers. I discovered that you could take a standard MIME-encoded file, rename it to .doc, and Word would open it and treat the file as if it were a native document!
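To make the format concrete, here is a rough sketch that assembles such a file by hand: one HTML part and one Base64-encoded image part inside a multipart/related wrapper. The class name, boundary string and URLs are all made up for illustration, and the HTML part is Base64-encoded only to keep the sketch short; a real encoder may well pick a different encoding for it. How tolerant Word is of hand-built files like this is untested here.

```csharp
using System;
using System.Text;

public static class MhtmlSketch
{
    // Hypothetical boundary; any token that never occurs inside the parts will do.
    private const string Boundary = "----=_NextPart_Sketch_0001";

    // Builds a two-part multipart/related document: an HTML page plus one image.
    public static string Build(string pageUrl, string html,
                               string imageUrl, string imageType, byte[] imageBytes)
    {
        StringBuilder sb = new StringBuilder();

        // Top-level headers describing the whole document.
        sb.Append("MIME-Version: 1.0\r\n");
        sb.Append("Content-Type: multipart/related; type=\"text/html\"; boundary=\""
                  + Boundary + "\"\r\n");
        sb.Append("\r\n");

        // Part 1: the HTML page itself.
        AppendPart(sb, "text/html; charset=\"utf-8\"", pageUrl,
                   Encoding.UTF8.GetBytes(html));

        // Part 2: the image, filed under the same URL the HTML uses in <img src>.
        AppendPart(sb, imageType, imageUrl, imageBytes);

        // The closing boundary (note the trailing "--") ends the document.
        sb.Append("--" + Boundary + "--\r\n");
        return sb.ToString();
    }

    private static void AppendPart(StringBuilder sb, string contentType,
                                   string location, byte[] body)
    {
        sb.Append("--" + Boundary + "\r\n");
        sb.Append("Content-Type: " + contentType + "\r\n");
        sb.Append("Content-Transfer-Encoding: base64\r\n");
        sb.Append("Content-Location: " + location + "\r\n");
        sb.Append("\r\n");
        // Base64 body, wrapped into short lines as MIME expects.
        sb.Append(Convert.ToBase64String(body, Base64FormattingOptions.InsertLineBreaks));
        sb.Append("\r\n");
    }
}
```

Writing the returned string to a file with a .doc extension (for example with File.WriteAllText) gives you something to try opening in Word. The output is plain ASCII as long as the URLs are.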


So how to create this document?


One option I found is via a Windows COM library called "Collaboration Data Objects" (CDO) that knows a fair bit about Multipart MIME Encoding, as it is used (among many other things) to create and send emails by Outlook Express, which are all in this format.


Assume then that we are in an ASP.NET page, and our user has clicked a link or button and wants that page (or one that looks just like it, perhaps without the navigation or other extraneous details) as a Word Document.


To do this, first we have to add a reference to this library via COM Interop. Then make sure you set AspCompat="true" in the @ Page directive.


Next, there is a method on the Message class we can make use of called "CreateMHTMLBody". It is ordinarily used to create the body of an email message (or a web page, when you do a "Save As..." in Internet Explorer and select "Web archive, single file (*.mht)" as the file type; the result is a Multipart MIME Encoded file thanks to this component). This is not ideal, as its main input is the URL of the page to be encoded. We can't hand it a set of files to encode; the content has to be a web page we have already prepared to our liking and can give the component a URL for. The component then fetches the page from the web server, along with all the images and stylesheets it references, and bundles everything into a single file. We get a Stream back with the encoded content, which we can then send to the user who clicked the link, after setting the response headers to tell the browser that it's a Word document, like so:


Response.ContentType="application/msword";
Response.AddHeader( "Content-Disposition", "attachment;filename=AFileName.doc");
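Putting the pieces together, the whole request might look roughly like the sketch below. It is not the exact code from the project. It assumes COM references to the "Microsoft CDO for Windows 2000 Library" and to ADODB, and AspCompat="true" on the page. The page class, the handler name and the print-view URL are hypothetical.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Web.UI;

// Sketch only: needs COM references to CDO (cdosys.dll) and ADODB, and
// <%@ Page ... AspCompat="true" %> on the page that hosts this code-behind.
public partial class ExportToWord : Page              // hypothetical page name
{
    protected void ExportButton_Click(object sender, EventArgs e)   // hypothetical handler
    {
        // Hypothetical URL of a Word-friendly version of the current page.
        string url = "http://localhost/MySite/PrintView.aspx";

        CDO.Message message = new CDO.MessageClass();
        try
        {
            // CDO fetches the page plus the images and stylesheets it references.
            // The two empty strings are the user name and password (none here).
            message.CreateMHTMLBody(url, CDO.CdoMHTMLFlags.cdoSuppressNone, "", "");

            // The encoded multipart document comes back as an ADODB stream.
            ADODB.Stream stream = message.GetStream();
            string mhtml = stream.ReadText(stream.Size);
            stream.Close();

            // Send it to the browser labelled as a Word document.
            Response.Clear();
            Response.ContentType = "application/msword";
            Response.AddHeader("Content-Disposition", "attachment;filename=AFileName.doc");
            Response.Write(mhtml);
            Response.End();
        }
        finally
        {
            // Release the COM object promptly rather than waiting for the GC.
            Marshal.ReleaseComObject(message);
        }
    }
}
```

Keeping the export URL separate from the page the user is viewing is deliberate: CDO makes its own HTTP request, so anything that depends on the user's session or login will need to be passed along some other way.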


This solution should work for all browser types, but because of the Interop it will not be a great performer, so users will have to make judicious use of this feature! There is also the effort of making a version of your web page designed to look good as a Word document: by default most web sites don't convert well unless they were designed with Word's narrow, multi-page format in mind, and users won't want meaningless navigation and links to nowhere in their document. You can use regular expressions to strip out all the links like this:


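using System.Text.RegularExpressions;
// Strip opening and closing anchor tags but keep the link text.
// \b stops the pattern from also matching tags such as <abbr> or <address>.
Regex href = new Regex(@"</?a\b[^>]*>", RegexOptions.IgnoreCase | RegexOptions.Compiled);
string document = AMethodToGetTheMIMEEncodedFile();
document = href.Replace(document, string.Empty);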


Also, thanks to the wondrous power of stylesheets you can get your website pages looking pretty good without too much effort and you can provide your users with great looking editable versions of data on your web site.


In the next version of Office, which has an XML native document format, we have additional options for achieving this and I look forward to exploring those! In the meantime if anybody knows of a better way of doing this, I would love to hear about it 🙂


Comments (1)

  1. SergiB says:

    Hi!

    I’m trying to do a sample project with this functionality in my company (in Barcelona, Spain), but I have a problem that is very important to solve for us:

    We build web applications in Spanish and Catalan, that is, using non-ASCII characters (accented vowels, special consonants); I think I can use the “iso-8859-1” (Latin) charset or the standard “UTF-8” charset to deal with this.

    The problem is this: our web pages show the localized text correctly, but when I export the page content to a “.doc” document using this MHT functionality, the DOC file opened in Word 2003 shows wrong, strange characters instead. I cannot manage to load the doc with the localized characters intact.

    I’ve tried several things, like forcing a UTF-8 charset in the ASPX source header, or forcing an ISO-8859-1 charset into the “HTMLBodyPart” property of the CDO.Message object, but with no result.

    Also, I’m a bit confused about a few things: the “ContentTransferEncoding” property and its possible values (“quoted-printable”, “8bit”,…), the “Charset” property and its values (“UTF-8”, “ISO-8859-1”, “US-ASCII”,…), the “ContentMediaType” property (“text/html”, “text/plain”,…), and the possible combinations of values for these 3 Message properties.

    I’ve also seen that “MHT” files saved by Word 2003 use the “us-ascii” charset in the HTMLBodyPart, despite any localized non-ASCII characters in the doc content, but I understand that “us-ascii” uses 7 bits, as opposed to the Latin charset, which uses 8 bits…

    So, with all this… can you help me in any way? How can I save an aspx web page content, with special Latin characters, into a Word DOC document and be able to view it ok?

    Thank you very much!!

    Sergi.

    ————————————————

    Hi Sergi,

    I experimented with saving a page from a Spanish-language site as an MHT file using IE’s File-Save As feature; I renamed this file to .doc and it opened in Word ok (complete with all non-standard ASCII characters). I examined the file in a text editor and noticed that the content-type is set to ‘Content-Type: … charset="iso-8859-1"’ whereas this would be charset="utf-8" by default. What does your code create? If it’s charset="utf-8" then that’s your problem – you need to switch it. I don’t know of a way of getting the CDO component to do this; it may be simplest to do a search-and-replace of the stream the CDO component makes in code and switch them out. Not very elegant but it might be worth a try…

    – Adam

     
