XHTML in Word 2007’s blogging tool


Today we have a guest writer to discuss the HTML output that we have in the new blogging functionality for Word 2007. His name is Zeyad Rajabi and he’s a program manager on the Word team. Zeyad works on file format related issues, including the HTML support in Word. All of Zeyad’s posts will be under the “Word HTML” category if you are interested in tracking those separately.


As some of you may know from Joe Friend’s blog, Word 2007 will allow users to author blogs straight from Word. I want to follow up on Joe’s blog by giving you guys more details concerning our XHTML output for the blogging feature. I hope to use this blog as an opportunity for you to comment on our blogging XHTML output and to make any suggestions.



Goals



Before I get into details about our XHTML output, I want to outline the goals for our blogging feature. The design goals behind the XHTML output from the blog tool are significantly different from what we’ve done in the past:




  • Output XHTML compliant code for each post (we are following the W3C spec)


  • Output clean and readable XHTML


Instead of concentrating on supporting 100% of Word’s features (as we did in the past), the blog feature will support a much smaller set of features and concentrate on outputting clean and readable XHTML. The blog feature will output only the XHTML needed to represent the document. No more redundant HTML or CSS. No more Microsoft Office specific CSS properties. We will output just clean and easy to read XHTML.



Known Beta 2 Issues



There are still some known bugs in the XHTML output for Beta 2. I wanted to point them out so that you aren’t surprised:




  • Strikethrough – We are outputting the CSS text-decoration property for strikethrough instead of <del>


  • Divs around lists – We are outputting div tags for every list item. We do not need to output these extra elements


  • Block level elements within inline elements – We are not XHTML compliant in some cases because we are not following proper tag content flow. We are outputting block-level elements inside inline elements.


  • Multi-level lists – Our output for multi-level lists is not XHTML compliant: we are closing the outer lists before their sub-lists are closed


  • Table bloat – Our XHTML output for tables is too heavy and contains too much redundancy

I am sure there are more bugs to be found and I’m sure you guys will help me add to the list! As you play with the blogging feature, please feel free to send me any questions or suggestions you have. I want to make this feature great for all of us.
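
To make two of the issues above concrete (multi-level lists and block-level elements inside inline elements), here is a small sketch in Python. It is not part of Word or the blogging tool; the function names and the INLINE/BLOCK tag sets are assumptions chosen for illustration, and the sets cover only a handful of elements rather than the full XHTML content model.

```python
import xml.etree.ElementTree as ET

# Illustrative subsets only (assumption), not the full XHTML content model.
INLINE = {"span", "a", "strong", "em", "u", "del"}
BLOCK = {"p", "div", "ul", "ol", "li", "blockquote", "table"}


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML once wrapped in a root element."""
    try:
        ET.fromstring(f"<body>{fragment}</body>")
        return True
    except ET.ParseError:
        return False


def block_inside_inline(fragment: str) -> list:
    """Return (inline_ancestor, block_descendant) tag pairs found in the fragment."""
    root = ET.fromstring(f"<body>{fragment}</body>")
    problems = []
    for parent in root.iter():
        if parent.tag in INLINE:
            for child in parent.iter():
                if child is not parent and child.tag in BLOCK:
                    problems.append((parent.tag, child.tag))
    return problems


# A sub-list nested inside its parent list item, closed in the right order.
good_list = "<ul><li>one<ul><li>sub</li></ul></li><li>two</li></ul>"
# The outer list item is closed before the sub-list is closed.
bad_list = "<ul><li>one<ul><li>sub</li></li></ul></ul>"

print(is_well_formed(good_list))
print(is_well_formed(bad_list))
print(block_inside_inline("<span><p>text</p></span>"))
```

A check like this only catches well-formedness and the simple nesting rule sketched here; validating against a specific XHTML DTD would need a real validator.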



XHTML Output



There is too much to discuss in this first post, so I think I’ll break down the XHTML output into multiple categories: formatting, styles, lists, images and tables. I’ll have a separate post for each category so we can have some more targeted discussions. Another thing that I was thinking about doing was pulling all of this together as a public spec that I can post. Again, I would love for you to send me suggestions on any or all of the categories.


For those interested, take a look at the source code the blogging tool generated for this post (note that it’s only the contents that would go inside the <body>).



Formatting



Today let’s look at some details around the XHTML we output for formatting features:

Feature          XHTML
Hyperlinks       <a href="http://www.foo.com" target="_blank" title="Tip">hyperlink</a>
Font             <span style="font-family:XXXXXX;">text</span>
Font Size        <span style="font-size:28pt">text</span>
Font Color       <span style="color:XXXXXX">Colored text </span>
Bold             <strong>text</strong>
Italic           <em>text</em>
Underline        <u>text</u>
Strikethrough    <del>text</del>
Highlighter      <span style="background-color:XXXXXX">text</span>
Alignment        <p align="left">text</p>
                 <p align="right">text</p>
                 <p align="center">text</p>
Indent           <blockquote>text</blockquote>
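
For readers who prefer code to tables, the feature-to-markup pairs listed above can be sketched as a small lookup in Python. This is only an illustration of the mapping, not how Word generates its output; the names SIMPLE_TAGS, STYLE_PROPS and render are made up for this sketch, hyperlinks are left out for brevity, and the output is not guaranteed to match Word’s byte for byte.

```python
from xml.sax.saxutils import escape, quoteattr

# Features that map straight to an element (names are assumptions for this sketch).
SIMPLE_TAGS = {
    "bold": "strong",
    "italic": "em",
    "underline": "u",
    "strikethrough": "del",
    "indent": "blockquote",
}

# Features that map to a CSS property on a <span style="...">.
STYLE_PROPS = {
    "font": "font-family",
    "font size": "font-size",
    "font color": "color",
    "highlighter": "background-color",
}


def render(feature: str, text: str, value: str = "") -> str:
    """Wrap text in the markup listed for the given formatting feature."""
    body = escape(text)  # escape &, < and > so the result stays well-formed
    if feature in SIMPLE_TAGS:
        tag = SIMPLE_TAGS[feature]
        return f"<{tag}>{body}</{tag}>"
    if feature in STYLE_PROPS:
        style = quoteattr(f"{STYLE_PROPS[feature]}:{value}")
        return f"<span style={style}>{body}</span>"
    if feature == "alignment":
        return f"<p align={quoteattr(value)}>{body}</p>"
    raise ValueError(f"unknown feature: {feature}")


print(render("bold", "text"))
print(render("font size", "text", "28pt"))
print(render("alignment", "text", "center"))
```

Keeping the mapping in two small dictionaries makes it easy to see what would change if, say, alignment moved from the align attribute to a CSS property.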



Suggestions are Welcome



I know there are a couple different approaches for all of these. If you disagree with our approach let me know. I’ve read a lot of differing opinions on some of these (especially indentation), so while we probably won’t get everyone to agree 100% on the approach, hopefully we can find the best approach.


Anything missing? Is there a better way of representing a feature in XHTML?

Comments (41)

  1. Mike says:

    Looks like generated XHTML tag names are in upper case, shouldn’t they be lowercase ?

  2. Mike says:

    The obvious question, what about fonts?

    How Office 2007-specific fonts are translated to be rendered on the web on anyone’s machine? Are they? Jensen Harris explained that one of those fonts is chosen by default in any of Office 2007 applications, so that makes the issue even more blatant.

  3. k says:

    you guys hurt my head.

    you do realise that  the target attribute is deprecated in xhtml right?

    align=left – is it still 1998?  as it appears to be text you’re aligning how about the following css:

     text-align: right;

    you’re not seriously planning on using the blockquote tag to specify indentation are you?

    jeez

  4. mystere says:

    bad bad bad bad bad….

    Ok, so not as bad as it used to be, and this is a huge improvement, but gratuitous use of style tags are *nearly* as bad as legacy HTML font and various other presentation tags.

    The whole point is to remove presentation from the HTML, not simply change the way you embed it in the HTML.  

    Also, while align and target are valid in XHTML 1.0 Transitional, they’re deprecated, and not valid in 1.0 strict or 1.1.

    Why not generate a style sheet?

  5. y says:

    Why are you using semantic elements like "strong" and "em"?  Do you really have any indication that those are the semantics actually intended, or are you just imputing that from the appearance of the text?  If you don’t have real semantic information, then shouldn’t you be using CSS to supply these, or just plain "b" and "i"?  It might be appropriate to use "del" for text that Word has marked as deleted, but why use it for text that is simply styled with a strikethrough?  And why are you using the deprecated "u"?  

  6. ola says:

    I agree with mystere.  I don’t get the impression you’re really trying at all here.  Put some effort into it man.

  7. rasx says:

    You will definitely need an options panel to allow users to control how much attribute-based formatting they want vs. css-based formatting. One case, I address with my Word 2003 tool, CleanXHTML, is to turn align="left" off for CSS friendliness.

    Denying "power users" to choose these levels will be another, "classic" Microsoft move that should be avoided at all costs.

  8. Peter Sefton says:

    I posted a comment to a previous post from Joe, but I’ll repeat the link here.

    My suggestion is to use styles (or at least give the option) to drive the HTML formatting. Zeyad says you have some issues with lists and indenting. Of course you do – trying to largely flat map word processing formatting to a nested format.

    With a good set of styles then you can map from the word processor to XHTML much more reliably. Why not ship Word with a better set of styles than what comes in the Normal template by default?

    This post on my site has some pointers to more information about the way we do it, in a system that works with both Word and OpenDocument:

    http://ptsefton.com/blog/2006/05/13/beyond_blogging:_style-driven__html_export_from_2007._please.

  9. Jon Peltier says:

    "Denying "power users" to choose these levels will be another, "classic" Microsoft move that should be avoided at all costs."

    They’re already doing it with the whole Ribbon UI.

    – Jon Peltier

  10. Zeyad Rajabi says:

    First off I want to thank you guys for your comments. I really want to make my posts about getting your feedback and opinions. Remember that this tool is not finished yet. Part of the process of shipping this tool within Word is getting user feedback. We want to understand what type of XHTML and CSS you guys prefer. We will certainly use your feedback and suggestions in making this blog feature in Word great.

    Thank you Mike for mentioning that we are outputting XHTML tags that are in upper case. I should have mentioned that there is a bug in our Beta 2 release of the blogging tool where certain CSS properties in tables are being output in upper case. We are certainly working on fixing that issue in order to make sure we output valid XHTML.

    Our blogging tool does not output font for content unless the author directly formats text with a given font. If a user explicitly formats text with a certain font we will output <span style="font-family:XXXX">text</span>.

    k brings up a great point about align being deprecated. I love to see these types of comments. It will help keep me in check and make sure our XHTML output is valid and clean. I really want to get you guys involved in making suggestions that will help us improve our XHTML output. Do you have a suggestion for indentation?

    mystere, unfortunately we cannot generate a style sheet for our blogging solution because the blogging sites out there would just strip it off. Only the body of the XHTML would be consumed by the blogging sites.

    Peter, thank you for your suggestion. In my next post I will touch more on styles and what type of styles our template provides out of the box.

    y brings up an interesting topic: why are we using strong and em instead of bold and italic? I am going to talk more about styles in the blogging tool in my next post. We are providing users with a set of styles that they can apply to a blog within Word. A couple of those styles provided are strong and em. What do other people think about the usage of bold and italic vs. strong and em?

    Providing more power to our users in terms of XHTML output is a great idea. We will certainly look into what we can do to improve that scenario.

    Zeyad

  11. Dean Harding says:

    Wow, lots of MS-haters here! Personally, I can’t wait for this feature. I use Word 2003 to type out my blogs at the moment (mainly because of the spell checking/autocorrect) and I end up putting in the (X)HTML once I’ve cut’n’pasted my text from Word into my browser. Actually, the only reason I do it like that is because while I like the smart quotes in my text, I can’t have them in my <a href=""> tags, so I go in and add them all once I’ve typed the text out.

    At least now (well, soon) I will just be able to do it all from Word!

  12. Dean Harding says:

    Mike: the other thing about the upper-case tags is that the blogging software on blogs.msdn.com is Community Server, which uses FreeTextBox for its WYSIWYG editor. So if the author posted from Word, then opens the post up in the Community Server editor, FreeTextBox munges all the tags so that they’re upper case. It’s pretty annoying actually, and it produces some horrible HTML, but that’s what you get with web-based WYSIWYG editors, I guess.

    Zeyad: I think what people are recommending in terms of separate CSS is an actual separate file which we could install on the server separately, then instead of <span style=""> tags, we’d get <span class=""> tags or something. Myself, I don’t think that makes sense. After all, that’s not really how Word works – and as good as it is that we can get this clean XHTML out of Word, I don’t think trying to force it to a whole new paradigm just for blogging is a smart thing to do.

  13. k says:

    zeyad, it depends on what you’re trying to indent, but usually a:

      <p style="padding-left:10px;"></p>

    or similar, should do the trick

    blockquote should only be used when quoting someone

  14. Zeyad Rajabi says:

    Word supports a subset of the standard HTML 4.01 specification and similarly a subset of the standard CSS 1.0 specification. Unfortunately there are certain CSS properties that are not understood when applied to certain elements. Padding CSS properties cannot be applied to span, div, or p elements.

    We can use margin CSS properties on div or p elements. Any objections to using margin CSS properties for indentation?

    For anyone who is interested I am currently working on publishing Word’s HTML and CSS specification on MSDN. As soon as it is ready I will let everyone know.

    Zeyad

  15. Can we please please not put target=_blank on hyperlinks by default? Yuck. Any aggregator that doesn’t do that automatically isn’t worth considering, and when I’m reading your blog by hand I don’t want you popping up new windows all over the place unless I ask you to. Geez.

    Having said that in general I like these new features a lot. I disagree with the poster above who thinks you shouldn’t use <strong> and <em>. The chances of the semantic tags being correct are vastly greater than the chances that users will ever, in practice, do anything other than click on the "B" and "I" icons in Word’s toolbar to indicate emphasis semantics. A force of habit that strong is impossible to *ever* change.

    For explicit formatting that has no obvious equivalent semantic choice, I think I’d like the option to do it the old/bad way "<font color=…>", "<table bgcolor=…>" etc. This is because some aggregators, bloglines for example that I know of, will strip out style attributes entirely because the full power of CSS is all but impossible to sanitize safely. This sucks, but it’s the way the world is. Obviously you couldn’t do that in an XHTML export, but in a straight HTML version you could, and my opinion is that you should, at least as an option.

    And I agree with other posters that it should be possible to get this clean HTML/XHTML via a regular "Save As…" option, as well as from the blogging tool.

  16. k says:

    using margin instead of padding should be fine, it’ll probably give you better compatibility with IE anyhow.

    out of interest, if Word only supports a subset of HTML 4, why not fix word so that it supports XHTML 1.0 in the first place?

  17. sonu27 says:

    How can you be XHTML compliant when you are using the <U> tag? That is so outdated.

    Also the CSS needs to be lowercase that how everyone likes it!

  18. As I promised, we are posting details of the HTML output spec for the blog feature and are interested…

  19. J says:

    <ins> and <del> are deprecated in the working draft of the XHTML 2.0 standard.

    There’s others, too.

  20. Ron says:

    Looking at your source code with SCE Validator I noticed the tags around <body> haven’t been replaced with the entity references &gt;&lt;, so the validator was picking up the body tag and showing an error.

    Also, ‘valign’, ‘border’ and ‘align’ are all deprecated in XHTML. It’s all well and good to aim for XHTML compliance, but you need to choose which version you want to support!

    1.0, 1.1, Strict, Transitional? Which one is it?

    http://www.w3.org/QA/2002/04/valid-dtd-list.html

    A simple way to render a table with borders is to use something like this:

    <style type="text/css">
    table {
      background-color: #000;
    }
    td {
      background-color: #fff;
    }
    </style></head><body>
    <table cellspacing="1">
      <tr>
        <td>1</td>
        <td>2</td>
      </tr>
      <tr>
        <td>3</td>
        <td>4</td>
      </tr>
    </table>

    However this isn’t very printer friendly as it requires the user to allow background colours to be printed.

    Also, think about combining CSS values, for example:

    – padding: 1em 2em 3em 4em;

    Also, you can’t wrap H1 (2,3,4) inside P, same way you can’t have UL inside P, etc.

    Anyway, why are we teaching MS about (X)HTML and CSS? You’re supposed to be the experts, XHTML is not subjective, there’s a right way and a wrong way to format something. Don’t get caught up in debates about what’s right or wrong because it’s already defined in the specifications.

    The HTML 4.01 spec is a good place to look for valid markup, as XHTML doesn’t replace much from normal HTML 4.01.

    See here:

    http://www.w3.org/TR/html4/present/graphics.html#adef-align

  21. Tom Edwards says:

    Actually k, the property for text indentation is literally text-indent. 😉

  22. Jonathan P says:

    I think the first thing that ought to be addressed is WHAT XHTML version is the blogging tool going to support. Saying that you support XHTML is vague at best. Sound choices probably lie in XHTML Basic and XHTML 1.0 Strict. Once this is established you have a baseline for what tags and attributes can or should be supported.

    Secondly, there seems to be an unclear distinction between presentation and semantics. Bold (<b>) doesn’t equal strong emphasis (<strong>), italicize (<i>) doesn’t equal emphasis (<em>), strikethrough doesn’t equal deleted (<del>).

    It’s my firm opinion if you can’t determine the meaning behind the user’s style choices then you can’t blatantly add semantics to the output document. Outputting <b> and <i> (or style equivalents) for bold and italicize respectively is safe and ultimately the right thing to do without additional information of the user’s intent.

    Strikethrough should be left as is.

  23. k says:

    Tom I stand corrected.

  24. Zeyad Rajabi says:

    Our aim is to be as close as possible to XHTML strict 1.1 by the time we ship this feature. That being said, as I mentioned in a previous comment, Word has some limitations as to what CSS is supported. I will touch more upon this limitation when I start talking about images and in particular floating images.

    To be safe I am going to say we are aiming to be XHTML Transitional compliant, depending on the content of the blog. There are a lot of balancing decisions being made in terms of fidelity vs. being XHTML compliant. I hope to use these posts as a means of ironing out those kinks.

    Jonathan you bring up an interesting topic: semantic vs. presentation. We decided to go down the route of presentation for bold and italic rather than output both bold/italic and strong/emphasis. What would be the main advantage of outputting bold instead of strong? I am trying to think of this in terms of all the different types of users who will use this feature; from the common user to the expert user.

  25. Mr K says:

    I was hoping this feature (clean XHTML) would be available as a standard save option but, from a quick look at the Word Beta, it seems it is only available on the blog publish.

    Is there no way to get the XHTML code without using the blog publish?

    If so, why not make this an available save option?

    Surely getting the world’s most widely used document format into clean HTML and onto the web should be easier?

  26. Jonathan P. says:

    Zeyad – I believe most of this discussion is purely academic. Be that as it may I like to see things done “right”. As far as XHTML version it would seem to me that XHTML 1.0 Strict or XHTML 1.0 Transitional should be the target. XHTML 1.1 (there is no strict version) offers few benefits over XHTML 1.0 Strict and deprecates the use of the style attribute (although still valid). Since XHTML 1.1 cuts out most of the presentational tags and frowns on the use of the style attribute this gives Word little latitude to add stylistic information to the output document. It doesn’t sound like Word will be generating classes and creating internal or external style sheets, not to mention this probably wouldn’t work well in a blogging scenario. Also XHTML 1.0 can be easily converted to HTML (4.01) as well.

    The decision on whether to use strict or transitional versions of XHTML 1.0 lies in where you want to put your presentation. It seems you’re either stuck with presentational tags (transitional) or the style attribute. Neither of these are ideal but they both work. Using XHTML 1.0 Strict and the style attribute is probably better since you don’t muddle the XHTML source with unnecessary tags. Should you need to make a change or use the resulting XHTML elsewhere you can strip or simply modify style attributes.

    Perhaps the most gaping problem people fail to understand is the huge difference in philosophy between XHTML and Word. A good XHTML document is all about semantics and Word has very few explicit semantics. If you were to do a direct translation of a Word document to XHTML you would have an XHTML document full of <p> and <span> tags set with style and class attributes to display as headings, lists, etc. Since some styles are really common (like lists and headings) generating the more appropriate heading, ordered list, unordered list, and list item tags is possible.

    The main point I wanted to make, however, is that some style choices don’t necessarily correspond to semantics (like the ones people choose from toolbars when they aren’t picking “styles”, and this is probably how the majority of the people in the world use Word). The big debate here seems to be bold vs. strong emphasis and italics vs. (regular) emphasis. The bottom line is Word knows I want something to be bold or italics, but it has no clue WHY I want something to be bold or italics. As such it makes more sense to use <b> and <i> tags over <strong> and <em>. The same philosophy can be applied to indenting vs. block quote and strikethrough vs. delete.

  27. HI John says:

    I agree 100% with Mr. K, when he says, "Is there no way to get the XHTML code without using the blog publish?"

    "…why not make this an available save option?"

    Can someone please answer these questions?

    My wife and I work on maintaining several Web sites, none of which would qualify as a ‘blog’. Why is MS suddenly abandoning us, in favor of bloggers. I could understand if you had already incorporated an .html export system in Word. But, in Word 2007, apparently you’ve removed what little support there was.

    Blogs are used generally by .html dummies and I can understand your wanting to score some points with them, but what about the rest of us who know something about .html and would like to use MS Word to help us create Web pages?

  28. Zeyad Rajabi says:

    Currently the plan is to only make the XHTML output available via the blogging feature. That being said we are looking into different ways of making this available; like via the OM. The reason we decided not to offer a Save As option for the XHTML output is because the level of fidelity does not match all of Word’s feature richness. This blogging feature is meant to maintain the fidelity of a small percentage of the total feature set available from Word.  

    Zeyad

  29. Zeyad Rajabi says:

    The XHTML output is specific to the blogging feature. The reason we decided not to offer a Save As option for the XHTML output is because the level of fidelity does not match all of Word’s feature richness. This blogging feature is meant to maintain the fidelity of a small percentage of the total feature set available from Word.  

    Zeyad

  30. Ron says:

    This joke seriously cracks me up!

    "Blogs are used generally by .html dummies and I can understand your wanting to score some points with them, but what about the rest of us who know something about .html and would like to use MS Word to help us create Web pages?

    "

    hehe, good one. 😉

    Please for the love of the Internet, don’t offer a "Save as HTML" feature. Zeyad has the right idea, good on ya.

  31. Sam Sethi says:

    Anyone who follows the W3C’s XHTML development and that of the web should be aware that presentational markup is being removed out of the body of the text in favour of CSS whether it is a blog or webpage.  

    Word with its templates and autoformat functionality is well placed to deliver this presentational separation. A simple word doc could be saved as XHTML without presentational markup but then that might diminish the role of the .DOC format!

    Jonathon P makes some very good points. Pick your flavour of XHTML transitional (very safe but pointless), strict 1.0 (expected and a good starting point) or even may I be as bold to emphasis XHTML 2.0 (daring and leading edge but unlikely) which is now a W3C standard or will be by the time Word 2007 ships.

    Equally what about the support for mobile word in the Mobile 5.0 OS.  Will XHTML-basic/MP be supported?

    Finally why is there a reference to Yahoo in the source code?

    http://geo.yahoo.com/serv?s=76001405&t=1149288117  

  32. Mr K says:

    "The reason we decided not to offer a Save As option for the XHTML output is because the level of fidelity does not match all of Word’s feature richness."

    Then why offer the ability to save a Word document as a text file?

    At least if a save as XHTML feature could be accessed through some obscure VBA code then that would be good enough.

    It is not a case of using Word to design webpages but simply making it easier to get the contents of Word documents onto webpages.

    It doesn’t have to be a standard or obvious feature but at least if it could be made available without using the blog publish then I’m sure it would prove useful for many people who get Word docs and then have to get them onto the web without all of the Word HTML tags.

  33. One of the features in Word 2007 Beta 2 is the ability to author blog posts. Joe Friend announced the…

  34. This is the third post by Zeyad Rajabi who owns the XHTML output from Word’s new blogging feature. In…

  35. Here’s an encouraging titbit that demonstrates that community server’s growth is set to continue. …

  36. Zeyad Rajabi says:

    We are looking into ways of providing our XHTML output outside the scope of just blogging. One such mechanism is through the OM.

    Mr. K, saving as plain text differs from our XHTML output in what it sets out to do. Our XHTML output is only a small subset of the total XHTML specification. Only a percentage of Word’s features can be represented by our limited XHTML output. As we continue to build on top of this blogging feature, and specifically our XHTML output, we will support more and more of Word’s features. As is, we do not want to provide our XHTML output as a Save As option because of the feature degradation. Plain text differs in that our output preserves as much fidelity as plain text allows. When we get to that level for our XHTML output I do not see any reason why we wouldn’t have it on the Save As menu.

    Zeyad

  37. This is the fourth post by Zeyad Rajabi who owns the XHTML output from Word’s new blogging feature. In…

  38. This is the fifth post by Zeyad Rajabi who owns the XHTML output from Word’s new blogging feature. In…