Word XHTML - Mapping styles to semantics

This is the third post by Zeyad Rajabi, who owns the XHTML output from Word's new blogging feature. In earlier posts, Zeyad gave a general overview of the XHTML and then went into more detail on XHTML compliance. Today Zeyad discusses the ways in which styles have been directly tied to specific XHTML tags.

Today I want to talk a bit about the template that we use for the Word 2007 blogging feature. Word has always concentrated on the presentation of documents and on making it easy for people to quickly create a great-looking document. The area that we haven't focused on as much, though, is allowing people to better specify the semantic meaning of their content. We've been slowly moving in that direction with the custom XML support in Word 2003 and the content controls and XML mapping in Word 2007. We actually use a content control to allow you to specify the blog's title.

One of the oldest ways folks would specify semantic meaning in a Word document, though, was by using styles, and we've done work in Word 2007 to make styles much more convenient for the average end user. We've created a number of Word styles that we map directly to XHTML tags with semantic meaning (like <strong>, <em>, and <blockquote>). We then let the browser and blog sites determine how to render these tags (based on stylesheets, etc.).

In Word 2007, one of our investments was giving users easy access to styles via "quick styles". In our blog template we provide a list of styles that can be applied to the contents of the document, as you can see in the screenshot below:

[Screenshot: Styles]

These styles are all significant in that we can map them directly to XHTML tags (rather than simply to formatting properties). Below is a table listing all the styles provided by our blog template and their XHTML equivalents.

Style       HTML
--------    ---------------------------
Heading     h1, h2, h3, h4, h5, h6
Normal      p
Quote       blockquote
Code        <pre><code>… </code></pre>*
Strong      strong
Emphasis    em

*The code style is being added post Beta 2.
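
To make the mapping in the table concrete, here is a minimal Python sketch of a lookup from style name to XHTML tags. It is an illustration only, not Word's implementation: the names `STYLE_TO_TAGS` and `render` are hypothetical, and passing the heading level in as a parameter is an assumption, since the table does not say how Word picks between h1 and h6.

```python
from html import escape

# Hypothetical lookup built from the style/tag table above.
# Each style maps to a sequence of tags, outermost first, so that
# Code can nest <code> inside <pre>.
STYLE_TO_TAGS = {
    "Normal": ("p",),
    "Quote": ("blockquote",),
    "Code": ("pre", "code"),
    "Strong": ("strong",),
    "Emphasis": ("em",),
}


def render(style: str, text: str, heading_level: int = 1) -> str:
    """Wrap text in the XHTML tag(s) mapped to the given style name."""
    if style == "Heading":
        # Assumption: the caller supplies the heading level.
        if not 1 <= heading_level <= 6:
            raise ValueError("heading_level must be between 1 and 6")
        tags = (f"h{heading_level}",)
    else:
        tags = STYLE_TO_TAGS[style]
    opening = "".join(f"<{tag}>" for tag in tags)
    closing = "".join(f"</{tag}>" for tag in reversed(tags))
    # Escape markup characters so the text cannot break the output.
    return f"{opening}{escape(text, quote=False)}{closing}"


print(render("Heading", "My post", heading_level=2))
print(render("Emphasis", "really"))
```

The sketch ignores nesting rules (for example, what may appear directly inside a blockquote) and only shows the one-to-one correspondence between a style and its tag.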

What do you guys think of having a style called "Code" that actually nests pre and code together? Word differs from the web in that we automatically preserve whitespace, so in order to correctly output XHTML for the Code style we also need to output the <pre> (preformatted) tag.
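
The whitespace point can be illustrated with a small Python sketch. The function `collapsed` is only a rough stand-in for how browsers treat whitespace in ordinary flow text (runs of spaces and newlines render as one space), and `code_style_to_xhtml` is a hypothetical helper showing the nested output described above; neither is taken from Word.

```python
import re
from html import escape


def collapsed(text: str) -> str:
    """Roughly mimic default whitespace handling outside <pre>:
    runs of spaces, tabs and newlines are shown as a single space."""
    return re.sub(r"\s+", " ", text).strip()


def code_style_to_xhtml(source: str) -> str:
    """Hypothetical output for the Code style: <code> nested in <pre>,
    with the original whitespace left untouched."""
    return f"<pre><code>{escape(source, quote=False)}</code></pre>"


snippet = "if (x < 10) {\n    y = x;\n}"

# Without <pre>, the indentation and line breaks would be lost on display.
print(collapsed(snippet))

# With <pre>, the markup keeps every space and newline from the source.
print(code_style_to_xhtml(snippet))
```

The <code> tag alone carries the semantics ("this is code") but not the layout, which is why the two tags end up nested.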

One interesting discussion that came up in some of my previous posts was whether it was better to use <b> and <i> or <strong> and <em> when people applied bold or italic formatting to their text. Having the <b> and <i> tags in the HTML makes it likely that the look of the document will not change regardless of style sheet (it's more likely that <strong> would have CSS properties applied than <b>). On the other hand, <strong> and <em> provide much more flexibility in that they really only imply semantics and not display values. While <strong> and <em> have a default presentation, it is often overridden by the CSS of the page or the rendering engine.

Some people were saying that there are occasions where <b> and <i> better capture what the user intended. While I agree that may be the case at times, I also believe our UI encourages the use of bold and italic when the user was often just trying to convey the semantics. In most cases, I believe they would have specified strong or emphasis if it were as easy and obvious as bold and italic (there just hasn't been a benefit to doing so in the past). Since we are going the route of XHTML compliance and we are concentrating on structure rather than presentation, we opted to always output strong/em rather than a more confusing mixture of the two (bold/italic and strong/em).
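
As a rough sketch of that decision, the Python below maps the bold and italic flags on a run of text to <strong> and <em>, never to <b> or <i>. The function name `run_to_xhtml` and its parameters are hypothetical, and the nesting order (strong outside em) is an arbitrary choice for the example.

```python
from html import escape


def run_to_xhtml(text: str, bold: bool = False, italic: bool = False) -> str:
    """Emit semantic tags for direct bold/italic formatting.

    Bold always becomes <strong> and italic always becomes <em>,
    so the output never mixes presentational and semantic tags.
    """
    out = escape(text, quote=False)
    if italic:
        out = f"<em>{out}</em>"
    if bold:
        # Assumption: <strong> wraps <em> when both are set.
        out = f"<strong>{out}</strong>"
    return out


print(run_to_xhtml("important", bold=True))
print(run_to_xhtml("very important", bold=True, italic=True))
```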

Custom Styles

One area that we are looking at investing in is giving folks the ability to add custom styles to the blogging template that would then be output as a simple class attribute. So unlike the above examples, where the style is mapped to a specific XHTML tag, we would simply output a <p> or <span> whose class name matches the style. So, if a user adds the style "foo" to their blogging template, then when that style is applied we would output:

<p class="foo">…..</p>

We would not output the formatting information for the style because in most cases the CSS would be stripped upon publishing to a blog provider. Instead, with this approach, you could rely on the CSS of the host site of the blog to specify the presentation information for those custom styles.
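
A minimal Python sketch of this idea follows. It is hypothetical, since the feature is still under consideration: the function name `custom_style_to_xhtml` and the `is_paragraph_style` flag (used here to choose between <p> and <span>) are assumptions for illustration, and the style name is assumed to already be usable as a class name.

```python
from html import escape


def custom_style_to_xhtml(
    style_name: str, text: str, is_paragraph_style: bool = True
) -> str:
    """Output a custom style as a class name, with no inline formatting.

    Assumption: paragraph styles map to <p> and character styles to <span>.
    """
    tag = "p" if is_paragraph_style else "span"
    # Escape quotes in the attribute value and markup in the text.
    class_attr = escape(style_name, quote=True)
    return f'<{tag} class="{class_attr}">{escape(text, quote=False)}</{tag}>'


print(custom_style_to_xhtml("foo", "Some text"))
print(custom_style_to_xhtml("foo", "a few words", is_paragraph_style=False))
```

The host site could then style the result with a rule such as `p.foo { ... }` in its own CSS.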

Comments are welcome

Any comments or questions are welcome. Also, let me know if there are any other similar structures you guys are interested in talking about next (ordered and unordered lists, definition lists).