Word XHTML – Bullets and Numbering


This is the fourth post by Zeyad Rajabi, who owns the XHTML output from Word’s new blogging feature. In earlier posts, Zeyad gave a general overview of the XHTML, covered the details of XHTML compliance, and explained how we map styles to semantics. Today Zeyad is discussing the ways in which styles have been directly tied to specific XHTML tags.


Today will be a short post about lists in our blogging feature. Word 2007 provides a rich editing experience that lets you create many different types of lists, from simple one-level lists and multi-level lists to custom-defined bulleted and numbered lists.


Given the time and resource constraints on our blogging feature, we decided to take a simpler route with lists. Our blogging feature outputs only two types of lists: unordered and ordered (we do not support definition lists). That is, we rely solely on the <ul> and <ol> HTML elements to render lists, which leaves the look of each list entirely to the host browser.


For this release of the blogging feature we are not going to output the following CSS properties:




  • list-style


  • list-style-image


  • list-style-position


  • list-style-type

Not outputting these CSS properties limits the fidelity of our blogging feature compared with the full power of Word 2007’s bullets and numbering feature.


Word 2007 lets you define custom list styles, such as using the strings “Heading 1” and “Heading 2” to label different levels in a list. Because we rely only on the <ul> and <ol> HTML elements and not on the CSS properties listed above, the blogging feature supports far fewer kinds of lists than Word 2007 does.
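
Because the output carries none of the list-style properties, marker appearance falls to the browser defaults or to the blog's own stylesheet. As a rough sketch only, a blog theme could pick markers per nesting level along these lines. The `.post` wrapper class is an assumption for illustration and is not something Word emits.

```html
<style>
  /* .post is an assumed wrapper class supplied by the blog theme, not by Word */
  .post ol    { list-style-type: decimal; }      /* 1. 2. 3. */
  .post ol ol { list-style-type: lower-alpha; }  /* a. b. c. for nested numbered lists */
  .post ul    { list-style-type: disc; }         /* filled bullet */
  .post ul ul { list-style-type: circle; }       /* hollow bullet for nested bulleted lists */
</style>
```

Rules like these live entirely on the blog side, so the same Word output can look different on each site that hosts it.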



Sample Lists



Below is a collection of some example lists and the corresponding HTML output.



Simple flat numbered list




  1. item 1
  2. item 2
  3. item 3
HTML:
<ol>
   <li>item 1</li>
   <li>item 2</li>
   <li>item 3</li>
</ol>

Simple flat bulleted list




  • item 1
  • item 2
  • item 3
HTML:
<ul>
   <li>item 1</li>
   <li>item 2</li>
   <li>item 3</li>
</ul>

Nested bulleted and numbered lists




  • Level 1 item 1

    • Level 2 item 1
    • Level 2 item 2

  • Level 1 item 2

    1. Level 2 item 1
    2. Level 2 item 2
HTML:
<ul>
   <li>Level 1 item 1
      <ul>
         <li>Level 2 item 1</li>
         <li>Level 2 item 2</li>
      </ul>
   </li>
   <li>Level 1 item 2
      <ol>
         <li>Level 2 item 1</li>
         <li>Level 2 item 2</li>
      </ol>
   </li>
</ul>

Multilevel List




  • level 1

    • level 2

      • level 3
HTML:
<ul>
   <li>level 1
      <ul>
         <li>level 2
            <ul>
               <li>level 3</li>
            </ul>
         </li>
      </ul>
   </li>
</ul>

Nested paragraphs




  • Item 1

    Some text.


  • Item 2

    Some text.

HTML:
<ul>
   <li>Item 1
      <p>Some text.</p>
   </li>
   <li>Item 2
      <p>Some text.</p>
   </li>
</ul>

Nested paragraphs (w/o spaces)




  • Item 1

    Some text.


  • Item 2

    Some text.

HTML:
<ul>
   <li style="margin-top:0px;margin-bottom:0px">Item 1
      <p style="margin-top:0px;margin-bottom:0px">Some text.</p>
   </li>
   <li style="margin-top:0px;margin-bottom:0px">Item 2
      <p style="margin-top:0px;margin-bottom:0px">Some text.</p>
   </li>
</ul>
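
For comparison, the same tight spacing could in principle come from a stylesheet rule instead of inline styles. The following is only a sketch of that alternative: the `tight` class name is hypothetical and is not part of Word's output.

```html
<style>
  /* "tight" is a hypothetical class name, not something Word outputs */
  ul.tight li,
  ul.tight li p { margin-top: 0; margin-bottom: 0; }
</style>

<ul class="tight">
   <li>Item 1
      <p>Some text.</p>
   </li>
   <li>Item 2
      <p>Some text.</p>
   </li>
</ul>
```

The inline form shown above has the advantage of working on any blog without touching its stylesheet, which is presumably why a blogging client would prefer it.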


Comments are welcome

Any comments or questions are welcome.

Comments (22)

  1. Björn says:

    Hm, are all the ol in the sample correct or just some typos? E.g. shouldn’t the Simple flat bulleted list be an ul instead of an ol?

  2. BrianJones says:

    I didn’t get a chance to talk with Zeyad yet, but it looks like that was just a mistake. Thanks for pointing it out, Björn!

    I’ve updated it to have <ul> for the bulleted lists.

    -Brian

  3. Mike says:

    It always amazes me to no end that such an open discussion blog should very consistently leave the interesting bits out of the discussion. What could be the reason for that?

    Case in point: what happens to fonts when this XHTML is rendered by an agent that is either non-Windows, or even a Windows box without the Office 2007 fonts installed?

    I know you can choose Arial in Word, but since there are 5 C fonts, with a clear incentive to use them in your formatting, I wonder what happens. What is the fallback scenario? I mean, Adobe embeds fonts in PDF, but what good is XHTML for that?

    If you are using XHTML to turn it into a Windows Office 2007-only compliant community, then that’s very stupid indeed.

  4. BrianJones says:

    The current behavior is that unless the user specifies a font directly, then it is left unspecified in XHTML so that the blog’s CSS can control the look.

    Would you like a different behavior instead?

    Mike, just because something is a top priority for you to see discussed doesn’t mean it is for everyone else. If there are subjects you’d like to see covered, just let me know though and I’ll try to get something posted.

    -Brian

  5. Mr. Shiney says:

    In every version of Word I’ve ever used (through Office 2003) the built-in numbered lists feature is broken. There’s even a very lengthy article from a Word guru about why on ExpertsExchange (sorry I don’t have the url handy) that is pretty definitive. The ONLY way I’ve found to have working numbered lists in Word is to use sequence fields … and such is the workaround of choice with other shops I’ve talked to. Anyway, that leads to my question … a two-parter: 1) Do you know if lists are still broken in Word 2007? 2) Does your XHTML by any chance support list numbering through sequence fields? In regards to part 2 of my question, it seems that it would be relatively easy to recognize the sequence fields and convert to the HTML structured tags…

  6. Zeyad Rajabi says:

    Mr. Shiney, we have made some significant improvements to our bullets and numbering feature for Office 2007. Please stay tuned to Joe Friend’s blog http://blogs.msdn.com/joe_friend/ to get more details on what we have done. As for lists created from sequence fields, the blogging feature only supports a small subset of the total number of features in Word 2007. One feature that we do not support is sequence fields. Instead of outputting sequence fields as <ul> or <ol> lists, we will simply output paragraphs with the appropriate number of non-breaking spaces.

    Zeyad Rajabi (MS)

  7. Björn says:

    Zeyad, "Instead of outputting sequence fields as <ul> or <ol> lists we will simply output paragraphs with appropriate number of non breaking spaces." does not sound very promising in regards to semantics – yap, I hate it to be that emotional about the HTML I produce (or produce through applications) but we all have to bear our little burden :]

  8. Mike says:

    The whole point of blogging using Word 2007 as a formatting tool is to get it rendered with full-fidelity anywhere it’s viewed. Otherwise, why even bother?

    There is only one case when this fidelity does not matter, it’s when you type text without font formatting at all. I guess it’s only a slice of users.

  9. Chris_Pratley says:

    Mike, I think there are many more points to using Word for blogging than 100% WYSIWYG everywhere. Background spelling/grammar, autocorrect, local save, auto-recover, inline images and so on are some of many reasons.

    Back in Word2000 we believed that most people wanted Word to produce WYSIWYG HTML output. We thought of the browser as "electronic paper", and so went to great lengths to achieve the effect of reproducing on the screen what Word would have put on paper. We then got pilloried by large numbers of technical folks for doing that. Lots of comments about mangling HTML, etc. Yet the things we did were a) legal in HTML/CSS at the time (bugs notwithstanding), and b) required to get as close to 100% WYSIWYG on all the various sorts of browsers (netscape 3+ and IE 3+) we targeted. Yes some of the things were uglier than they had to be but no one is perfect.

    People had  conspiracy theories about us co-opting HTML as a format often because of some of the ugly stuff, while people in our hallways had T-Shirts saying "HTML sucks but we’re doing it anyway" (meaning HTML didn’t have the capabilities we needed to render office documents correctly but we were going to do our best because we felt it was so important). For a time HTML was even going to be the default format for Office 2000 apps because we were too caught up in the internet buzz of 97-99 too. But that didn’t last long as HTML’s limitations rapidly became clear as we investigated more.

    We also heard from people then and also recently who told us they just want Word to output the structure and content of the HTML, and let the formatting be controlled by something downstream, such as the blog site or the browser. So that’s what we’re doing. Are you really going to tell us to go back to the approach we’ve taken flak over for so many years?

    Be careful what you ask for when you say "full-fidelity". End users have high expectations when we claim full fidelity, so to us that really means full fidelity, and it comes with a lot of baggage as we have to get everything right visually no matter what contortions the HTML has to go through. It’s a lot of work and we don’t get any love for it.

  10. dodo says:

    Chris, all good points! Too bad Mike missed them.

  11. Mike says:

    "I think there are many more points to using Word for blogging than 100% WYSIWYG everywhere. Background spelling/grammar, autocorrect, local save, auto-recover, inline images and so on are some of many reasons. "

    Try Firefox. Two extensions added to v 1.5 and you are good. Free. Or may be you are assuming customers of Word 2007 are ignorant of what’s available out there?

    "People had  conspiracy theories about us co-opting HTML"

    Hehe. Because Microsoft never heavy-handed HTML, haven’t they? Isn’t DHTML its own standards (i.e. a proprietary file format used by a single vendor)?

    You managed to write a lengthy paragraph totally off-topic. I as a user want my blog to appear with full-fidelity. Period. It encompasses fonts. It’s a very old problem. Adobe fixed it by embedding fonts. Microsoft tried to co-opt web font embedding back with IE4. I honestly don’t know right now the mess that it still is (pending patents, sub-licensing, …), but I do know Microsoft is responsible for web font embedding remaining in such a sorry state since 1997.

    If you don’t provide full-fidelity, it’s the equivalent of writing a program on your computer and be unable to have it run on someone else’s computer.

  12. Andrew says:

    Mike, your point was that they didn’t provide "full fidelity" and that that was a problem.  Seems to me that Chris directly addressed that point.  In fact, your criticisms of past practices of different Microsoft teams is what is off-topic.

    Personally, I have no need for "full fidelity" in a blog feature, especially if it requires contorted HTML.  Most decent blogging software I’ve seen already has a CSS, and if you were not using blog software, you could create your own CSS.

    I can certainly see why some people might want full fidelity, though.  Maybe some sort of option would be a good idea, assuming enough people need this capability to make the effort worthwhile.

    I will note that, if the only way to accomplish this is through all kinds of hacked-up HTML, maybe the less-functionality but better-standardized code is really the right way to go.

  13. Mike says:

    Andrew,

    You are not living in 2007. You are satisfied with the way computers worked in early 90s. That’s fine. But I don’t expect that from software, especially if I have to pay for it.

  14. Andrew says:

    I’m not living in 2007?  You have a firm grasp on the obvious. 😉

    I admit I’m a bit confused by your response.  What is it about my point (really Chris’s point, since I think I was really echoing him) that is not applicable to 2007?  What is my 1990s expectation that is unrealistic of 2007?

  15. Peter Sefton says:

    I have been contributing comments here about HTML export and lists for a while now.

    Please don’t try to provide full fidelity rendering to HTML – as others have noted that is not a sensible goal. But please do something to let users produce decent HTML. As Zeyad notes it is too hard to map arbitrary Word lists of all kinds to HTML. But using styles is (relatively) easy.

    More from me here: http://localhost:8000/documents/blog/drafts/more_word_export.htm

  16. Peter Sefton says:

    Oops – I posted a local URL in the last comment.

    Should be: http://ptsefton.com/blog/2006/07/14/more_word_export

  17. MarkTreveil says:

    I feel sorry for Zeyad on this blog. We have spent 5 years using existing Word versions to capture excerpts of HTML to be combined into XHTML pages, and it seems to us you are definitely on the right lines here.

    Fidelity is definitely not our issue.  What is crucial is being able to get AAA-compliant XHTML output, where we can override all the styling with our own CSS.  

    We are dual-purposing – we build a ‘beautiful’ Word document and in parallel a web page from the same content, entered via Word. The web page must slot in consistently with the site as a whole, including Word excerpts from other users. Hence consistency is our issue.

    The fundamental problems we have with old-style Word HTML are:

    1) Lists aren’t lists, and don’t line up properly if the font size is changed. Sounds like you are on the right road here, but please no inline styles.

    2) Tables. Again, loads of inline styles and fixed-position layout get in the way, plus problems with incorrect AAA markup – lack of captions, TH’s etc. For web stuff we strip all the fixed-sizing etc that current Word puts in. Trouble is, you need it for the ‘beautiful’ Word doc.

    3) Users can use styles from the gallery, but more often they apply explicit formatting. Your <b> debate is one example, but changing fonts is much more annoying. So too with custom margins and tab stops. Ideally all this sort of explicit formatting can be *easily* disabled, so that the input formatting is entirely via pre-defined styles. Tab stops are a massive problem, of course, since this concept doesn’t exist in HTML.

    My ideal solution would:

    1) Allow me to pre-determine if the editor used <b>, <strong>  or <span class=’MeSpeakingLouding’> .  There is no single answer for all applications, I suspect, although if you have to go for one, <strong> is safest.

    2) Allow me to easily eliminate all fixed-position markup from the HTML.  Maybe it is needed in the master text, but I want the option to strip it .

    3) Allow me to control all CSS style names through the Word styles. MsoNormal is not ideal!

    4) Easily lock down all explicit formatting options, tabstops, margins etc.  No select-and-apply fonts, bold, etc

    I don’t know how far you are down the road with all this – I’ve just been using Word2007 Beta 2 and the Save As HTML is just as rubbish as before, not even XHTML. But it sounds like there is fab stuff on the way. Can I find any of this in this Beta?

    Good luck!

    PS happy to do some real-world testing if you need it.  I have 10,000s of docs and web pages to test the conversion on.

  18. Zeyad Rajabi says:

    Hi Mark,

    Thanks for the feedback.

    As for playing around with the XHTML output, you can do so with Beta 2. Please be aware that there are a few bugs with our XHTML output in Beta 2. We have made many improvements since then.

    Zeyad Rajabi (MS)

  19. Denise says:

    I am working with a Word template/form that has dropdown menus.  When I save the xml doc as data only or try to use a similar stylesheet on the file, the value selected in the dropdown menu is lost.  Does anyone know where/how this value is represented and how to maintain it while stripping away wordml formatting?

    Thanks

  20. Steve says:

    I have the same problem as Denise. Can anyone help with this XML problem?