Interoperable HTML Parsing in IE9


The HTML parser is an important part of how we deliver on same markup because it plays a vital role in how the DOM is constructed.  Therefore, it also plays a big role in how any DOM API or CSS rule is applied.  While we’ve talked a lot about some of the high-profile API improvements in IE9 – getElementsByClassName, addEventListener, and so on – one important improvement we haven’t talked about is the HTML parser.

This is clearly important for developers, so we made interoperability improvements to our HTML parser in IE9 Standards Mode.  This blog post provides practical guidance on how these improvements affect your site and how to avoid pitfalls in areas where all browsers still don’t behave the same way.

innerHTML

Originally introduced as IE-proprietary APIs, innerHTML and outerHTML have gained some early traction as standards and are widely implemented by other browsers, but with some differences.  These properties are unusual among DOM APIs in that setting them invokes the parser.  In IE9 we made changes to address the most common interoperability issues.

Much of the work we did here was simplifying our behavior internally.  Prior to IE9, we took whatever input was passed to innerHTML/outerHTML and treated it as if it were the only content in an otherwise blank page (resulting in an implicit <html>, <head>, <body>, etc.).  We then attempted to merge this page back into the calling element, which sometimes resulted in an “Unknown Runtime Error.”

In IE9, we improved the behavior to support more cases while removing all occurrences of “Unknown Runtime Error.”  In cases that still don’t work, you’ll get a descriptive DOMException instead.
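If you want to handle the remaining failure cases gracefully, you can wrap the assignment in a try/catch. This is only a minimal sketch: trySetInnerHTML is a hypothetical helper name, and the exact exception type and message will vary by browser and version.

```javascript
// Hypothetical helper: assigns markup via innerHTML and reports a failure
// instead of letting the exception propagate to the caller.
function trySetInnerHTML(element, markup) {
  try {
    element.innerHTML = markup;
    return { ok: true, error: null };
  } catch (e) {
    // The post says IE9 raises a descriptive DOMException here, while
    // earlier versions raised "Unknown Runtime Error". Log e.message
    // to see which case you hit.
    return { ok: false, error: e };
  }
}
```

A caller can check the ok flag and fall back to DOM Core calls such as appendChild when the assignment fails.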

While the mainstream scenarios work pretty well across browsers, these APIs are still evolving and interop isn’t perfect in every case.  For example, the following has different results in different browsers:

var img = document.getElementsByTagName('img')[0];
img.innerHTML = "image text";

The <img> element can’t have children, so the above doesn’t work in Chrome, Safari, or IE8, and has different behavior altogether in FF3.6.   In IE9, Opera, and FF4 Beta, cases like this work as expected, and the text node is inserted properly.

To avoid problems with innerHTML, it’s a good idea to only feed it markup that can stand on its own.  For example, setting div.innerHTML = "<p></p>" is fine, because <div> and <p> can exist without each other.

For small edits, you can also use DOM Core APIs like appendChild.
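As a rough sketch of the DOM Core approach, the following builds a paragraph without invoking the parser at all. The function name appendParagraph and its doc parameter (the document that owns the parent element, normally window.document) are assumptions made for illustration.

```javascript
// Builds <p>text</p> with DOM Core calls and appends it to parent.
// Nothing here goes through the HTML parser, so the text is inserted
// literally and cannot be misread as markup.
function appendParagraph(doc, parent, text) {
  var p = doc.createElement('p');
  p.appendChild(doc.createTextNode(text));
  parent.appendChild(p);
  return p;
}
```

In a page, a call might look like appendParagraph(document, document.body, "Hello").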

Generic Elements

One request from developers is having better support for generic elements.  A generic element has the same syntax as any other element, but a tag name that isn’t defined in HTML (for example, <awesome>).  IE9 Standards Mode follows the HTML5 spec and treats generic elements much like <span> tags. This means you can add more descriptive tag names to your page and style them as you would any other element:

<awesome style="font-size: large;">IE9</awesome>

This allows you to semantically describe the content of your page without losing any of the power you have with normal elements, using the same code as you would in other browsers.
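Generic elements can also be reached from script like any other element. The sketch below assumes a browser that places generic elements in the DOM as ordinary elements (as described above for IE9 Standards Mode); styleByTagName is a hypothetical helper, not a built-in API.

```javascript
// Applies one inline style property to every element with the given tag
// name, which may be a generic name such as "awesome".
// Returns the number of elements that were updated.
function styleByTagName(doc, tagName, property, value) {
  var elements = doc.getElementsByTagName(tagName);
  for (var i = 0; i < elements.length; i++) {
    elements[i].style[property] = value;
  }
  return elements.length;
}
```

For example, styleByTagName(document, "awesome", "fontWeight", "bold") would bold every <awesome> element on the page.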

Whitespace

One change that affects almost every page is how we parse whitespace.  While IE8 removes or collapses whitespace, IE9 persists all whitespace into the DOM at parse-time.  So the following markup:

<div>
<span>IE	9</span>
</div>

Was represented in the IE8 DOM as:

div
|->span
|--->"IE 9"

And is represented in the IE9 DOM as (note the added "\n" text nodes and the preserved tab):

div
|->"\n"
|->span
|--->"IE\t9"
|->"\n"

If your site depends on the existence or non-existence of whitespace, this change has substantial impact. The document structure will contain far more whitespace nodes, so APIs like firstChild might not reference the same node they used to.  Another consideration is text node length.  Because whitespace is now preserved within text nodes, the character index within a string might be different from what you’re expecting. 

IE9’s behavior matches the HTML5 spec and interoperates with other browsers. There are ways this behavior can make your page more fragile, depending on how you use whitespace in your markup.  Here are a few suggestions for avoiding these problems:

  • For scenarios where you just want elements, use the Element Traversal APIs, such as the firstElementChild property, to ensure you don’t reference a stray newline character by mistake.
  • For scenarios where you need more than just elements, like text nodes, use explicit type-checking via nodeType or a similar API.  Depending on why you’re accessing individual characters in a text node, the split() method on JavaScript’s String object could be quite useful for isolating the parts of a string you want to examine.
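The suggestions above can be sketched roughly as follows. The helper names (firstElement, isWhitespaceText, wordsOf) are hypothetical; the node properties they rely on (firstElementChild, firstChild, nextSibling, nodeType, nodeValue) are standard DOM.

```javascript
var ELEMENT_NODE = 1;
var TEXT_NODE = 3;

// Returns the first child that is an element, skipping whitespace text
// nodes and comments. Uses firstElementChild when the browser provides it
// and otherwise falls back to a nodeType check.
function firstElement(parent) {
  if (parent.firstElementChild !== undefined) {
    return parent.firstElementChild;
  }
  var node = parent.firstChild;
  while (node && node.nodeType !== ELEMENT_NODE) {
    node = node.nextSibling;
  }
  return node || null;
}

// Returns true for text nodes that contain only whitespace.
function isWhitespaceText(node) {
  return node.nodeType === TEXT_NODE && !/\S/.test(node.nodeValue);
}

// Splits a text node's value on runs of whitespace, so "IE\t9" and "IE 9"
// produce the same list of words regardless of how whitespace was parsed.
function wordsOf(textNode) {
  return textNode.nodeValue.split(/\s+/).filter(function (word) {
    return word.length > 0;
  });
}
```

Working with words rather than character indexes keeps the code independent of whether a browser collapsed or preserved the whitespace.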

Overlapping Tags

As web developers, we don’t like to admit it, but we’ve probably all written the following markup at some point:

<b><i>important text</b></i>

Overlapped tags are a far more common occurrence than you might think, partly because they’re not always as obvious as the example above.  Take the markup below:

<p><b><div>text</div></b></p>

The <p> element can’t legally contain a <div>, so IE, Firefox, Chrome, and Safari implicitly close the <p>.  It’s almost as if you’d given this markup to the parser instead:

<p><b></p><div>text</div></b></p>

Notice that you didn’t even have to overlap your tags to end up in an overlapping tags scenario (the <b> element, in this case).  This is just one edge case — as you explore more scenarios, you’ll find that they can get pretty complex. 

If you open up the IE8 Developer Tools to inspect the markup above, you’ll see this structure:

p
|->b
|--->div
|----->"text"
div
|->"text"
p

It seems reasonable enough, but there’s actually more going on beneath the surface.  In previous versions of IE, we persist the overlapped markup more or less as written – meaning an overlapped element could occupy more than one position in the DOM tree. 

This state – called an inclusion – occasionally leads to behavior differences across browsers, especially when using script to walk the tree.  For example, calling nextSibling on the <b> element above will return the second <p> and calling firstChild will return null.  This occurs in spite of the fact that the <b> element appears to be the parent of the <div> and to have no siblings.

We improved IE9 mode to resolve such situations at parse-time to avoid these side-effects.  In any place where earlier versions of IE would create an inclusion, IE9 creates a clone of the element instead.

So the markup from the example above would exist in the IE9 DOM as:

p
|->b
b
|->div
|--->"text"
p

IE9 clones the <b> tag when it sees the implicit </p> end tag.  Thus, the DOM contains two distinct <b> elements, matching Chrome and Safari in this case.  The HTML5 algorithm (supported by FF4 Beta) differs in that it clones overlapped elements upon encountering the next text node – resulting in a slightly different DOM structure above.
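One way to see how a given browser resolved overlapping tags is to dump the parsed tree from script and compare the output. The sketch below prints a node and its descendants in the arrow notation used in this post; dumpTree is a hypothetical helper, and it assumes a tree in which each node occupies a single position (which, as described above, earlier versions of IE did not guarantee).

```javascript
// Serializes a node and its descendants as an array of lines, using the
// arrow notation from this post: children get "|->", grandchildren
// "|--->", and each further level adds two more dashes.
function dumpTree(node, depth) {
  depth = depth || 0;
  // new Array(n).join('-') yields n - 1 dashes, so depth 1 gives one dash.
  var arrow = depth === 0 ? '' : '|' + new Array(depth * 2).join('-') + '>';
  var label = node.nodeType === 3
    ? JSON.stringify(node.nodeValue)   // text node: quoted, escaped content
    : node.nodeName.toLowerCase();     // element: tag name
  var lines = [arrow + label];
  for (var child = node.firstChild; child; child = child.nextSibling) {
    lines = lines.concat(dumpTree(child, depth + 1));
  }
  return lines;
}
```

Calling dumpTree(someElement).join("\n") in each browser gives text you can diff directly.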

To avoid these problems in the first place, it’s a good idea to run your markup through the W3C’s online validator, which helps spot overlapping and misnested tags before they become real bugs.  For convenience, IE’s F12 Developer Tools have a built-in link to pass a site through the W3C’s validator.

Title Element

In IE8 and earlier versions of IE, the parser implicitly creates a <title> element whenever it encounters a <head>.  As a result, code running in IE8 can assume that head.firstChild returns a <title> element, even if the markup doesn’t explicitly declare one.

In IE9, we made an interoperability change to respect the <title> element’s position in the <head>, like other major browsers.

Much like whitespace handling, this could result in your site behaving differently in IE9 than in previous versions of IE if you write applications that depend on the first child of your <head> element always being <title>.

If you need to grab the title, a better approach is to look it up by tag name with getElementsByTagName rather than relying on its position in the <head>.
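A minimal sketch of that lookup, assuming a hypothetical helper named getTitleElement that takes the document as a parameter:

```javascript
// Returns the document's <title> element wherever it sits in the <head>,
// or null if the markup did not declare one.
function getTitleElement(doc) {
  var titles = doc.getElementsByTagName('title');
  return titles.length > 0 ? titles[0] : null;
}
```

If you only need the title text rather than the element, the document.title property is simpler still.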

Object Element

Historically, the <object> element’s behavior in IE has been rather idiosyncratic, largely due to the fact that web sites often use it to interface with native code running outside the browser sandbox.  In IE9, we’ve improved <object> parsing so it and its contents appear in the DOM like any other element.

For example, any <param> elements or fallback content inside the <object> will be persisted in the DOM, regardless of whether the <object> successfully loads.

This means that calls like the following will now work:

alert(document.getElementsByTagName('param')[0].nodeName);

You shouldn’t have to do anything special to take advantage of our new behavior – but you can now interact with <object> much like you can most other elements and like you can in other browsers.
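For instance, once <param> elements are persisted in the DOM, a script can read an <object>’s configuration directly. This is a rough sketch only; readParams is a hypothetical helper, and it assumes each <param> carries name and value attributes.

```javascript
// Collects the name/value pairs of the <param> elements inside an
// <object> element into a plain JavaScript object.
function readParams(objectElement) {
  var params = objectElement.getElementsByTagName('param');
  var result = {};
  for (var i = 0; i < params.length; i++) {
    result[params[i].getAttribute('name')] = params[i].getAttribute('value');
  }
  return result;
}
```

A call such as readParams(document.getElementsByTagName("object")[0]) would return the parameters of the first <object> on the page.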

While these changes may seem less important than adding or changing an API, the impact on real web development is substantial. If you’re a developer, try your site in the latest Platform Preview and look for any problems resulting from the changes above.

As always, please send your feedback via Connect or the comments section.

Thanks!

Jonathan Seitel

Program Manager

Comments (23)

  1. Anonymous says:

    Wait, those diagrams are ambiguous. Did you mean

    div

    |->span

    |--->"IE 9"

    or

    div

    |->span

         |->”IE 9”

    ?

  2. Anonymous says:

    Hello! does this mean that internet explorer 9 will have images/pictures showing up faster?

  3. Anonymous says:

    > The <img> element can’t have children, so the above doesn’t work in Chrome,

    > Safari, or IE8, and has different behavior altogether in FF3.6.   In IE9, Opera, and

    > FF4 Beta, cases like this work as expected, and the text node is inserted properly.

    Inserted where? In the <img> element where it can't go?

  4. Anonymous says:

    Reading the IEBlog is like watching Elaine dance on Seinfeld.

  5. Anonymous says:

    Why not just implement the HTML5 parsing algorithm?  That's the best road to standards compliance and interoperability.

  6. Anonymous says:

    @Brianary:

    The text node is intended to be a child of the <span>.  I apologize if my ASCII art was unclear — I used whitespace to show nesting when I was first writing the post, but eventually switched to arrow length because I thought it looked better.

    @Dave:

    Yes, according to HTML5, the text node should be a child of <img>.

  7. Anonymous says:

    IE definitively needs some work around the .innerHTML, glad your team is recognizing it.

    Does this mean web authors will finally be able to use .innerHTML on a <select> object or a <tr> object?

    Thanks

  8. Anonymous says:

    Is this going to be as fast as FireFox?

  9. Stilgar says:

    I second Mathieu Pellerin's question. Will innerHTML work with tables?

  10. Anonymous says:

    Huh, using "generic elements" just because it works is a terrible idea. If people start using that instead of class="", it won't be possible to add any new elements to HTML because they will clash with existing content.

  11. Anonymous says:

    > Why not just implement the HTML5 parsing algorithm? That's the best road to standards compliance and interoperability.

    +1

  12. Anonymous says:

    As Philip said, although it's good to have Interoperability with generic elements, their use should not be encouraged. And, in HTML5, they're invalid.

    Will an 'awesome' element be an instance of HTMLUnknownElement?

  13. Anonymous says:

    "The <img> element can’t have children […]"

    Does the html5 spec really forbid contents of HTMLImageElement?

    It inherits innerHTML from HTMLElement and dev.w3.org/…/Overview.html says:

    "The contents of img elements, if any, are ignored for the purposes of rendering."

    This implies content is allowed and I cannot find where the spec says otherwise.

    In html4 the DTD forbids content of img elements (http://www.w3.org/…/objects.html).

    I think this should be also with html5, as content would not make sense anyway when it never is rendered.

    So, is it a feature or a bug of html5? If it is a feature, what purpose does it serve? Backward compatibility with invalid html4 markup?

    I assume it is a feature which i dont understand, as the algorithm given in dev.w3.org/…/Overview.html could be easily modified to raise exceptions when content is inserted where it is not allowed, if it is not allowed.

    Somebody shedding light on this topic would be highly appreciated,

    g

  14. Anonymous says:

    “This allows you to semantically describe the content of your page”

    To cite Humpty Dumpty: “When I use a word it just means what I choose it to mean”.

    Alice, me, all browsers and all search engines won’t be able to understand your proprietary semantics. And worse, they might clash with standard semantics of the future.

    What a bad bad bad suggestion! Like nobody has ever thought about it. And as if adding a simple class wouldn’t  solve every use case.

  15. Anonymous says:

    > Stilgar

    > I second Mathieu Pellerin's question. Will innerHTML work with tables?

    Well, they say, they're investigating this: connect.microsoft.com/…/582525

  16. Anonymous says:

    In your section about Overlapping Tags, there are some pretty nasty cases (infinite loops during document order traversals) that arise from the parsing behavior in IE8 and earlier. See bryanmcquade.com/…/dom_traversal for an example. Thank you for fixing this broken behavior in IE9.

  17. Anonymous says:

    @Sarah  On the contrary, it allows for new semantic description standards to be developed. For example, perhaps a search engine wanted to allow developers to better semantically describe a page so that, say, the site's logo could be deciphered:

    <logo><img src="newlogo.png"></logo>

    Furthermore, it allows for backwards compatibility should a future extension to HTML5's standard semantic tags (header, footer, nav, etc.) be drafted in the W3C. Newer browsers (or other user agents, such as search engines) would recognize the new tag's semantics while older browsers ignored it.

  18. Anonymous says:

    With all this talk about "same markup" I really wonder why you didn't implement the whole HTML5 parser. Maybe you did this for Beta1/Preview5 (I hope, but doubt it), but as of now,

    * innerHTML still doesn't work on any select (Connect#571341) or table (Connect#582525) related elements.

    * innerHTML returns incorrect case of element and attribute names (Connect#584933, #584531).

    * name attributes on generic Elements are ignored (Connect#557785).

    * Incorrectly nested elements create an unexpected DOM and rendering (Connect#582974).

    Result: Web authors all over the world want you to Implement the HTML5 parsing algorithm (Connect#584766).

    And so many small issues are simply marked “won’t fix.” /sigh

    Still looking forward to the next preview though.

  19. Anonymous says:

    @badger: No you can’t. To develop a standard, well, you need to go through standardization. That’s not a one man thing. See Standardization in Wikipedia. You could participate at the WHATWG or W3C if you wanna do that. They are very successful.

    Your logo example can either work for just one search engine vendor (with damaging side effects perhaps) or for all. That’s the difference between proprietary extensions to a standard or evolving of the same standard.

    To see the seriousness of this: IE9 will just do what IE should have done in the first place with unknown elements. IE<9 has hindered the standard development a lot by its parsing of unknown elements not according to how it is defined. Because of this the WHATWG has a hard time with introducing new elements.

    So it is good that it will now be easier to invent new elements, but nobody should do it on their own. That would be like one step forward, two steps back.

  20. Anonymous says:

    @Jonathan Seitel – as noted by many people here in the comments on this post as well as pretty much every post since it started the fixes you noted are all worthy and appreciated but the 2 places developers want the support most is setting the innerHTML on Select and Table elements (and all children of Tables)

    I'm quite shocked that you made a post about fixing innerHTML in IE – yet completely neglected to touch on the above bugs to discuss a timeline for when we can expect a fix.

  21. Anonymous says:

    I'm quite shocked that the trolls haven't gotten bored yet and found something else to do.

  22. Anonymous says:

    What about inheritance of old HTML4 styling attributes in tables?

    IE is apparently doing some hackery there, and some sites rely on it, making life harder for the other browser makers:

    bugzilla.mozilla.org/show_bug.cgi

  23. Anonymous says:

    Just downloaded IE9Beta, and confirmed <select>.innerHTML is still broken.

    It's killing me to witness IE9 shaping up to be a very interesting piece of technology while at the same time failing to fix decade old bugs…