I found the reactions on Slashdot (and my blog) to be rather fascinating.
First off, I'm not the person who did the research. That was Michal, and his work was (as usual) quite good. I'm flattered that some people thought that I was something more than the messenger, but...
The first poster on Slashdot, Eponymous Cowboy, got it totally right, and I'm really happy about that. His comment was (paraphrased):
Since it may not be obvious to all readers, be aware that when you can make a program crash by feeding it bad data, you can typically further manipulate the data you are sending it to take control of the program. That means a security hole. This is how buffer-overruns work. You can't always do it, but you can think of each way you can crash a program as a little "crack" in its exterior. If you can figure out a way to pry apart the crack, you've got yourself a hole.
He's totally right. And the number of people who seem to believe that it's better to crash than to fail on invalid input distresses me immensely.
Some of the Slashdot comments indicated that Michal's test was somehow "unfair" because it tested things like null characters in the middle of strings. The thing is, the bad guys are going to try that too. And if your browser can't deal with it, then...
If we lived in a world where the bad guys were forced to write valid code, then that attitude would probably be ok, but that's not the world in which we live. The bad guys aren't going to use valid HTML to attack your web browser. They're going to use invalid HTML. So it's critical that your browser deal with both valid AND invalid HTML.
Your browser doesn't always have to render the invalid HTML, but it needs to deal with it in SOME reasonable manner. If it drops the connection and throws up a "This is not strict HTML, I won't render it" dialog box upon seeing its first invalid sequence, that's ok (it may not be a useful customer experience, but that's a different issue). But it can't crash.
If it crashes, then you may have just let the bad guys own your customer's machine. At a minimum, they're going to be able to launch a DoS attack against your customers. At worst, they're going to own your customers' machines.
One of the basic tests that MUST be performed on any exposed interface is what is known as "fuzzing". Michael Howard and David LeBlanc wrote about it in their book "Writing Secure Code", but essentially fuzzing an interface means taking the interface definition and deliberately varying the input in ways the definition doesn't allow.
So if I wanted to fuzz HTML, I'd start with legal HTML and inject invalid characters into the input: null characters, characters outside the ASCII (0-127) range, illegal UTF-8 sequences. I'd then try arguments to ATOMs that were really long (longer than 64K, for example). I'd try eliminating trailing tags, or nesting tags really, really deeply. Etc.
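To make those mutations concrete, here is a minimal sketch in Python. It is not Michal's tool; the seed document and every function name in it are hypothetical, chosen only to illustrate the kinds of malformed input described above.

```python
import random
import re

# Hypothetical seed document: any small piece of legal HTML would do.
SEED_HTML = b'<html><body><a href="http://example.com/">link</a></body></html>'


def inject(data, payload, rng):
    """Insert payload at a random offset in data."""
    pos = rng.randrange(len(data) + 1)
    return data[:pos] + payload + data[pos:]


def long_attribute(length=70_000):
    """Build a tag whose attribute value is longer than 64K."""
    return b'<a href="' + b"A" * length + b'">x</a>'


def deep_nesting(depth=10_000):
    """Nest the same tag very deeply."""
    return b"<div>" * depth + b"x" + b"</div>" * depth


def drop_closing_tags(data):
    """Remove every closing tag, leaving the opening tags unterminated."""
    return re.sub(rb"</[^>]*>", b"", data)


def fuzz_cases(seed, rng):
    """Return one malformed document per mutation described in the text."""
    return [
        inject(seed, b"\x00", rng),      # null character mid-document
        inject(seed, b"\xe9", rng),      # byte outside the ASCII range
        inject(seed, b"\xc3\x28", rng),  # invalid UTF-8: lead byte, bad continuation
        long_attribute(),
        deep_nesting(),
        drop_closing_tags(seed),
    ]
```

Calling `fuzz_cases(SEED_HTML, random.Random(0))` produces six byte strings to feed to whatever parser is under test. A real fuzzer would generate many more variants of each kind, but even a list this small covers every category above.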
The interesting thing about fuzzing an interface is that it can be automated, as Michal pointed out.
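As a rough illustration of what that automation can look like, here is a small Python sketch of a fuzzing loop. Everything in it is an assumption made for the example: `toy_tag_names` is a made-up target with a deliberately planted bug, `InvalidInput` stands for a controlled rejection, and any other exception stands in for a crash. A harness for a real browser would instead run the target in a separate process and watch for abnormal exits.

```python
import random


class InvalidInput(Exception):
    """Controlled rejection of bad input: the acceptable way to fail."""


def toy_tag_names(data):
    """Hypothetical target. Planted bug: assumes every '<' has a later '>'."""
    if b"\x00" in data:
        raise InvalidInput("null byte in input")  # deliberate, clean failure
    names = []
    i = data.find(b"<")
    while i != -1:
        end = data.index(b">", i)  # raises ValueError when '>' is missing
        names.append(data[i + 1:end])
        i = data.find(b"<", end)
    return names


def mutate(data, rng):
    """Overwrite, delete, or insert one random byte (data must be non-empty)."""
    buf = bytearray(data)
    pos = rng.randrange(len(buf))
    choice = rng.randrange(3)
    if choice == 0:
        buf[pos] = rng.randrange(256)
    elif choice == 1:
        del buf[pos]
    else:
        buf.insert(pos, rng.randrange(256))
    return bytes(buf)


def fuzz(target, seed, iterations, rng_seed=0):
    """Feed mutated copies of seed to target; collect inputs that 'crash' it."""
    rng = random.Random(rng_seed)
    crashes = []
    for _ in range(iterations):
        case = mutate(seed, rng)
        try:
            target(case)
        except InvalidInput:
            pass  # rejected cleanly, which is fine
        except Exception as exc:  # anything else counts as a crash
            crashes.append((case, exc))
    return crashes
```

Running `fuzz(toy_tag_names, b"<html><body><p>hi</p></body></html>", 2000)` returns the mutated inputs that triggered the planted bug, each paired with the exception it raised. The distinction the loop draws is the one that matters: an input that gets rejected is uninteresting, and an input that blows up is a bug report.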
The other thing about fuzzing is that it should be a basic part of any security tester's arsenal. Your code can't be considered to be secure unless its external interfaces have been fuzzed.