There’s an interesting debate going on in the W3C HTML working group about whether well-formed HTML is important in the specification process for HTML5. Intuitively, I feel that well-formedness is a valuable goal, but when it comes to explaining why it matters I’m finding it hard.

Which of the following is “better”:

normal<b>bold<i>bolditalic</b>italic</i>normal

or

normal<b>bold<i>bolditalic</i></b><i>italic</i>normal

The first is shorter (and works in all the popular web browsers) while the second is well-formed. Well-formedness isn’t about being smaller. It’s also not about performance: it turns out that the parsers in browsers often process certain non-well-formed mark-up faster than if it had been well-formed.
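To make "well-formed" concrete, here is a minimal sketch of a nesting check, written in Python with the standard library's html.parser. The names NestingChecker and is_well_nested are made up for illustration, and the check assumes every tag is paired (void elements such as an unclosed br would be reported as errors), so treat it as a sketch rather than a real validator.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Hypothetical helper: tracks open tags on a stack and records any
    end tag that does not match the innermost open tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.errors = []     # end tags that arrived out of order

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            # The end tag closes something other than the innermost element.
            self.errors.append(tag)


def is_well_nested(markup):
    """Return True if every start tag is closed in strict LIFO order.

    Assumption: all tags are paired; void elements are not special-cased.
    """
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    return not checker.errors and not checker.open_tags


overlapping = "normal<b>bold<i>bolditalic</b>italic</i>normal"
nested = "normal<b>bold<i>bolditalic</i></b><i>italic</i>normal"

print(is_well_nested(overlapping))  # False
print(is_well_nested(nested))       # True
```

The first example fails because the end tag for b arrives while i is still the innermost open element; the second passes because each element closes before its parent does.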

Since browsers have to parse both alternatives, and the HTML5 process is about ensuring that they do so in a predictable and interoperable way, should the spec give any weight to well-formed documents? After all, the spec doesn’t prevent you from choosing to be well-formed if you want to.

The analogy I’ve been considering is indentation in C++ source code: few people would write C++ without a sensible indentation strategy to help make the code readable. Yet the C++ spec doesn’t need to say anything about indentation – it’s a best practice but not a formal part of the language definition. Could writing well-formed HTML be a best practice that’s not a formal part of the language definition?

posted on Sunday, November 16, 2008 6:44 PM
Comments
# re: Well-formed mark-up?
Posted by patrick h. lauke on 11/17/2008 10:21 AM

to me, it's not so much akin to indentation, but proper opening and closing of brackets to define logical blocks of code. if browsers have built-in coping mechanisms to deal with real-world non-well-formedness (malformedness?), more power to them...but i believe that it's an important aspect nonetheless.
# re: Well-formed mark-up?
Posted by Adrian Bateman on 11/17/2008 8:37 PM

Patrick: For some reason I believe that it's an important aspect too. I just can't articulate the importance. I kind of like the analogy of brackets - if compilers had a uniform way of inserting missing brackets would you want to leave them out? Probably not but then if that was the case would you penalise people that did choose (even accidentally) to miss a bracket where it still gave them the result they wanted?

What properties does a well-formed document have that make it more useful than one that isn't well-formed?
# re: Well-formed mark-up?
Posted by Laurens Holst on 11/19/2008 5:30 AM

Actually, if you’re making an analogy to programming languages, all of them use very strict syntax checking. So if you take programming languages as an example, that would definitely argue against HTML5’s ‘philosophy’ of allowing syntax errors. Which we would be better off without, IMO.

I think generally, lax error handling is bad software engineering practice. It leads to stupid, unforeseen bugs, which would have been caught otherwise. E.g. the case of people not escaping text content, relying on HTML’s ability to recover from & and < being used without being escaped, creates issues with edge-cases (e.g. ben & jerry works everywhere, but ben&jerry breaks in IE) that are hard to find just by testing.
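As a small illustration of escaping text content rather than relying on the parser to recover, here is a sketch in Python using the standard library's html.escape. The function name render_paragraph is hypothetical and only there for the example.

```python
from html import escape


def render_paragraph(text):
    """Hypothetical helper: wrap plain text in a paragraph element,
    escaping &, < and > (and quotes) so the parser never has to guess."""
    return "<p>" + escape(text) + "</p>"


print(render_paragraph("ben&jerry"))    # <p>ben&amp;jerry</p>
print(render_paragraph("ben & jerry"))  # <p>ben &amp; jerry</p>
print(render_paragraph("1 < 2"))        # <p>1 &lt; 2</p>
```

With the text escaped, both spellings of the example produce an explicit character reference, so there is no edge case left for a browser's error recovery to handle differently.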
# re: Well-formed mark-up?
Posted by Adrian Bateman on 11/20/2008 8:39 PM

Laurens: The problem is that we already have lax handling of HTML. We also have a solution: it's called XHTML and people are free to choose that. HTML5, though, has chosen a path to define how content that is already out there is interpreted. If we define a standard that doesn't incorporate much of the content on the web today, who would ever write a browser that adhered to the standard? They'd be blocking too much content to be considered credible.

People who want to avoid those stupid, unforeseen bugs can use well-formed mark-up as a best practice. They can go so far as to use XHTML, validate that XHTML, and if they serve it as text/html still get good results across browsers. Sure it's not ideal but we are where we are.
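For anyone who wants that stricter checking as part of their own workflow, one rough approach is to run mark-up through an XML parser, which rejects anything that isn't well-formed. The sketch below uses Python's xml.etree.ElementTree; the function name is_well_formed_xml and the wrapping root element are assumptions made for the example, and it checks well-formedness only, not validity against the XHTML DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(fragment):
    """Hypothetical helper: return True if the fragment parses as XML.

    The fragment is wrapped in a made-up <root> element so that mixed
    text and elements can be parsed as a single document.
    """
    try:
        ET.fromstring("<root>" + fragment + "</root>")
    except ET.ParseError:
        return False
    return True


print(is_well_formed_xml("normal<b>bold<i>bolditalic</b>italic</i>normal"))         # False
print(is_well_formed_xml("normal<b>bold<i>bolditalic</i></b><i>italic</i>normal"))  # True
print(is_well_formed_xml("ben & jerry"))                                             # False
```

An XML parser stops at the first error instead of recovering, which is exactly the strictness that HTML parsing in browsers does not have.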

If we have a standard that defines interoperably how incorrect mark-up should be treated then it seems like the only value to well-formed documents is in their ability to be read and understood by people. The machines will process them either way. That's why I suggest maybe it's just a best practice. It helps people read documents just like indentation helps people read code.