Tonight we're gonna parse like it's 1997

Via Sean McGrath comes a reasonably lucid and comprehensible redux of the argument about whether or not the XML standard should (or should have) stipulated draconian error handling. I hope I'm not misrepresenting Avery when I boil a lot of it down to his three broad "real-world" examples:

  1. Not well-formed XML, produced by a legacy application that takes ages to fix, is rejected by a draconian parser
  2. Not well-formed XML is accepted by a permissive parser
  3. Well-formed XML is accepted by a draconian parser

and I hope he's also happy for me to then state that his argument consists broadly of the suggestion that 1 and 2 together are more likely than 3, hence permissive parsers obtain for you the lion's share of the "real world" parsing instances; or, if you prefer, via a slightly more complicated profit-and-loss argument, that making your parser permissive, and sanctioning permissive parsers, incurs a lower overall cost (lumbering us with poor legacy applications, the burden divided up among all the parsing events) than having to fix those legacy applications.
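
To make the draconian/permissive distinction concrete, here's a minimal sketch in Python (my own illustration, not anything from Avery's post): the standard library's strict XML parser throws the document back at you if it isn't well-formed, while an HTML-style parser soldiers on through the same markup and builds whatever structure it can guess at.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# A typical real-world mistake: an unclosed <title> and a bare ampersand.
broken = "<feed><title>Fish & Chips<title></feed>"

# Draconian: the parser refuses the document outright.
try:
    ET.fromstring(broken)
except ET.ParseError as err:
    print("rejected:", err)

# Permissive: an HTML-style parser accepts the same input without complaint
# and reports whatever tags it thinks it saw.
class TagSniffer(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("opened:", tag)

TagSniffer().feed(broken)
```

One side gets an unambiguous error to fix at the source; the other gets something, but exactly what that something is depends on the parser's guesswork.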

However, an application of Postel's Law to the process of implementation should not be confused with being able to apply it to the original specification. And besides, do those examples really portray the real world, nearly twelve years after the argument first took place? How much XML is out there, how much of it is bad XML, and how much of it remains bad XML for long enough to cause a problem? I don't think it's clear that draconian error handling in the wild has held back RSS syndication, Google Maps, web services, or RDF; rather, beyond a certain tipping point (say, 2002?), it has ensured their rapid takeup (with the possible exception of RDF until recently, for its own reasons).

XML is unbelievably popular today, so popular and routine in its use that you almost don't know it's there in most applications. And I think, purely from my own experience, it's plausible to suggest that that's at least partly because consumption of XML is easy; in turn, this is because basic production quality is enforced.
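
As a rough illustration of what "easy" means here (my own sketch, with an invented feed rather than any particular real one): once well-formedness can be taken for granted, consuming a document is a few lines of standard-library code, with no repair work and no guessing at what the producer meant.

```python
import xml.etree.ElementTree as ET

# A well-formed, RSS-flavoured snippet, invented for illustration.
feed = """\
<rss version="2.0">
  <channel>
    <title>Example weblog</title>
    <item><title>Tonight we're gonna parse like it's 1997</title></item>
    <item><title>Postel's Law, revisited</title></item>
  </channel>
</rss>"""

# Nothing to clean up or second-guess: parse it, walk it, done.
channel = ET.fromstring(feed).find("channel")
print("Feed:", channel.findtext("title"))
for item in channel.iter("item"):
    print(" -", item.findtext("title"))
```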

HTML (SGML) is a format (specification) that, because of its messiness (its complex rules), its parsing permissiveness (its potential for misunderstandings), and a whole host of cultural reasons (ditto), was terribly hard to write reliable consumption software for. Even now, there are around half a dozen good browsers, and in part that's surely because of the entry barrier to writing browsers: permissive parsing of real-world mistakes remains a complex task.

I also have dim and partial knowledge of SGML in the old-skool publishing industry, where a licence for fully-featured SGML software could set you back tens of thousands of pounds six or seven years ago, and that price didn't seem to be heading down under market pressure. In comparison, XML parsing is cheap, easy, and ubiquitous. There are free and open-source CMS and blogging packages that can do it; I have access to dozens of command-line tools that can do it; publication, syndication and web-service consumption are things that happen, almost as though nature intended it that way. A lot of that must surely be down to XML's combination of rule simplicity and parser rigour. As Dave Winer says on the subject of Postel's Law and XML:

I yearn for just one market with low barriers to entry, so that products are differentiated by features, performance and price; not compatibility. Compatibility should be expressed in terms of formats, not products.... Anyway, the other half of Postel's Law is just as interesting, but so far no one is commenting. Think about it, if everyone followed the second half, the first half would be a no-op. You could be fully liberal in an afternoon or less.

Mark Pilgrim's history of draconianism versus tolerance seems to consist of a lot of tolerantists pontificating about what they've decided the "draconian" argument is: I can't believe that Tim Bray, even if he really were a lone voice, would have been such a reluctant paper tiger. But like the 1997 tolerantists, I've thus far waded in with my own interpretation of events. And despite dealing with XML on a daily basis, I find that, during so many of the tasks I have to accomplish, the XML layer fades almost completely into the background.

Of all the problems I encounter at work, XML well-formedness comes up very rarely, compared to those concerning the quality and stability of my own algorithms, application control flow, scaling and coping with heavy load, and logging and bailing out. Whether XML's ease of use in 2009 is a result of the small rule set making well-formedness easy, or of the initial decision in favour of draconian parsing (both settled back in 1997), we'll probably never be able to tell. All that's certain is that there'll always be opinions about it, and somewhere in the rambling above is mine.