Put Your Shades Away for a Minute

Matthew Mullenweg writes that aside from his Photolog, his site passes all three tests in the infamous XHTML 100 survey. It does indeed. I count one, two, now three sites that pass. Four, if you count the W3C’s own XHTML pages. Goodness, the results are getting less horrific by the minute! So, Matt — like Jacques Distler and Beandizzy before you, I salute your diligence. Matt also jokingly suggests that I start up a webring of invalid sites. To that I respond: a webring that comprises nearly the entire World Wide Web would be overkill, to say the least.

In the meantime, Phil Ringnalda has posted a nice summary of recent XHTML-related events. Phil’s take on XHTML MIME-types boils down to this: he knows quite well that text/html has been given the W3C’s official stamp of glaring disapproval (a “SHOULD NOT”). However, he still chooses to serve up text/html to all browsers for a couple of reasons, which I will summarize and hopefully not mangle:

  1. He’s scraping his own site, and the parsing is easier if he uses valid XML.
  2. If he or someone else posts anything to his site that is invalid, XML-aware browsers will choke.

Therefore, it makes no sense for him to use application/xhtml+xml, because that MIME-type provides no benefit given what he’s actually choosing to do with his XHTML.1
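
For the record, here is a minimal sketch of the alternative: a CGI script that sends application/xhtml+xml only to clients whose Accept header claims to want it, and text/html to everyone else. (The header test is deliberately naive, ignoring q-values, and the page variable is a stand-in.)

    #!/usr/bin/perl
    # Minimal sketch: negotiate the MIME-type per client. Clients that
    # advertise application/xhtml+xml in their Accept header get the XML
    # MIME-type (and its draconian error handling); everyone else gets
    # plain old text/html.
    use strict;
    use warnings;

    my $page   = '...';                  # the XHTML document, built elsewhere
    my $accept = $ENV{HTTP_ACCEPT} || '';

    # Naive test: a real implementation should honor q-values, so that
    # "application/xhtml+xml;q=0" is not mistaken for acceptance.
    my $type = $accept =~ m{application/xhtml\+xml}
             ? 'application/xhtml+xml'
             : 'text/html';

    print "Content-Type: $type; charset=utf-8\n\n";
    print $page;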

Phil’s first reason is interesting — I hadn’t considered this possibility. Clearly the oft-touted promise that “XHTML will allow us to use off-the-shelf XML parsers on the web” is an idea straight out of Wolkenkuckucksheim (cloud-cuckoo-land). But Phil is restricting his parsing to a known subset of valid XHTML — his own site. Hard to argue with that.
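
Here is a toy version of the self-scraping trick, for the curious: pull every link out of a known-valid page with a stock XML parser. (The filename is made up; Phil’s actual scripts are surely fancier.)

    #!/usr/bin/perl
    # Because the page is known-valid XHTML, an off-the-shelf XML parser
    # can extract data from it directly -- no tag-soup heuristics needed.
    use strict;
    use warnings;
    use XML::Parser;

    my @links;
    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat, $element, %attrs) = @_;
                push @links, $attrs{href} if $element eq 'a' && $attrs{href};
            },
        },
    );

    # parsefile() dies at the first well-formedness error -- which is why
    # this only works on a site you keep valid yourself.
    $parser->parsefile('archive.xhtml');
    print "$_\n" for @links;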

As for the second part, the someone else is the key bit. It’s hard enough to keep your own XHTML pages valid, but when you have to clean up after other people too… ugh. I don’t have a great answer on this, although Jacques Distler might — he’s been struggling with this issue for weeks now. He’s finally compiled a big piece of the solution all in one place. If you’re an XHTML blogger, you need to read this article.

But Jacques is the type who is willing to hack into the MTValidate plugin to make sure that it validates properly. Not I. All I can do is reiterate what I’ve been mumbling for a while now: XHTML is hard. XHTML was designed to produce code that is stricter and more machine-friendly than its humbler predecessor, HTML. The machines don’t care if the error came from you or from some random person who sent you a Trackback. Either way, the parser will choke.
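
To see just how indifferent the machines are, feed expat a page containing one stray ampersand, the kind of thing a Trackback excerpt smuggles in all the time. (The markup below is invented for the demo.)

    #!/usr/bin/perl
    # One unencoded ampersand and expat (via XML::Parser) rejects the
    # entire document, no matter who put it there.
    use strict;
    use warnings;
    use XML::Parser;

    my $page = '<html><body><p>Notes on R&D budgets</p></body></html>';

    eval { XML::Parser->new->parse($page) };
    print $@ if $@;    # "not well-formed (invalid token) at line 1, ..."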

I’m still “thinking out loud” on this whole XHTML issue, but one thing is becoming clear. Reformulating your site as XHTML requires a major rethinking of everything you are doing. Proceed with caution. Ask yourself why. Phil Ringnalda has a reason: self-scraping. Jacques Distler has a reason: embedding MathML. Put your shades away for a minute: what’s your rationale?

1. I’d argue that Phil is therefore using the correct MIME-type. A) He does have a reason for marking his site up as valid XHTML, but B) he doesn’t really care whether you perceive his site to have any “XML-ness”.

14 thoughts on “Put Your Shades Away for a Minute”

  1. “1. I’d argue that Phil is therefore using the correct MIME-type. A) He does have a reason for marking his site up as valid XHTML, but B) he doesn’t really care whether you perceive his site to have any ‘XML-ness’.”

    I think the point of sending out application/xhtml+xml is to certify that “this here is a well-formed XML document”. That means that the recipient *could* use an XML parser on it (and thus, perhaps, do neat stuff with it) if he wanted to.

    Now, Phil may not care, but if he *does* have well-formed XML (which he must, otherwise his own self-scraping tools would break), it sure would be neighbourly to tell people that.

    He may not see an immediate benefit. But, in the long run, if people start doing cool things with the (whatever little, but hopefully more and more) valid XHTML that’s out there, Phil will surely reap the benefits.

  2. I agree: sending out application/xhtml+xml is the final stamp of XHTML certification — you are broadcasting to the world, “This here *really* is XHTML. Yes, really. Just try me.” That’s what I meant by my vague term of “XML-ness”.

    I still think Phil is using the right MIME-type. Phil wants to have a site that he can parse easily. But he doesn’t want to guarantee to others that his whole site really is valid XHTML. In that case, the neighborly thing to do is to *not* stamp your site as XHTML. Better to stamp it as HTML.

  3. Let’s just keep in mind the distinction between “valid” and “well-formed”. Of course, it’s best to be valid XHTML, but really, the only *danger* of sending out application/xhtml+xml is lack of well-formedness.

    That will cause Mozilla (or, really, any XML parser, like Phil’s self-scraper) to choke and puke.

    I think well-formedness is an achievable goal. Whether Phil wants to advertise his well-formedness is his business. But, clearly, he must be achieving that much, so why not?
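
    To make the distinction concrete, here is a toy demo (the snippets are invented): both documents below are well-formed, so expat parses both without complaint, even though the first is invalid per the XHTML 1.0 Strict DTD, which forbids inline text directly inside a blockquote.

        #!/usr/bin/perl
        # Well-formed vs. valid, in miniature: expat checks only
        # well-formedness, so invalid-but-well-formed markup sails through.
        # Catching the validity error takes a DTD-aware tool like onsgmls.
        use strict;
        use warnings;
        use XML::Parser;

        for my $doc (
            '<blockquote>bare inline text</blockquote>',        # well-formed, invalid
            '<blockquote><p>properly wrapped</p></blockquote>', # well-formed, valid
        ) {
            eval { XML::Parser->new->parse($doc) };
            print $@ ? "choked: $@" : "parsed fine: $doc\n";
        }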

  4. Sadly, he may well not be: I think I’ve given you a false impression of how often and how thoroughly I self-scrape. It was the reason I switched to XHTML, not an ongoing and visible thing. I don’t actually parse all 800-odd pages daily, or reparse every page that displays a comment or trackback, every time one is added. Odds are good that I have several not-well-formed pages (not including the one where Mark Pilgrim was being cruel and adding unencoded ampersands earlier today), which I’ll fix the next time I need to parse myself. Chances are fairly good that everything’s valid, since commenters rarely use XHTML, and what they use is pretty minimal and safe.

    But requiring that I make data available in a particular format, and encourage others to do so, and only when it reaches some critical mass will someone write a cool app to take advantage of it, strikes me as being just like RDF/RSS. We’ve had two and a half years of RSS 1.0, and there are zero apps that take advantage of the RDF. We’ve had however many years of XHTML, and other than embedding MathML and SVG, I’ve yet to see anything that takes advantage of it, beyond private XML parsing and a suggestion that maybe Mozilla parses xhtml+xml faster than it does html (I don’t see enough pages with significant rendering time to know).

    View source on any TrackBack-enabled page or look at the Creative Commons recommendations and you’ll see how far we’ve come with embedding arbitrary XML from other vocabularies: the best way we have to include an RDF snippet in an XHTML page is to comment it out, making it inaccessible to XML parsers.

    XHTML+CSS was supposed to make us future-proof, but XHTML 2.0 is explicitly incompatible with 1.x. It was supposed to help our pages display on mobile devices, but now cell phones embed a full HTML browser, and handhelds ignore media types in style sheets and misrender screen styles.

    All that said (maybe you should shrink the textarea to encourage me to be brief?), I would be more than willing to serve application/xhtml+xml if it could be done safely, without my having to install a local copy of the validator. All I want is a web service that will validate arbitrary chunks of XHTML (I’ll even wrap them in a skeleton if it isn’t willing to), and a hacked up version of MT that will either validate comments, or if the service isn’t available accept them as drafts without publishing them, and validate entries or warn me that I have to be sure they are valid on my own. Given the non-trivial nature of installing your own copy of the validator, the current requirements for producing an xhtml+xml weblog seem a little steep for something that’s only of value to people who need to embed MathML or SVG for readers using Mozilla.
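
    Concretely, something like this is all I’m asking for. (The endpoint URL is imaginary, since no such service exists; the fragment-wrapping is the part I mean.)

        #!/usr/bin/perl
        # Sketch of the web service I'm wishing for: wrap an arbitrary
        # chunk of XHTML in a skeleton document and ask a remote validator
        # about it. The URL below is a placeholder, not a real service.
        use strict;
        use warnings;
        use LWP::UserAgent;

        sub wrap_fragment {
            my ($fragment) = @_;
            return join "\n",
                '<?xml version="1.0" encoding="utf-8"?>',
                '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"',
                '  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">',
                '<html xmlns="http://www.w3.org/1999/xhtml">',
                '<head><title>fragment check</title></head>',
                "<body><div>$fragment</div></body>",
                '</html>';
        }

        my $ua = LWP::UserAgent->new;
        my $response = $ua->post(
            'http://validator.example.com/check',    # imaginary service
            { fragment => wrap_fragment('<p>a comment</p>') },
        );
        print $response->is_success ? $response->content
                                    : $response->status_line, "\n";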

  5. I sympathise 100%.

    The only way this “bright, shiny future” will arrive is if using these technologies becomes easy enough and idiot-proof enough that a critical mass of people use them (and, whether they realize it or not, use them correctly).

    To put some perspective on this: I had a conversation with a friend (a former grad student, actually) who now works for the American Physical Society. They’re *big* backers of MathML. But his take on that particular technology was that it is perpetually “two years away”.

    Browser support is nearly nonexistent (Mozilla’s support is, charitably, marginal). Authoring tools are nearly nonexistent. My MT plugin, based on Paul Gartside’s program, is the first halfway usable tool I’ve found.

    So, basically, *nobody* uses MathML for anything but demo pages. Want to see the “state-of-the-art”?

    http://www.mozilla.org/projects/mathml/demo/

    Pathetic, eh?

    What does this have to do with the MIME-type you serve your blog with?

    Somebody’s gotta lead.

    We can slap all the XHTML DOCTYPEs on our sites that we want. But if it ain’t valid, if it ain’t even well-formed, then that’s all a lie. And none of the benefits, real or imagined, of these technologies will accrue.

    You may be willing to fix Mark Pilgrim’s mischief by hand. You may even be willing to click on the “validate” button and go back and re-edit your post till it validates. But that “critical mass” is not.

    If you can get your blog to the point where you feel comfortable serving it as application/xhtml+xml, then we can assume with some reasonable confidence that the technology has matured to the point where it might be usable by the masses.

    I showed how to auto-validate comments. I think you overestimate the difficulty of implementing my solution. It took me all of an hour to get the basic setup working. I then spent several more hours tweaking the templates and squashing bugs in the validator. As you can see from the timeline of the posts on my blog, I started late-afternoon on April 30, and pretty much had the whole thing in final form a day later.

    I’d love to have a solution for validating my own posts too. Kung-Log has some pretty strange behaviour vis-à-vis encoding of XML entities, and my writeup (http://golem.ph.utexas.edu/~distler/blog/archives/000155.html) took a ridiculous number of iterations before I got it “right”.

    It’s the Joe Ringnaldas of the blogosphere who will make this stuff work. If you don’t think it’s worth the challenge, ain’t nobody else gonna do it.

  6. Jacques: I would like to think that stamping a document with “application/xhtml+xml” certifies to the world that your document is true XHTML: both well-formed and valid. Sure, Mozilla doesn’t validate the pages it parses (a wise engineering choice). But regardless… is it really a good idea to say that it’s okay to be well-formed but not valid? People who build nifty XHTML-parsing-and-transforming tools should be able to *count* on your document having the correct structure as specified by the DTD. Don’t you think?

    Not that these nifty tools are exactly thick on the ground.

    Phil: Your last sentence is pretty much the crux of the issue. The state of the XHTML-generating tools is poor. The state of the XHTML-consuming tools is arguably worse. The technical benefits of XHTML for the average weblogger are very small.

    Jacques and others are working on improving the first bit, which is a good start, at least. But we have a ways to go before anyone but the physicists are going to be really *using* this crazy new-fangled technology. (And thus, in the year 2003, we’ve come full circle.)

    I wonder… if SVG plugins were as ubiquitous as the Flash player, would *that* be enough to justify more XHTML adoption? Just thinking out loud here.

  7. I’m not *advocating* sending out invalid, but well-formed XHTML.

    All I’m saying is that the expat XML parser won’t bat an eyelash, and neither (up to a point) will Gecko (which uses expat when served application/xhtml+xml) if you do.

    If it’s well-formed, it will make it through expat OK. What happens next really depends on the client application. It could crash on perfectly valid input (as Camino does with my XHTML+MathML) or it could “fix” mildly invalid input just fine (like, say, when you put inline text directly inside a <blockquote>). It really depends.

    But you are asking a strange question: what MIME-type should invalid (X)HTML be sent as?

    How about text/invalid-xhtml? 🙂

    I think that’s the wrong question. The right question is: how can I have reasonable assurance that I’m sending out *valid* XHTML?

  8. text/invalid-xhtml: Yes! Now there’s a MIME-type whose time has come!

    What I was trying to say earlier (and said badly) was that I think application/xhtml+xml should mean “well-formed and valid”. That should be our stamp of assurance. If Mozilla only checks for well-formedness, that doesn’t matter — that doesn’t take away from what that MIME-type really means. Or at least, what it means over here in Evan-land.

  9. “All of an hour.” I got your all of an hour, right here, Mr. Master Of Your Own Server. Three hours in: no errors in the error.log; every page is reported as invalid, but no errors are shown; and I get a warning on line 501 of MTValidate.pl about use of an uninitialized value, which feels to me like a problem with the tempfiles. But I’m just awful at debugging Perl – either it works, or I’m dead.

    Also, I’m really beginning to hate the “build the data, and the apps will come” plan. If something will be cool when everyone (serves application/xhtml+xml|uses RSS 1.0|marks everything up with RDF), then it will also be cool enough with a tiny faked dataset to show people why they should feed it more data.

    Sigh. Back to staring blankly at Perl.

  10. “I get a warning on line 501 of MTValidate.pl about use of an uninitialized value”

    Have you got the path to the SGML parser set correctly in plugins/validator/config/validator.conf (taking a wild stab at debugging a 1-line warning message)?

    Alexei also has a typo in the 1st line of his plugin. He says:

    # We need Perl 5.6.0+.
    use 5.006;

    Clearly, the “use” line is incorrect.

  11. Correct path (otherwise it dies saying that it’s not executable), but not a working installation. Finally got smart enough to use Data::Dumper to see the whole output, and onsgmls is dying with a couple of utterly incomprehensible errors.
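
    (For anyone else stuck at the same point, the Data::Dumper trick is nothing fancier than this, with $result standing in for whatever the validation call hands back.)

        use Data::Dumper;

        # Dump the whole structure the validation call returns, instead of
        # guessing which field holds the actual error message.
        print STDERR Dumper($result);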

    Oddly enough, I don’t think that use 5.006; is incorrect. A little googling seems to say that, in the weird world of Perl, use v5.6.0;, use 5.6.0; and use 5.006; are all exactly the same. TMTO*&!*(*(#$WTDI and all that. Anyway, I’m out.

  12. Sorry ’bout that.

    Since I used Fink to install my copy of OpenSP, I’m not even sure how easy or hard it is to compile by hand.

    I really hate the thought of the time you must have wasted on this.

    I guess we’ll have to see whether other people have similar troubles.

    A quick scan of Google leads me to the impression that, aside from Alexei, I may be the only one currently using the MTValidate plugin. And I’m certainly the only one using it to validate comments. Hopefully, that will change.
