April 29, 2003

The XHTML 100

In the spirit of Marko Karppinen's The State of the Validation, here are the results of testing 119 XHTML sites for standards compliance. This is not a rigorous scientific exercise; the methodology had several shortcomings, some of which I detail below.

Most of the sites tested are the personal web pages of the "Alpha Geeks", an elite group of well-linked web designers and programmers, along with some of their friends. Because these are individuals, I do not plan to "name names" by publishing the exact list of URLs tested. Sorry. However, the general sample group is pretty easy to reconstruct. If you're the type of person who is interested in XHTML -- if you're the type of person who would waste time reading the rest of this post -- just look at your own blogroll, and start validating. Your results should be roughly the same as mine.

This post is divided into three sections: Test Description, Data Collection, and Results.

Test Description

The tests derive from the first three criteria described in an earlier entry. I only tested sites that claimed to be XHTML -- in other words, I only validated sites that provided an XHTML DOCTYPE (or something that was trying to be an XHTML DOCTYPE, anyway.) I ignored sites that provided an HTML DOCTYPE or that didn't have a DOCTYPE at all. It would have been interesting to test HTML 4.01 standards compliance, but that wasn't what I was interested in.

The "fourth" test described in the earlier entry gets at the question of, "Why are you using XHTML in the first place?" I think this is a good question to ponder... but for this survey I thought it best to focus on the first three tests, which are less philosophical and more straightforward and mechanical.

For the sake of brevity, as soon as a site failed, I stopped applying all further tests. One strike, you're out.

The reasons why you should serve up your XHTML as application/xhtml+xml are well-documented. First and foremost, the spec says so:

The 'application/xhtml+xml' media type [RFC3236] is the [emphasis not mine] media type for XHTML Family document types, and in particular it is suitable for XHTML Host Language document types....

'application/xhtml+xml' SHOULD be used for serving XHTML documents to XHTML user agents. Authors who wish to support both XHTML and HTML user agents MAY utilize content negotiation by serving HTML documents as 'text/html' and XHTML documents as 'application/xhtml+xml'.

Second, there's Hixie's famous article on the matter, which describes why you need to use the proper MIME-type. Personally, I think Hixie is a little too strict. He argues strenuously that serving up XHTML as text/html is wrong, and then relegates to Appendix B the concept of serving up different MIME-types to different user agents: "Some advanced authors are able to send back XHTML as application/xhtml+xml to UAs that support it, and as text/html to legacy UAs..." (A side note: this distinction about "advanced" authors is a little odd. First, as the results demonstrate, XHTML is hard enough that even advanced authors get it wrong most of the time. Second, configuring your server to do some minimal MIME-type negotiation really isn't that tough. If you're advanced enough to know what XHTML is, you're advanced enough to add a few lines to your .htaccess file. Or add a little PHP snippet for your dynamic pages. Et cetera.)

Anyway, without Hixie's Appendix B, we're stuck. If you serve up your pages as application/xhtml+xml to all browsers, you'll run into IE, which chokes on this MIME-type. The only non-suicidal thing to do is to serve text/html to the primitive browsers that don't understand the proper MIME-type, and application/xhtml+xml to the ones that do.
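Here is a minimal sketch of that negotiation, written in Python purely for illustration. The function names (`parse_accept`, `quality`, `choose_content_type`) are hypothetical and do not come from any library. The sketch assumes the only input is the raw Accept header, and it takes the conservative view that a browser gets application/xhtml+xml only if it lists that type explicitly with a q-value at least as high as the one it gives text/html. A bare `*/*` does not count.

```python
def parse_accept(header):
    """Parse an Accept header into a {media_range: q} mapping."""
    prefs = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_range = parts[0].lower()
        if not media_range:
            continue
        q = 1.0  # a missing q parameter means q=1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: treat a malformed q-value as "not acceptable"
        prefs[media_range] = q
    return prefs


def quality(prefs, media_type):
    """Return the q-value for media_type, preferring the most specific match."""
    major = media_type.split("/")[0]
    for candidate in (media_type, major + "/*", "*/*"):
        if candidate in prefs:
            return prefs[candidate]
    return 0.0


def choose_content_type(accept_header):
    """Pick the MIME type to send for an XHTML page."""
    prefs = parse_accept(accept_header or "")
    # Only an explicit listing counts: a browser that merely says "*/*"
    # has not promised that it can render application/xhtml+xml.
    xhtml_q = prefs.get("application/xhtml+xml", 0.0)
    html_q = quality(prefs, "text/html")
    if xhtml_q > 0 and xhtml_q >= html_q:
        return "application/xhtml+xml"
    return "text/html"
```

Calling `choose_content_type` with the request's Accept header gives the value to put in the Content-Type response header. Unlike a plain substring check, a header that lists application/xhtml+xml with `q=0` (meaning "do not send me this") falls back to text/html.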

Data Collection

I collected results for 119 XHTML websites. I reviewed about half the sites on April 19, 2003, and the other half on April 20, 2003. I used Mozilla 1.3 to sniff for MIME-types, but for the majority of my testing I used Safari Beta 2, because of its superior speed and tab management. (A side note: for beta software, Safari performed extremely well, humming along smoothly with fifteen or twenty tabs open at once. It did consistently crash on a couple of URLs, which I plan to submit with the bug reporting tool.)

Finding 119 XHTML websites is not quite as easy as it first appears. At first I tried searching Google for terms such as "XHTML standards" or "XHTML DOCTYPE". But as it turned out, sites that talk about XHTML standards and DOCTYPEs are surprisingly unlikely to be XHTML sites.

I finally hit upon a method that yielded a reasonable percentage of XHTML websites. I went to the blogs of several very well-known bloggers who write about web standards: the "Alpha Geeks". I then methodically went through their blogrolls. Some observations:

Many Alpha Geeks (including some vociferous standards advocates) failed validation very badly, with dozens and dozens of errors of varying types. On the other hand, a few Alpha Geeks came tantalizingly, frustratingly close to validation. Typically this sort of failure would arise on the last page, where the author would make a tiny error such as forgetting to escape a few entities or inserting naked text inside a blockquote. I can certainly understand how these kinds of errors can creep in, no matter how diligently you try to avoid them. (And I can sympathize -- the blockquote validation error is a personal bugbear of mine.)

But it doesn't matter whether I feel bad or not. It doesn't matter if I think the errors are "small" or "forgivable". That has absolutely nothing to do with the specs, or the validator...

"Listen! And understand! That Validator is out there. It can't be bargained with! It can't be reasoned with! It doesn't feel pity, or remorse, or fear. And it absolutely will not stop, EVER... until you are validated!"

And, umm, on that note, let's get to the results.

Results

Of the 119 XHTML sites tested:

I know I promised not to name names, but I must make an exception. For the one man in the entire set who passed all three tests, let's hear it for... beandizzy! Yay beandizzy! At the time of this writing, beandizzy is reformulating his design -- but as of a week ago, his site validated perfectly and served up the right MIME-type. So congratulations, beandizzy. You have beaten the elite of the elite. You stand alone on the mountain top. (Well, there might be the occasional string theorist standing alongside you -- but really, physicists are best ignored.)

As for the rest, the results speak for themselves. Even among the elite of the elite, the savviest of the savvy, adherence to standards is pretty low. Note that this survey most likely overestimates adherence to XHTML standards, since you would expect the Alpha Geeks to rate high on XHTML standards comprehension.

Also, I have to admit that I grew rather emotionally invested in the test process. I figured twenty sites would be enough to get at least one compliant site. When that failed, I went on to 40, 60, ... amazed that not one site had passed. By the time I reached beandizzy's site (#98) I was pretty drained. I surveyed the rest of the blogroll I was on and then gave up. So again, this survey most likely overestimates XHTML standards adherence, because I quit soon after I got one success.

Conclusions are forthcoming. But there's one thing that's clear right off the bat: XHTML is pretty damn hard. If the Alpha Geeks can't get it right, who can?

Comments

  1. Here’s another.

    I submit my entire site, except for the photolog.*

    Also note that the forthcoming WordPress will be XHTML 1.0 Strict out of the box.

    * The photolog is using legacy software that I’ve managed to coax in places to a semblance of reliance, but to get it to approach anything approaching semantic would be a decidedly non-trivial endeavour.

    Posted by Matt on May. 05, 2003 at 10:24 PM [#]

  2. Bravo, Matt! I have been struggling this very week to get legacy software to spit out something resembling valid markup. So I’m certainly willing to give your Photolog a pass.

    So now there are three. Maybe I need to start keeping a list.

    Posted by Evan on May. 05, 2003 at 10:47 PM [#]

  3. If you wanted to be cruel, you could set up a webring that requires invalid code to join. Interesting site BTW, I’ll be checking back.

    Posted by Matt on May. 05, 2003 at 11:02 PM [#]

  4. Interesting survey. I tested myself and found I failed at level 3, which was the final nudge I needed to add the PHP content negotiation code.

    Of course, this means that if I ever accidentally introduce badly formed XML in to my site it will die with an ugly error page in Mozilla. Hopefully I’ll be the first to notice.

    Posted by Simon Willison on May. 06, 2003 at 5:05 AM [#]

  5. And another… I finally got around to changing the content type of my root directory pages via php - seeing as there are only 5 pages in there, you can tell how much free time I’ve had lately thanks to sweet, sweet real life.

    And now….the rest…. can wait until the “big clean up” and log script finishing effort this summer. ;-)

    Posted by insin on May. 06, 2003 at 5:35 AM [#]

  6. I passed all three. For those of you using PHP, the code for level 3 is simple… Here it is:

    // Send the XHTML MIME type only if the browser's Accept header mentions it
    if ( stristr($_SERVER["HTTP_ACCEPT"], "application/xhtml+xml") ) {
        header("Content-type: application/xhtml+xml");
    } else {
        header("Content-type: text/html");
    }

    This basically ensures that you can serve to the widest audience and still generate proper XHTML for those who can read it.

    TNL

    Posted by Tristan Louis on May. 06, 2003 at 8:18 AM [#]

  7. Hello, gentlemen! Just be careful, now… when it comes to the advanced browsers, you are flying without a net. Check out Jacques Distler’s post about invalid comments (if you haven’t already). His post is MT-specific, but it might prove useful in general:

    http://golem.ph.utexas.edu/~distler/blog/archives/000155.html

    Hmmm. I definitely need to start a list.

    Posted by Evan on May. 06, 2003 at 8:23 AM [#]

  8. I guess I’m more of a gamma geek since I just launched my blog and have very few readers. But I believe it would pass the tests.

    Thank you for taking the time to do this; it’s interesting/disconcerting.

    Posted by Jon Wiley on May. 06, 2003 at 11:46 AM [#]

  9. I don’t think it’s that XHTML is difficult - properly nest your tags, keep everything in lower case, quote your attributes and you’re 90% of the way there. Add the validator and it becomes extremely easy to spot your mistakes.

    What’s difficult is maintaining XHTML validity on sites that are constantly updated with new mark up every day as new blog entries are added.

    Posted by Simon Willison on May. 06, 2003 at 2:43 PM [#]

  10. I’m surprised that you only managed to find one site that passed the test.

    Not trying to blow my own trumpet here (honestly), but for the past 5 or 6 months http://www.xiven.com has passed all three of the tests you mentioned (does that make me an Alpha Geek). It also sends valid HTML4 to non-XHTML-supporting browsers.

    Additionally, http://www.aagh.net/blog/ seems to pass the tests with flying colours.

    Posted by Tom on May. 06, 2003 at 2:54 PM [#]

  11. Tristan: Your code fails to take qvalues into account, and only works by coincidence on today’s browsers. See http://www.klio.org/marks/200304archive.html#entry-40 for a fuller explanation (and correct code in Perl).

    Posted by Mike Kozlowski on May. 06, 2003 at 8:01 PM [#]

  12. Jon: Check.

    insin: Check. And hey, no more bragging about sweet, sweet real life.

    Tom: Check. And check. Yes, I was surprised at the number of failures too.

    Tristan: The first couple of pages validate fine, but there are a couple of problems with the “broadband” templates. Please do let me know if these get fixed. (As for qvalues: my criterion was simply that as long as Mozilla 1.3 says that it’s happy, I’m happy. :) )

    Simon: Yeah, I would have thought that it shouldn’t be too difficult, particularly after some practice getting the hang of the main gotchas: entities and proper tag nesting. But the numbers on pure validation alone are pretty dismal. A 90% failure rate.

    The problem is that keeping a site perfectly valid (or at least valid enough to pass test #2) takes either heroic diligence or very good tools. Both of which are in short supply.

    Posted by Evan on May. 07, 2003 at 12:06 AM [#]

  13. “XHTML is pretty damn hard” — HTML4 is just as hard. Try the same test with HTML, you’ll see what I mean. The only differences are that most people have had more time to get familiar with HTML, and the MIME type issue doesn’t exist any more.

    Give it a few years (like, ten or twenty). We’ll get there.

    Posted by Ian "I'm still using HTML4" Hickson on May. 07, 2003 at 6:15 AM [#]

  14. Paul Snowden of idontsmoke had a weblog that conformed to your rules a year ago (or more, I’m not sure when he took it down). Now he uses the minimum that validates as HTML 4.01 Strict instead.

    Posted by liorean on May. 08, 2003 at 1:32 PM [#]

  15. If you are serving the right mime type for XHTML 1.1 it would seem to me that keeping the site valid is pretty easy. Each time I screw something up, a good browser should refuse to render the page and it will give me an error. As long as I look at my pages after an edit I should be able to catch problems quickly, no need to re-validate because maximal standards compliance is its own validation.

    I suppose a page could be well-formed, but not valid. Can that occur under XHTML 1.1?

    Posted by Jon on May. 08, 2003 at 3:04 PM [#]

  16. “I suppose a page could be well-formed, but not valid. Can that occur under XHTML 1.1?”

    Absolutely! Evan gave his favourite example above:

    <blockquote>Spoons!</blockquote>

    is well-formed, but invalid.

    <blockquote><p>Spoons!</p></blockquote>

    is valid.

    Mozilla will render both, as they are both well-formed.

    Posted by Jacques Distler on May. 08, 2003 at 10:40 PM [#]

  17. Ian, liorean — yup, see, all the cool kids are switching back to HTML 4.01. :)

    Certainly from a pure markup perspective, HTML 4.01 Strict is pretty tough. I’d argue that it’s actually tougher than XHTML 1.0 Transitional.

    The latter is really about just quoting your attributes, closing all your tags, and moving on. But in HTML 4.01 Strict there are a lot of new and surprising tag-ordering issues (like the accursed blockquote problem) that you have to deal with. Also, HTML 4.01 Strict takes away a lot of presentational crutches — there’s no align=”center”, no border=”0” for images, and so on. It forces you to lean harder on CSS and maybe even think a little more semantically. Crummy old HTML4, who’d have thunk it?

    Posted by Evan on May. 09, 2003 at 12:20 AM [#]

  18. I don’t have too much of a problem with XHTML 1.1, other than that the spec says no text/html for it, and I do send text/html to UAs that don’t support application/xhtml+xml. I even modify the meta tag so that it agrees.

    My problem is the stupid ampersand. Blogshares, for instance, requires invalid markup in a link. I usually wait a few days until the site has been claimed, and then change it back to what it is supposed to be.

    Posted by David on May. 09, 2003 at 2:18 PM [#]

  19. My blog (click on my name) should pass all three tests.

    Posted by Blo on May. 10, 2003 at 2:22 PM [#]

  20. Sorry for the double post, but I just checked and my site is also officially “Bobby 508 Approved”. Great post by the way Evan.

    Posted by Blo on May. 10, 2003 at 2:45 PM [#]

  21. Could somebody post step-by-step instructions for people that don’t know a thing about PHP, or even if it is supported by their server?

    Posted by Andrew on May. 11, 2003 at 6:16 AM [#]

  22. The latest post in my blog is about how to set your server up to display the right MIME type.

    Posted by Andrew on May. 11, 2003 at 8:24 AM [#]

  23. Tristan mentioned how to use PHP to send a proper MIME type. Here’s an extension of that (that I coded without knowing of this one).

    <?php
    /* Due to some browsers spawning a vacuum (read: != Gecko) we sometimes need to
     * send an incorrect MIME type
     * See: http://www.w3.org/TR/xhtml-media-types/#application-xhtml-xml
     * See: http://www.w3.org/People/mimasa/test/xhtml/media-types/results
     */
    if ( stristr($_SERVER["HTTP_ACCEPT"], "application/xhtml+xml") ) {
        header("Content-type: application/xhtml+xml");
    } else if ( stristr($_SERVER["HTTP_ACCEPT"], "application/xml") ) {
        header("Content-type: application/xml");
    } else if ( stristr($_SERVER["HTTP_ACCEPT"], "text/xml") ) {
        header("Content-type: text/xml");
    } else {
        header("Content-type: text/html");
    }

    This only sends an incorrect MIME type if I really REALLY need to. As can be seen in those links, some browsers support text/xml and application/xml without supporting application/xhtml+xml, so don’t give up on sending something correct and fall back to text/html too soon.

    Posted by Bjarmason on Jan. 01, 2004 at 3:54 AM [#]

  24. This only sends an incorrect MIME type if i really REALLY need to

    But is that prudent? I don’t know if it really does it, but IE might well send an accept header that lists text/xml and application/xml, but it only displays them as an XML tree even if they are XHTML. That’s exactly the problem case described in the “q” problem btw. My page (as seen in the header) validates as XHTML 1.1, but is sent as application/xml. IE refuses to even display the tree because it cannot handle the XHTML 1.1 DTD.

    Posted by CornedBee on Jan. 06, 2004 at 10:52 AM [#]

  25. I agree with CornedBee, it’s best to be prudent. The browsers that support application/xhtml+xml are usually trustworthy* when it comes to their ACCEPT headers. The browsers that don’t, aren’t… and if you trust these poorly-designed ACCEPT headers, you’ll probably get burned. To be on the safe side, it is best to keep it simple and send the “bad” browsers text/html. There’s no shame in that; after all, it’s not your fault that certain browsers can’t be trusted.

    * See the caveat on Jacques’s site: http://golem.ph.utexas.edu/~distler/blog/archives/000167.html. Jacques has found that embedding other XML content (i.e. MathML) doesn’t work with all browsers that explicitly support application/xhtml+xml in the ACCEPT header. So he actually has to list out all his “good” browsers explicitly. It’s a crazy world we live in.

    Posted by Evan on Jan. 06, 2004 at 12:39 PM [#]

  26. Hi folks,

    I’ve been working on XHTML on my site (on and off) for a few months and I think I pass tests 1 and 2.

    I don’t pass test 3, but I’m not sure I have to. I understand why I should, but the standard does /not/ say I must.

    I moved over to XHTML1.1 mostly. There’s a good reason for this: much less tag soup, much more separation of the presentation using CSS. To me, this was more important than passing test 3, which I knew about but chose to ignore for now (in the belief, as I said, that I am /not/ infringing the standard).

    I expend effort testing my site for well-formed, valid XHTML using Tidy (for all pages) and the W3C Validator (for spot checks).

    So although I don’t claim my site is perfect (realistically, it’s too large to be sure it’s perfect), I am moderately happy that I haven’t crashed and burned.

    …unless you know better ;-)

    Posted by Rick on Jan. 12, 2004 at 10:52 AM [#]

  27. It’s been several months since your post, but I may as well provide you a link to another XHTML-compliant website. In fact, this one is XHTML STRICT, and it’s been compliant for well over a year:

    http://www.VoyagerRadio.com

    It may not be much to look at, but it’s compliant and accessible via wireless devices.

    I’ve found your journal, by the way, in my own hunt for XHTML sites. I’m looking particularly for Strict sites.

    Posted by Harold J. Johnson on Feb. 29, 2004 at 10:04 AM [#]
