XML: Not Great for Documents, Actually

Norman Walsh, “Deprecating XML”:

“In short, if all you need are bundles of atomic values and especially if you expect to exchange data with JavaScript, JSON is the obvious choice. I don’t lose any sleep over that.

XML wasn’t designed to solve the problem of transmitting structured bundles of atomic values. XML was designed to solve the problem of unstructured data. In a word or two: mixed content.”

Daniel Lemire, “You probably misunderstand XML”:

“Thankfully, it appears that history is on my side. Developers got tired of getting these annoying XML payloads. In time, they started using JSON, a much more appropriate format for passing small loads of structured data between a server and an ECMAScript client…

Where does that leave XML at? Precisely where it started. XML is a great meta-example on how to deal with semi-structured data. And it is just as useful as ever. Want to deal with documents? DocBook and OpenDocument are great formats.”

Speaking as a technical writer who has spent the last several years authoring DocBook, customizing DocBook stylesheets, and writing build scripts around DocBook: no. No, DocBook is not a great document format. It is an acceptable document format.

First, I’ll grant that DocBook is vastly superior to the proprietary binary formats that came before it, for all the obvious reasons: you can diff it, you can check it into version control, you can edit it with just about any editor you like on any platform, the base toolset is free, et cetera. That’s kind of the minimum you want to see for any modern documentation format, and with that in mind, I am thrilled to use DocBook over, say, AuthorIT or Word. (I actually still miss FrameMaker, but DocBook wins there too, unless your only output format is a 7″x9″ perfect-bound book.)

Second, DocBook contains a number of critical design errors that, to be perfectly fair, have nothing to do with its XML nature. The fact that man pages require a completely different syntax from books and articles is not XML’s fault. The frightening size and complexity of DocBook’s vocabulary is (probably) not XML’s fault.

But even ignoring these flaws, DocBook is saddled with a number of XML-related problems:

  1. As an XML document format, DocBook has tied itself to the boat anchor of XSLT. The DocBook stylesheets expose a wide array of fiddly knobs to turn using parameters, and if you just stay in that space, you’re basically okay. But if you want to do a more aggressive customization — say, have a full TOC on the left hand side of every page — you are screwed. You either need to do radical field surgery on the stylesheets, with the patient screaming and fighting you all the way. Or you need to throw away the XSLT and transform DocBook some other way, starting over from scratch.

  2. As an XML document format, DocBook is an unproductive authoring environment. Even though I’ve been successfully hand authoring angle-brackety things for my entire professional career, and even though I’m aware of a vast array of tools to make hand authoring angle-brackety things easier, I am choosing to author this post in Markdown, not HTML or DocBook. Why? Because no tool is ever going to change the fact that creating a list by typing “\n1. As an XML document format…” is easier than typing a slew of open and close tags.

    DocBook penalizes you on reads as well as writes. DocBook is readable, in the sense that it can be read. Lightweight markup languages like Markdown and reStructuredText are readable, in the same way that a book is readable.

    And yes, there are a couple of decent semi-WYSIWYG DocBook editors out there. I even use one! It is so choice. If you have the means, I highly recommend picking one up. But that’s kind of the point — these tools are just an expensive patch to make up for the massive productivity loss of actually writing DocBook XML directly. (“But wait!” you cry. “I’ve set up all kinds of clever DocBook shortcuts and macros in my editor!” That’s swell. You are still wasting huge amounts of time. Stay the hell off my doc team.)

  3. As an XML document format, DocBook is forced to use draconian error handling. One mistake and parsing must stop, you’re done. The thinking behind this feature was that if you’re sending a transaction to your bank, and the payload is malformed, you don’t want the bank’s server “guessing” what you might have wanted.

    But novels and articles and technical manuals are not bank transactions. If I forget to close an inline tag, or I screw up a cross-reference, that’s not a good thing. But should a minor mistake break my doc build? For my team, I lean towards an answer of “yes.” You, however, write in a different environment, for a different audience, and perhaps have a different answer for how you would prefer to run your workflow. But screw what you and I think. The XML specification has made that decision for us.

    Oddly enough, draconian error handling was built for the very use case that XML evangelists now seem to be backing away from. Okay, JSON has superseded XML for data interchange on the web, but that’s just fine! XML was always for document formats, really.

    If XML is more or less popular than JSON among front end engineers, it’s really not a big deal to me. I don’t write web APIs for a living, I just write about them.

    But if people are arguing that XML’s core strength is in documentation, then as someone who writes documentation full time, let me say that this is just bunk. That might have been true in 1998, but today there are better choices out there. Markdown is simple and has multiple implementations. If you need something more powerful, reStructuredText + Sphinx has a rich feature set that in many places matches DocBook, but with far less complexity.
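
To make the draconian error handling in point 3 concrete, here is a minimal sketch using Python’s standard-library `xml.etree.ElementTree`. The DocBook-style fragments and the `try_parse` helper are made up for illustration; the only point is that one unclosed inline tag leaves the parser with nothing to hand back but an error.

```python
import xml.etree.ElementTree as ET

# Hypothetical DocBook-style fragments (not from any real manual).
# In BROKEN, the <emphasis> tag is never closed.
BROKEN = "<para>Install the <emphasis>server package first.</para>"
FIXED = "<para>Install the <emphasis>server</emphasis> package first.</para>"


def try_parse(source):
    """Return (True, root tag) on success, or (False, error text) on failure."""
    try:
        root = ET.fromstring(source)
    except ET.ParseError as err:
        # The parser does not guess what was meant; it reports the
        # well-formedness error and returns no partial tree.
        return False, str(err)
    return True, root.tag


if __name__ == "__main__":
    for label, source in (("broken", BROKEN), ("fixed", FIXED)):
        print(label, try_parse(source))
```

Whether a build script should treat that failure as fatal is a workflow decision, but with an XML toolchain the parser has already made it before your script gets a say.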

Not that DocBook is terrible. However, its XML lineage is not an asset. XML is not the best way to say, “Here is a paragraph. Here is a list item. Here is an index entry. Here is a cross-reference.” Don’t let anyone kid you into thinking it is.
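
As a rough illustration of the authoring overhead described in point 2, the sketch below renders the same three-item list as a Markdown numbered list and as a DocBook `orderedlist`, then counts the characters that are markup rather than content. The item text and the helper names (`as_markdown`, `as_docbook`, `markup_overhead`) are invented for this example, and character count is only a crude stand-in for typing and reading effort.

```python
import xml.etree.ElementTree as ET

# Hypothetical list content, loosely echoing the three complaints above.
ITEMS = ["Tied to XSLT", "Unproductive to author", "Draconian error handling"]


def as_markdown(items):
    """Render the items as a Markdown numbered list."""
    return "\n".join(f"{n}. {text}" for n, text in enumerate(items, start=1))


def as_docbook(items):
    """Render the items as a DocBook orderedlist.

    Assumes the item text contains no XML special characters,
    so no escaping is done here.
    """
    rows = "\n".join(
        f"  <listitem><para>{text}</para></listitem>" for text in items
    )
    return f"<orderedlist>\n{rows}\n</orderedlist>"


def markup_overhead(rendered, items):
    """Count characters in the rendered source that are not item text."""
    return len(rendered) - sum(len(text) for text in items)


if __name__ == "__main__":
    for name, render in (("Markdown", as_markdown), ("DocBook", as_docbook)):
        source = render(ITEMS)
        print(name, "overhead:", markup_overhead(source, ITEMS), "characters")
        print(source)
        print()
```

Editor macros can cut down the typing, but they do nothing for the person who later has to read the source.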

8 thoughts on “XML: Not Great for Documents, Actually”

  1. I couldn’t agree with you more. I read something ages ago about how XML’s ubiquity has more to do with Java’s deficiencies in object storage than any inherent strengths of the format, and I couldn’t agree more. For small data transfer: JSON. For large-ish data exchange: CSV/passwd format. For documents…

    Man, I don’t care much for any of the entrants. This is still a space that has room for innovation.

  2. Auros — the title of this post was originally going to be, “I Think I Understand XML Just Fine, Thanks”, but I decided to dial it back a bit. 🙂

    Jemaleddin — the Java story makes sense to me. I am just old enough to remember when XML was coming out… boy oh boy was Sun trying to hop on that bandwagon fast.

    I agree there’s lots of room for improvement. The good news is that the lightweight markup formats can innovate much faster than the XML doc formats can. Think about how much easier it would be to contribute to Sphinx versus diving into a morass of XSLT.

    On the downside, right now it’s only the independent programmers who are driving innovation in documentation… which means they will stop when their itch is scratched. Tool/API writers like myself will be in fine shape — we know how to install things and run make. But people writing end-user docs will be left out in the cold. They need something slicker and more push-button. Like Scrivener, but for technical manuals (and with a good file format underpinning the thing).

  3. Pretty much all the process control products acquired an “export to XML” feature in the early 00s and I had to explain to people that it was about as useful as “export to CSV”.

    Yeah, it’s textual data; yeah, it’s kind of human readable; no, it doesn’t make it any easier to actually do anything with your systems and we’d rather just reverse-engineer the binaries as usual, thanks.

  4. It’s hardly going to be surprising that I consider the problems that you enumerate to be features.

    That said, if you don’t like DocBook, use something else.

    That said, I stand by my convictions that mixed content is important and XML is vastly superior to JSON for handling mixed content.

  5. Norman, I’m honored!

    As I said up top, DocBook is not a terrible documentation format, which puts it well ahead of many other formats that are in wide use today. In fact, there was a long period where DocBook was really the only sensible choice for any kind of technical documentation.

    That said, the world is moving forward. I *strongly* encourage you and the other core DocBook developers to take an in-depth look at the lightweight markup languages that have grown up in the last 5-10 years. At this point it is just far, far easier to author and customize a ReST + Sphinx manual than a DocBook manual. The good news is that there are a lot of fixes you and the DocBook team could make to close the gap, even without changing DocBook’s fundamental XML nature (a non-starter, obviously).

  6. This may be slightly off topic, but there’s a great codebase documentation auto-generator called Doxygen that parses in-code Java-style comments, along with Markdown, into nice-looking codebase documentation.

    I’ve never done any other type of technical writing; almost all of my documentation is Java-style inline comments. 😛
