XML: Not Great for Documents, Actually

Norman Walsh, “Deprecating XML“:

“In short, if all you need are bundles of atomic values and especially if you expect to exchange data with JavaScript, JSON is the obvious choice. I don’t lose any sleep over that.

XML wasn’t designed to solve the problem of transmitting structured bundles of atomic values. XML was designed to solve the problem of unstructured data. In a word or two: mixed content.”

Daniel Lemire, “You probably misunderstand XML“:

“Thankfully, it appears that history is on my side. Developers got tired of getting these annoying XML payloads. In time, they started using JSON, a much more appropriate format for passing small loads of structured data between a server and an ECMAScript client…

Where does that leave XML at? Precisely where it started. XML is a great meta-example on how to deal with semi-structured data. And it is just as useful as ever. Want to deal with documents? DocBook and OpenDocument are great formats.”

Speaking as a technical writer who has spent the last several years authoring DocBook, customizing DocBook stylesheets, and writing build scripts around DocBook: no. No, DocBook is not a great document format. It is an acceptable document format.

First, I’ll grant that DocBook is vastly superior to the proprietary binary formats that came before it, for all the obvious reasons: you can diff it, you can check it into version control, you can edit it with just about any editor you like on any platform, the base toolset is free, et cetera. That’s kind of the minimum you want to see for any modern documentation format, and with that in mind, I am thrilled to use DocBook over, say, AuthorIT or Word. (I actually still miss FrameMaker, but DocBook wins there too, unless your only output format is a 7″x9″ perfect-bound book.)

Second, DocBook contains a number of critical design errors that, to be perfectly fair, have nothing to do with its XML nature. The fact that man pages require a completely different syntax from books and articles is not XML’s fault. The frightening size and complexity of DocBook’s vocabulary is (probably) not XML’s fault.

But even ignoring these flaws, DocBook is saddled with a number of XML-related problems:

  1. As an XML document format, DocBook has tied itself to the boat anchor of XSLT. The DocBook stylesheets expose a wide array of fiddly knobs to turn using parameters, and if you just stay in that space, you’re basically okay. But if you want to do a more aggressive customization — say, have a full TOC on the left hand side of every page — you are screwed. You either need to do radical field surgery on the stylesheets, with the patient screaming and fighting you all the way. Or you need to throw away the XSLT and transform DocBook some other way, starting over from scratch.

  2. As an XML document format, DocBook is an unproductive authoring environment. Even though I’ve been successfully hand authoring angle-brackety things for my entire professional career, and even though I’m aware of a vast array of tools to make hand authoring angle-brackety things easier, I am choosing to author this post in Markdown, not HTML or DocBook. Why? Because no tool is ever going to change the fact that creating a list by typing “n1. As an XML document format…” is easier than typing a slew of open and close tags.

    DocBook also penalizes you on reads as well as writes. DocBook is readable, in the sense that it can be read. Lightweight markup languages like Markdown and reStructuredText are readable, in the same way that a book is readable.

    And yes, there are a couple of decent semi-WYSIWYG DocBook editors out there. I even use one! It is so choice. If you have the means, I highly recommend picking one up. But that’s kind of the point — these tools are just an expensive patch to make up for the massive productivity loss of actually writing DocBook XML directly. (“But wait!” you cry. “I’ve set up all kinds of clever DocBook shortcuts and macros in my editor!” That’s swell. You are still wasting huge amounts of time. Stay the hell off my doc team.)

  3. As an XML document format, DocBook is forced to use draconian error handling. One mistake and parsing must stop, you’re done. The thinking behind this feature was that if you’re sending a transaction to your bank, and the payload is malformed, you don’t want the bank’s server “guessing” what you might have wanted.

    But novels and articles and technical manuals are not bank transactions. If I forget to close an inline tag, or I screw up a cross-reference, that’s not a good thing. But should a minor mistake break my doc build? For my team, I lean towards an answer of “yes.” You, however, write in a different environment, for a different audience, and perhaps have a different answer for how you would prefer to run your workflow. But screw what you and I think. The XML specification has made that decision for us.

    Oddly enough, draconian error handling was built for the very use case that XML evangelists now seem to be backing away from. Okay, JSON has superceded XML for data interchange on the web — but that’s just fine! XML was always for document formats, really.

    If XML is more or less popular than JSON among front end engineers, it’s really not a big deal to me. I don’t write web APIs for a living, I just write about them.

    But if people are arguing that XML’s core strength is in documentation, then as someone who writes documentation full time, let me say that this is just bunk. That might have been true in 1998, but today there are better choices out there. Markdown is simple and has multiple implementations. If you need something more powerful, reStructuredText + Sphinx has a rich feature set that in many places matches DocBook, but with far less complexity.

Not that DocBook is terrible. However, its XML lineage is not an asset. XML is not the best way to say, “Here is a paragraph. Here is a list item. Here is an index entry. Here is a cross-reference.” Don’t let anyone kid you into thinking it is.