Saturday 26 October 2002

To XML or not to XML?

“Something remarkably freeing about a good old-fashioned rant,” wrote Dorothea Salo, about her post on why PDF sucks. Something remarkably instructive, too, even for someone like me who—at the moment, anyway—doesn’t care much either way about PDF.

Dorothea’s thoughts on the long-term viability of archived data grabbed me by the throat. You see, I’m a chronic backer-up. I synchronize all my documents to a second internal hard drive at least once a day. I know what you’re thinking: “Who is he kidding? That’s not a backup!” I didn’t say it was a backup. It’s just a duplicate in case Windows eats some work in progress. The backups are to DAT tape, 1.3GB magneto-optical cartridge, and CD-ROM: four sets (one set each at two different friends’ houses, one set in a safe deposit box at the bank, and the fourth set in current use). Why? Because I’ve seen and heard too many tales of lost data. Is my data so important? Yes, it is. To me.

But, after reading Dorothea’s post, I suddenly asked myself: “What’s the point of backing up all this stuff if it’s stored in dead-end or proprietary formats?”

Which formats? Well, I’m not worried about HTML documents or my weblog content (which I can export as a text file from Movable Type in a matter of minutes). So that leaves:

I can export from QuickBooks Pro as delimited text, from StoryView as RTF, and from Biblioscape as delimited text, RTF, or HTML. But Dorothea’s favored formats are TIFFs for the images and “well-constructed OEB 1.2-compliant XML files” for the text.

Since JPEG will be around indefinitely, there seems little point in converting the JPEGs to TIFFs. If I convert the PSDs and PNGs to TIFF, I’ll lose the layers and effects. So I guess the crucial issue is whether or not I should convert the exported RTF files from Word, StoryView, and Biblioscape to XML. And, if I should, is there a reasonably priced, reliable tool that will do the job?

Comments

RTF is text-based, so it's a vast improvement over a binary. It's archivable, in my opinion. Maybe not ideal -- it depends -- but certainly archivable.

The tool you want is OpenOffice, by the way. :)

Posted by: Dorothea Salo on 26 October 2002 at 12:27 AM

PNG is the newest common image format, but I would be surprised if it went away soon.

Posted by: Mark Paschal on 26 October 2002 at 12:54 PM

Though, on the Word front, it seems that Office 11 will use XML-based file formats (http://www.sys-con.com/xml/articlenews.cfm?id=513), so you may be able to keep your Word documents just the way they are. Too early to tell if Microsoft will do something horrible and make their files impossible to work with, but it's a nice step in the right direction.

Posted by: John on 26 October 2002 at 10:43 PM

OpenOffice.org Marketing hat on. OpenOffice.org, the free open source office suite, has had an open spec, native XML file format for the last two years. In the stable 1.0 version it's ZIP compressed, but in the new developer builds there's an option to save as 'flat' XML if you have plenty of space or a special need for it.

The conversion to and from Word is pretty rockin as well. Email me if you want some more info. We like to chat.

Posted by: nick on 27 October 2002 at 04:02 AM

Mark, PNG has never really taken off (though Macromedia uses an extended PNG format as the native file format for Fireworks).

John, I'm still using Office 2000 and I can't imagine that Microsoft will offer anything in Office 11 that might persuade me to switch. (Call me cynical, but I can't help thinking that their XML-based file formats won't be standard XML.)

Nick, impressive! Does the OpenOffice team have listening posts all over the world?

Posted by: Jonathon Delacour on 27 October 2002 at 09:42 PM

"What's the point of backing up all this stuff if it's stored in dead-end or proprietary formats?"

If we can decipher German and Japanese codes, especially designed for us not to be able to understand, then surely we will be able to decipher your formats, which are designed to be easily processable.

Posted by: Aaron Swartz on 29 October 2002 at 01:03 PM

Aaron, though you're almost certainly correct, I do like the idea of having my stuff stored in a relatively transparent format such as XML.

Posted by: Jonathon Delacour on 29 October 2002 at 03:12 PM

The JPEG standard is publicly available, as is, I think, the standard for PNG. So, at least for those two image types, it will always be possible for someone to build a decoder. The complaint about Word and PDF is not just that they are binary, but that they are closed standards, so if the company decides to stop supporting them, you could be left without a way to read them.

Posted by: Jim Sfekas on 30 October 2002 at 06:44 AM

PDF is far from perfect, but is not that bad. It is true that Adobe controls the future evolution of the standard, but they also publish pretty good documentation of the standard (in marked contrast with the story on Word), and there is also good free software for dealing with it, including Ghostscript (of which I am the maintainer) and xpdf. Given its popularity, there's as good a chance of it being easily read in 100 years as any of the myriad xml-syntaxed markup languages.

The difficulties with PDF are primarily that it's not editable, and that it's not at all a trivial job to get just the text out.

Even so, I agree that there's a _dire_ need for a standard, transparent document format capable of high quality rendering. Not only do the popular formats of today fall considerably short, but I don't see any of them heading in the right direction.

The one format that probably comes closest is (La)TeX. It's quite popular in science circles because of its excellent math formatting, but has many problems with usability and other issues, so I don't see it catching on either.

But we can all hope!

Posted by: Raph Levien on 30 October 2002 at 08:23 AM

This discussion is now closed. My thanks to everyone who contributed.

© Copyright 2002-2003 Jonathon Delacour