I have often thought about the fact that I daily entrust increasingly more of my mental functions to my computer. For better or worse, this is the case. If a calculator is conveniently nearby, for example, I will use that instead of estimating, because it’s faster.

But nowhere is this more pronounced than with my memory. I have a good memory for process (how to do something) and a lousy memory for specific details about space and time (where did I put that? when did I say that, and what did I say?). This makes me, of course, almost a perfect candidate for a computer geek, having the computer complement me by transforming my good process memory into its own faultless space-time memory.

Just last week I discovered a massive impediment to this approach: proprietary binary file formats. Cursed may they be forever. Ironically, “forever” is the one thing you just can’t say about such formats. I’m talking about Microsoft Word and Works, about AppleWorks, about SuperPaint, about MacDraw, all of them. They are shockingly temporary. I’m not the only person who has noticed this, but what follows is my own experience and my own recommendations.

The Incident

A few weeks ago I found myself browsing some very old folders on my main drive. These were copies of disks dating back to 1988 and earlier, when I was in college. Back in those days I generally used a Macintosh SE running Microsoft Word. I wrote letters, papers, journal entries, everything in this format. I also drew graphs and figures for various projects using something called SuperPaint.

In those days, computers were considered desktop publishing machines primarily, so there wasn’t much thought given to the intermediate form. The idea was that you would eventually get to a printed page and then that was what mattered. In some ways this was a superior approach, since the printers were often laser printers, making very long-lived paper records.

However, I did keep my old 3.5" disks, and when I bought an iMac in the summer of 2000, one that lacked a 3.5" drive, I used an old PowerBook and an Ethernet cable to transfer all of the files to the iMac, one folder per disk. I browsed a couple of the files, at the time, on the PowerBook, which of course had a sufficiently ancient version of Word that I was able to view them easily and without any concern about the future.

You see, I had this silly idea that the Microsoft Word format was somehow stable and fully backward compatible, in other words, that all later formats were simple supersets of the earlier ones so that as long as I had a copy of Word, I’d be able to read any Word document ever written, ad infinitum.

Silly boy! What a totally untrue assumption!

Microsoft wants you to keep buying new copies of Word. They are not interested in the longevity of your memorabilia. The way they do this is by making new versions of Word incompatible with the old, so that when your friends start sending you Word 2000 documents and your Word 95 can’t read them, well then it’s time to upgrade, isn’t it?

I know that new versions of Word will be backward compatible to a certain horizon (maybe able to import 2 or 3 versions into the past), but who would possibly consider making sure that you could read hoary old Microsoft Word 3 for Mac versions? Unless they were committed somehow to eternal backward compatibility, it just wasn’t going to happen. And it didn’t.

So here I am, in 2002, trying to read these documents. I don’t own Word, I own AppleWorks 6, which can import Word documents. But the old docs were beyond its horizon too.

I tried MacLinkPlus, my faithful friend. It also choked, and choked hard, on the old versions, actually crashing. Since I was having to run it under Classic mode, each document that it choked on required a restart of Classic and about 5 minutes of wasted time.

I went to upgrade MacLinkPlus, paying about $40, so I could run native under OS X and perhaps pick up some bug fixes. Yes, this time it converted the Word 4 docs and the SuperPaint pictures, but simply passed over and refused to convert the Word 3 documents.

I eventually had to resort to opening the Word 3 docs in Emacs and manually extracting whatever ASCII text survived. Fortunately this was most of it, but why should I be forced to do this kind of archaeology on a letter less than 20 years old?
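
The manual extraction described above can be approximated in a few lines of Python. This is only a minimal sketch, not what Emacs does internally: it assumes the surviving text is stored as runs of printable 7-bit ASCII, and the function name `extract_ascii_runs` and the `min_len` threshold are made up for illustration.

```python
import re

def extract_ascii_runs(data: bytes, min_len: int = 4) -> list:
    """Return every run of printable 7-bit ASCII at least min_len bytes long."""
    # Printable ASCII is 0x20 through 0x7E; tabs are kept as well.
    # Carriage returns and all binary bytes end a run.
    pattern = re.compile(rb"[\x20-\x7e\t]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]

# Hypothetical stand-in for the bytes of an old word processor file.
sample = b"\x00\x01Dear Mom,\x00\xff\x02hi\x00How are you?\x03"
for run in extract_ascii_runs(sample):
    print(run)
```

Short runs such as "hi" are dropped on purpose, since in a real binary file most two- or three-byte runs are noise rather than text. Fonts and formatting are lost either way.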

I am sure that some of my conservative computer science friends are chortling at me for entrusting my memories to anything but ASCII. But I like styles, margins, and tab stops, and I don’t see why something so simple has to be encoded in such a complex manner.

What is the reason for this? It is because the documents outlive the software/hardware combination that created them.

Searching for Permanence

I have read some articles criticizing magnetic media, comparing it to paper and microfilm, pointing out that if not refreshed, magnetic media only lasts for about 75 years, versus the many centuries that paper can last. But this is eons compared to the lifetime of binary file formats! I have documents not even twenty years old, written in Microsoft Word version 3, that I cannot read anymore! I was able to recover the ASCII text from them using Emacs, but not the formatting or fonts.

Let’s take a moment to sing the praises of ASCII text. For those of you who don’t know, ASCII stands for American Standard Code for Information Interchange. Some email readers and word processors call this “plain text”. Its weaknesses are that it doesn’t support any fonts or formatting or international characters. But its amazing strength is that it seems impervious to the depredations of versioning and upgrading. (One portability problem it does have is carriage returns: Windows/DOS uses CR/LF, Mac uses CR, Unix uses LF, and VMS uses LF/CR. So it isn’t perfect.)
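
The carriage-return problem is at least easy to repair after the fact. As a small sketch (the function name `normalize_newlines` is invented here, and only the CR/LF, CR, and LF conventions are handled), converting everything to the Unix convention might look like this:

```python
def normalize_newlines(text: str) -> str:
    """Convert DOS (CR/LF) and old Mac (CR) line endings to Unix (LF)."""
    # Replace CR/LF pairs first so that each pair becomes one newline, not two.
    return text.replace("\r\n", "\n").replace("\r", "\n")

print(repr(normalize_newlines("one\r\ntwo\rthree\nfour")))
```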

What’s one to do if one wants to preserve styles (italics, boldface, superscript, subscript) and paragraph formatting (line spacing, justification, tab stops, hanging indents), as well as different fonts and sizes? The purpose here is to be able to author and save documents containing roughly the same content as a newspaper column might have. I’ll discuss graphics formats later. Are there any formats that might withstand upgrading, which are built on top of ASCII? It’s also important that the format permits the editing and continued authorship of the document. PostScript is a great portable document display format built on top of ASCII and universally understood, but it does not work well for storage and editing. It is generally used as a temporary transmission between a printer driver and a printer. It is also too page-oriented, that is, too concerned with the actual placement of items on the page, rather than with original authored input. (PDF has the same advantages and disadvantages, but is more widespread, especially for electronic distribution.) This leaves about three candidates.

Candidates for Text Document Storage

  1. Structured text: This is simply the use of some conventions in a regular ASCII text file. Frequently the choice of Usenet users, since Usenet posts are still just bare ASCII. The conventions are things like: indicating italics by surrounding words with asterisks, using pluses or asterisks for bullets, and so forth.

    It has all the advantages of ASCII, with the only disadvantage being that it doesn’t render in any letter-quality way. (You can, however, use docutils to convert it to HTML, then print it from a browser.) There are not many editors that make it easy to edit and continue authoring these documents, because you have to maintain all the line formatting yourself: if you add words in the middle of an existing line, you have to manually reformat the rest of the paragraph. Editors like Emacs sometimes have modes that greatly simplify this process.

  2. Rich Text Format (RTF): This is my current favorite, for three reasons: (1) when I was a computer lab monitor at the U of A, RTF seemed to be the most capable interchange format, preserving text qualities the best between word processors and platforms; (2) of all my 20-year-old word processing documents, the RTF ones stood the test of time the best, maintaining my fonts, indentation, styles, etc.; (3) it’s the native storage format of Mac OS X’s TextEdit program, and I’m a big Mac OS X fan. It is also the native format of Windows’ WordPad, which should help keep it from vanishing.

    One thing I have against it is that it’s complex. It’s not easy to imagine writing my own program that could recover RTF. It’s nice that many exist, but it seems like a technology that I’m dependent on, less like an open and readable format.

    I also discovered to my shock that an RTF document containing bullets at the front of each paragraph, created on my Mac, did not translate the bullets when imported in StarOffice for Solaris. Why? Because a bullet is not part of 7-bit ASCII, RTF “punts” and simply encodes the source platform’s own extended-ASCII code for that character.

    You are probably very familiar with this same problem in another context, when viewing web pages that seem to have weird question marks where apostrophes and quote marks ought to be. This is because HTML generators punt in the same way. Your word processor has “smart quotes” turned on, which of course makes it look “dumb” to anybody reading it on a different platform. This particular document’s quote marks should look good on any platform, because it is encoded with UTF-8 text encoding. What is this? Next item...

  3. Unicode: This is the heir apparent to ASCII, particularly the UTF-8 encoding of it. In ASCII, each character fits in a single byte, and only the lower 7 bits of the 8 bits in a byte are truly standard. This results in 2^7 = 128 total standard characters. In Unicode, there are two bytes per character, resulting in 2^16 = 65,536 total standard characters! This is enough to accommodate pretty much all of the orthography of every written language. Plus all kinds of neat math symbols, typesetting symbols, and so on. The most obvious way to write Unicode is to actually use two bytes per character. However, this is not backward compatible with the reigning champ, ASCII. The UTF-8 encoding of Unicode is an excellent solution to this: if the top bit of a character is 0, it assumes the next 7 describe an ASCII character; otherwise, it describes the extended Unicode character using more than one byte. It gets more complicated from there, but all you have to know is that ASCII text still looks the same when viewed as UTF-8. And someone looking at Unicode using an ASCII text editor will be able to read most of the text (if it’s American/European).

    Unicode is the winner when considering how to store bullets, em-dashes, “smart quotes”, accents and other diacritic marks. However, that is all Unicode is about. It does not store font, style, or margin information.

  4. Hypertext Markup Language (HTML): If you don’t know what this is (and who are you?), this is the format for most of the documents on the World Wide Web. It was invented for people like me. One great thing about HTML is that it is possible to author it by hand, and that its complexity scales. In other words, simple documents look simple and the more complex tags are only needed as your document increases in complexity. Because of this, simple HTML documents should have tremendous longevity. I don’t know about the extremely complex documents: if the software to render them ever perishes, those documents may be nearly unrecoverable as well.

    Note that HTML documents can be written in Unicode, getting, perhaps, the best of both worlds.

    HTML documents can also be accompanied by a Cascading Style Sheet (CSS) document, allowing you to separate the font and style attributes of a document from its semantic content. This is nice, and resembles the use of named style sheets in a program like Word. As with HTML itself, CSS does not need to be nearly as complex as what you normally run across on the Web or what an HTML editor generates.

    The problem with authoring HTML by hand is that it can often be difficult to see whether you are writing it correctly, so you have to keep displaying it in an HTML viewer. Worst of all, common characters like <, >, and & have to be carefully escaped because they are part of HTML itself.

    The solution is to use an HTML WYSIWYG editor, like Netscape’s Composer, Microsoft’s FrontPage, or Adobe PageMill. The problem is that many of these editors, in trying to work around some annoying browser-layout problems, emit HTML that is as complex as RTF or more so. HTML is also a little too page-agnostic, so that you can’t set tab stops at certain inch positions, or have a margin that wraps at 6 inches, or have a page break, etc. Whether you consider this an advantage or disadvantage may depend on what you’re writing and how it might be published.

  5. XML (Extensible Markup Language): This is a sibling of HTML; their common parent is SGML, or Standard Generalized Markup Language. XML leaves it up to the user to define what their tags mean. More on what this means in the next section.
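
The bullet problem from the RTF item and the backward compatibility of UTF-8 from the Unicode item can both be seen at the byte level. The following is a minimal Python sketch; the helper names `decode_byte_as` and `utf8_bytes` are invented for illustration, and the codec names are the ones Python uses for the Mac and Windows extended-ASCII sets.

```python
def decode_byte_as(byte_value: int, codec: str) -> str:
    """Interpret a single byte under the given platform encoding."""
    return bytes([byte_value]).decode(codec)

def utf8_bytes(text: str) -> list:
    """Return the UTF-8 encoding of text as a list of byte values."""
    return list(text.encode("utf-8"))

# 0xA5 is the bullet in the Mac's extended ASCII, but a yen sign on Windows.
print(decode_byte_as(0xA5, "mac_roman"))  # bullet
print(decode_byte_as(0xA5, "cp1252"))     # yen sign

# Plain ASCII is unchanged in UTF-8: one byte each, top bit 0.
print([hex(b) for b in utf8_bytes("A")])
# The bullet (U+2022) takes three bytes, each with the top bit set.
print([hex(b) for b in utf8_bytes("\u2022")])
```

Because the same byte means different things on different platforms, a format that stores the raw byte is only readable where it was written. Storing the character as UTF-8 removes that ambiguity.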

My XML Experience

As a computer programmer, I have already dabbled quite a bit in XML, using it primarily as a medium of information transfer from one computer program to another. As such, it is not particularly better than other methods, with the exception that one hopes the XML content may be parsed by an increasing number of programs in the future.

But recently I had the chance to use XML for the sort of task it was originally designed to solve. At my church, I am in charge of taking the monthly newsletter and posting it on our website. This has turned out to be an amazingly tedious job, since the newsletter is not originally authored and edited in any kind of reusable electronic format. Rather, the primary purpose of the newsletter is to be printed and mailed out, in a format that roughly resembles a bulletin board, with items of different categories appearing in no particular order, interspersed with clip art, and using different fonts and formats in order to capture attention.

None of this is useful or valuable for the web version, which needs a calendar with events hyperlinked into different sections of the newsletter, accompanied by a navigation bar on the left, which allows the viewer to visit the different sections, grouped by category. The difference in format is due to the difference in use: the printed newsletter is meant to be read front to back, with little attention-grabbing details here and there. The web version is meant for quick reference.

Every month the church secretary mails me the document as a single HTML email file. The HTML is produced by Outlook Express, which is trying to preserve as much of the useless formatting as possible. I would go through this document paragraph by paragraph, cutting and pasting the text for each web-publishable item into its proper page, making up the index, and so on, a process rife with tedium and (therefore) human error.

I tried to write a Python program to parse the HTML, hoping that there might be clues in the HTML that would help me classify each paragraph. No chance. “How I wish,” I muttered to myself, “that instead of HTML tags, this document had tags that said, ‘this is a news item, this is an event, that is a date, and so on.’” No sooner had I uttered my wish than I realized that XML could grant it. Adding these tags in a proofreader-like pass would not be very difficult if I could start with simple, bare text.

That was the insight I needed. I wrote a quick Python program to strip the useless HTML of all of its useless tags, as well as transform the useless Outlook-generated HTML substitutes for smart double- and single-quotes into the clean ASCII equivalent, leaving me with what I now considered the original source document, in bare 7-bit plain ASCII. The HTML quote substitutes are something called HTML entitydefs, which look like this: &#213; The number between the pound sign and the semicolon is the extended (nonstandard) ASCII number for the platform on which the HTML entitydef was generated. No good.
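
A stripping program along these lines can be sketched with the HTML parser in Python's standard library. This is not the original program, only an illustration: the class name `TagStripper`, the function `strip_html`, and the two lookup tables are assumptions, and the numeric codes listed are the Mac (210 to 213) and Windows (145 to 148) extended-ASCII positions of the curly quotes.

```python
from html.parser import HTMLParser

# Assumed mapping from platform-specific numeric references to plain ASCII.
QUOTE_REFS = {
    "210": '"', "211": '"', "212": "'", "213": "'",  # Mac curly quotes
    "147": '"', "148": '"', "145": "'", "146": "'",  # Windows curly quotes
}
# A few named entities; anything unknown becomes "?".
NAMED_REFS = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "nbsp": " "}

class TagStripper(HTMLParser):
    """Collect only the text of a document, discarding every tag."""

    def __init__(self):
        # Keep character references unconverted so they can be mapped by hand.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_charref(self, name):
        self.parts.append(QUOTE_REFS.get(name, "?"))

    def handle_entityref(self, name):
        self.parts.append(NAMED_REFS.get(name, "?"))

def strip_html(html: str) -> str:
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

print(strip_html('<p><font face="Arial">It&#213;s &#210;fine&#211;</font></p>'))
```

Unknown references are replaced with a question mark rather than dropped, so that anything the tables miss stays visible during proofreading.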

Next, I defined a set of XML tags that were interesting to me, and made one pass through the document, marking it up. That’s what the ML stands for in all these languages, and what their original purpose tended to be: Markup Language. Having finished this, I then wrote a Python program to parse the XML and produce a website based on it, auto-generating the indexes, navigation bars, and a calendar, based on dates parsed from the event descriptions. When I wish to improve the look of the website in the future, all I have to modify is the program; all of the new source documents, written in XML, can be “re-rendered” with the new program.
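
To make the idea concrete, here is a small sketch of what such a marked-up source and its parser might look like, using the ElementTree module from Python's standard library. The tag names (`newsletter`, `event`, `news`, `title`), the attributes, and the sample items are all hypothetical; the real newsletter markup is not shown in this article.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical source document with self-defined, semantic tags.
SAMPLE = """<?xml version="1.0" standalone="yes"?>
<newsletter month="2002-10">
  <event date="2002-10-13" category="youth"><title>Pumpkin carving</title></event>
  <news category="missions"><title>Letter from abroad</title></news>
  <event date="2002-10-06" category="worship"><title>Potluck</title></event>
</newsletter>"""

def build_calendar(xml_text: str) -> list:
    """Return (date, title) pairs for every event, sorted by date."""
    root = ET.fromstring(xml_text)
    events = [(e.get("date"), e.findtext("title")) for e in root.iter("event")]
    return sorted(events)

def group_by_category(xml_text: str) -> dict:
    """Group item titles by category, as a navigation bar would need."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for item in root:
        groups[item.get("category")].append(item.findtext("title"))
    return dict(groups)

print(build_calendar(SAMPLE))
print(group_by_category(SAMPLE))
```

The rendering step (turning these structures into HTML pages) is left out; the point is that the calendar and the navigation groups fall out of the markup directly, with no guessing about what each paragraph is.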

Best of all, someone who doesn’t even have access to my XML-to-website Python program can read the source document and infer its structure pretty easily. So I consider XML to be a successful solution for this kind of problem, resulting in source documents that should survive the ravages of time pretty well.

Candidates for Graphics Document Storage

It may appear that this problem has already been solved via the World Wide Web, with the widespread use of the GIF and JPEG formats. GIF is good for cartoons, line drawings, and the like; JPEG is good for photos and their kin. These formats are great for presentation, but not good for saving images in a form in which you could imagine resuming authoring and editing. What I really want in most cases is the ability to save my vector drawings in a portable vector format. PostScript again comes close, because its vector output scales very well, but there is a paucity of good WYSIWYG PostScript editors. On my Mac, I find the PICT format valuable since it retains vector data. But PICT does not make me as comfortable as PostScript: it’s not based on ASCII. Worse, it only lives on the Mac platform.

Conclusions and Recommendations

  1. All formats should be based on 7-bit ASCII. The 7-bit prejudice is a holdover from the loss of the 8th bit in transmission over certain old networks. It’s no great loss, however, because only the lower 7 bits of ASCII contain the printable characters, and that’s what we’re interested in. It’s also good if the format doesn’t depend on the exact carriage-return protocol, since this is the one thing that makes 7-bit ASCII non-portable. (NOTE that e-mail and ftp know how to keep this straight between platforms.)
  2. PostScript is good for diagrams, vector data, and printed pages. It is not good for word processing. It is possible to write PostScript, but it is page-oriented, which cramps your style: all line breaks and page breaks have to be managed by you.
  3. RTF and HTML + Unicode are probably the two best choices for word processing documents. HTML “scales better” than RTF in many ways, but is printed-page-agnostic. RTF is nice except that it isn’t nearly as platform-independent as it ought to be. RTF + Unicode would be excellent, but I can’t seem to get it to work with Mac OS X’s TextEdit. No matter what I do, it only puts out the Mac-specific extended ASCII. This doesn’t mean that RTF + Unicode doesn’t exist; I just haven’t been able to try it firsthand yet.
  4. XML is an interesting choice, one that I have used before. It does require you to be a bit of an expert. The idea, with XML, is that there is no way to abstract all possible authorial intents, so XML lets you define your own semantic hierarchy. Resembling HTML in every important way, but allowing you to define your own tags (like <section name="notes">), XML might be a good answer for document storage. The problem with it is that you end up having to be enough of a computer scientist to write your own parser for it. (cf. my own XML experience.) Interesting side note: when XML documents start with the line <?xml version="1.0" standalone="yes"?>, which declares no other encoding, they are by default UTF-8 encoded.

Putting It Into Practice

This web page, as well as my main page, is now written in UTF-8 Unicode + HTML, written with Netscape 7’s Composer, which, on Mac OS X, is quite nice. The steps I have to go through are:

  1. When I make a new Composer document, I go through the menus View > Character Coding > Unicode (UTF-8). This places the following META tag at the top of my HTML document: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> This instructs web browsers that use it to view it with the UTF-8 encoding. As a result, I can use my smart quotes and em-dashes in this document.
  2. Under Mac OS X, I go to the International section of System Preferences and enable U.S. Extended. This allows me to choose it from the menu bar. It means that when I type my em-dashes and smart quotes and accented characters, the Unicode character is inserted instead of the Mac extended ASCII character.
  3. When I upload the UTF-8 pages to my web server over FTP, I have to ensure that the upload is done with Binary as the mode, instead of ASCII (so that it does not assume 7-bit clean), and Raw Data, so that it does not send the Mac resource fork.
So far, so good! The only thing I wish Composer would do would be to automatically insert the smart quotes. For this page, I had to do all the smart quotes manually. For the purposes of an experiment like this I don’t mind it, but I hope my TextEdit gets better in the future.