2004.01.09

 

Nothing as robust as a paper archive

by Karel Thönissen

Most people are convinced of the importance of good archiving, but we seem to have forgotten the valuable lessons from our paper past now that we have entered the digital era. Paper has its own problems: it is bulky and tends to deteriorate rapidly over time. But digital bits have even more problems, and they deteriorate even more rapidly. Sure, digital information does not require much shelf space, and it is possible to keep the digital information of a document vital for eternity with some minor effort, but here the good news ends.

Anyone reading the newspapers carefully is aware of the problems that archives face. Paper documents are vulnerable to humidity and changing environmental conditions. Solvents in plastic sheeting, poor paper quality, etc. mean that some documents, sometimes less than 75 years old, should be considered permanently lost. However, the situation for digital information is far more concerning.

NASA has lost the ability to read the enormous amounts of data collected during the moon missions. The tapes are still there. The quality of the carrier substrate is probably still fine. And the integrity of the bits? We do not know. The application used to interpret the data is lost permanently. Even if this application were still around, would we be able to find a 1960s mainframe computer to run the software? If we could find such a machine in a retro-computing museum, would we still know the exact settings and parameters to run the operating system and the applications? Could we still find qualified personnel to run the machine and all the software?

Everyone who has owned a simple PC since the late eighties will have experienced problems with keeping his digital documents alive. The fast developments in computing are a mixed blessing: vintage machinery has ended up on the scrapheap, byte rot has corrupted the data, the licenses for the applications and the OS have expired, backward compatibility of the OS is only marginal, and modern machines cannot read the ancient media. Valuable information is buried inside proprietary formats, in files that are scattered all over the machine or over the local area network. Everything is somewhere on the machine, but nobody knows where or how. But the one thing that is sure is that the information is tossed out of the window when the computer is ultimately scrapped together with the hard disks that contain the documents.

It seems as if the bulkiness of paper is exactly the reason why this problem is less pressing for paper. Paper by and large is not a machine-readable medium; it is intended to be read by people. Therefore, information is encoded with little dots of black ink on white paper. There is no race to store more and more information on the paper, because there is only so much our eyes can see. Elaborate encryption and compression algorithms also miss the point, because Moore's Law does not apply to the human brain, and we are stuck with a device whose development more or less stopped thousands of years ago. Moreover, natural language is fairly robust to change. There are no revolutionary changes in natural languages. Sure, I would have a hard time reading 12th century Dutch, but there are enough documents between the 12th century and modern Dutch to create a sort of transition path between the two versions of the language. As a consequence, natural language on plain old paper is enormously robust to deterioration.

From an archiving point of view, even its bulkiness has two beneficial properties. First, we hardly ever throw away paper documents by accident, i.e. they do not disappear by themselves. The law of conservation of matter still applies to paper documents. It takes an idiot (unfortunately I know some sad cases of these), a fire, or a flood to destroy an archive. That is when the information has become old and valuable. Second, the special features of paper also make archiving easier when the documents are relatively young: since paper takes space, we must throw away most of the documents we handle. However, we do this when we still know whether the document is relevant or not. For digital documents this sifting is often put off for too long, so that in practice the decision is postponed indefinitely. This does not make for a useful archive of knowledge; it is a bit dump, only relevant for forensic investigators with enough time on their hands to read everything.