2005.03.08

 

The Battle between Archivists and Publishers

by Karel Thönissen

Yesterday I had a discussion with Peter Walgemoed of Carelliance about archiving and the different requirements of archivists and publishers. Both groups are proponents of storing the information that is produced in institutions, but for very different reasons.

Archivists want to store the original documents produced by the organisation so that they can be retrieved in the future and used for historic research. For example, archives are used for proving a legal position, for scientific research of cases, or for historic research in the narrower sense. For this to be possible, archivists have learnt that it is important to keep originals, or copies that retain as much information from the originals as possible. Digital copies are accepted and introduced for (to exaggerate a little) no other reason than the reduction of size. Given the purpose of a traditional archive, archivists would really love to keep everything and not make selections. Historically, this was prevented by the costs of physical storage, but the digital revolution has changed all that. Sure, digital archives add other advantages such as speed of retrieval and sometimes advanced searching, but these are less important given the goal of archiving: keeping documents for our descendants. On that time scale, speed of retrieval is a lesser issue.

Given this goal of archiving, it should not come as a surprise that the favourite document format is a bitmapped copy of the document using a lossless compression algorithm or, preferably, no compression at all. The problem with other document formats and compression algorithms is that nobody can guarantee that the software and machinery needed to view the document will still be around and operable more than ten years from now. Uncompressed bitmaps are the best guarantee in this respect. Using them reduces preservation to a much simpler problem: make sure that the bits do not deteriorate and always keep the documents on a current medium. It is a simple problem of controlled copying.
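To make "controlled copying" concrete, here is a minimal sketch in Python of moving a document to a new medium and checking that not a single bit has changed. The function names `sha256_of` and `controlled_copy` are hypothetical, and the choice of SHA-256 as the fingerprint is an assumption made for illustration; any reliable checksum would serve the same purpose.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def controlled_copy(source: Path, target: Path) -> str:
    """Copy source to target and verify that the copy is bit-identical.

    Hypothetical helper: raises IOError if the digests differ,
    otherwise returns the digest so it can be stored with the archive.
    """
    expected = sha256_of(source)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)
    actual = sha256_of(target)
    if actual != expected:
        raise IOError(f"copy of {source} is corrupt: {actual} != {expected}")
    return expected
```

Storing the returned digest alongside the document would allow the same check to be repeated at every later migration, so deterioration of the bits is noticed rather than silently copied along.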

Often the same institutions that have large archives also publish information. Think of academic hospitals, universities, governments, etc. The publishers in the organisation are also interested in keeping information, but from a totally different angle. Original content is a source of income. Therefore, the content must be preserved. The same content can be used for articles, books, web pages, newspapers, CDs, DVDs, etc. However, the original format is often not important. Searchable text is worth much more than bitmapped facsimiles. To put it bluntly: the archivist is interested in format, the publisher in content.

They live together in the same institution, the archivist and the publisher. Maintaining both a historic archive and a content management system is too expensive, because all material must then be selected, filtered, classified, etc. twice. So from management there will be a strong push towards a single system. Who will win this battle? The archivist with his historically precise system based on format, or the publisher with his content management system based on the more malleable text?

The obvious answer is: the publisher. He brings in the money; the archive only costs money. Archives can be commercially exploited, e.g. newspaper archives, but that is exactly what publishers do, not archivists. For these historic documents to be valuable it is essential that they can be searched, published on various media (including the web), referred to from electronic documents such as web pages, etc.

Not that I am proposing to abolish traditional archiving. Far from it; in an earlier article I even presented the provocative idea that paper archives are better than digital archives. I am not taking a position, just explaining the forces at work and my expectation of how things will develop. Maybe there is a way out of this dilemma between historically justified archives and commercially justifiable content management systems, but such a solution is not simple. There are very difficult problems to be solved here that are completely overlooked. It certainly is not just a matter of adopting XML/XSD and similar techniques. I shall return to this subject later...