Introduction

The KB’s e-Depot guarantees secure long-term storage of digital material. Long-term accessibility, however, is another matter. Research has shown that the storage format in particular – the structure in which the data of a digital object is stored on the carrier – is a point of concern. Many formats are complex and are not based on open standards or specifications. Because the KB aims to preserve the original object, only a limited number of preservation strategies can be applied.

Together with IBM Netherlands, the KB has developed a new preservation strategy based on the Universal Virtual Computer (UVC). With the UVC it is possible to read files without adapting them and without the original hardware or software. As a result, JPEG images can now be viewed independently of changes in technology. The method was subsequently extended to GIF images as well. The UVC project took place between September 2003 and April 2004.

How it works

Every computer file can be revived with the UVC-based preservation method. Text documents, sound samples, images, spreadsheets and videos can all be reconstructed, provided that a decoder for that particular format is available for the UVC. The concept of the UVC was developed by IBM researcher Raymond Lorie. Based on this concept a working UVC-based preservation method was developed, consisting of four components (see figure 1):

  1. Universal Virtual Computer (UVC)
  2. Format decoder
  3. Logical Data Schema (LDS)
  4. Viewer



Figure 1: components of the UVC

By analogy with today’s computer architecture, the UVC is a virtual representation of a simplified computer. Because it is so simple, the UVC can be made to work on any conceivable computer system. In essence, an extra layer is created on top of ever-changing hardware and software, and this layer offers a stable platform to UVC programmes. Detailed instructions will enable future developers to rebuild a UVC at any time.
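The idea of a small, stable machine that can be rebuilt from a written description can be illustrated with a toy interpreter. This is only a sketch: the four instructions (LOAD, ADD, JNZ, HALT) and the register names are hypothetical and are not the real UVC instruction set. It merely shows that a programme written for a simple virtual machine runs wherever the few lines of the interpreter can be reimplemented.

```python
# Hypothetical toy machine: the instruction names below are illustrative
# assumptions, not the real UVC instruction set.
def run(program, registers=None):
    """Execute a list of (opcode, *operands) tuples and return the registers."""
    regs = dict(registers or {})
    pc = 0  # program counter
    while pc < len(program):
        op, *args = program[pc]
        if op == "LOAD":      # LOAD reg, constant
            regs[args[0]] = args[1]
        elif op == "ADD":     # ADD dest, src: dest = dest + src
            regs[args[0]] = regs[args[0]] + regs[args[1]]
        elif op == "JNZ":     # JNZ reg, target: jump if reg is not zero
            if regs[args[0]] != 0:
                pc = args[1]
                continue
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode: {op}")
        pc += 1
    return regs


# Sum the integers 4, 3, 2, 1 using only the instructions above.
program = [
    ("LOAD", "total", 0),
    ("LOAD", "n", 4),
    ("LOAD", "minus_one", -1),
    ("ADD", "total", "n"),        # index 3: start of the loop
    ("ADD", "n", "minus_one"),
    ("JNZ", "n", 3),
    ("HALT",),
]
result = run(program)
```

The programme itself is plain data; only the short run function would have to be rewritten for a future platform.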

The format decoder is a programme developed for the UVC, by which a particular file format can be deciphered into a so-called Logical Data View (LDV). This LDV describes in detail how the digital object is structured. For instance, raster-based images are described pixel by pixel, whereas spreadsheets are defined by their cells and formulas. A UVC format decoder is required for each file format. Once a decoder is available, however, it can be used for all files in that format. Decoders should be developed while a format is still in use, so that the decoded version can be compared with the original as rendered by current software. It is therefore important to start developing decoders now for formats that are at risk of becoming obsolete.
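What "deciphering a format into an LDV" amounts to can be sketched on a deliberately simple format. The example below assumes the plain-text PPM image format (type P3) instead of JPEG, is written in ordinary Python rather than as a UVC programme, and uses hypothetical element names (width, height, max_value, pixel). It shows the image being described pixel by pixel, as in the paragraph above.

```python
def decode_ppm(text):
    """Decode a plain-text PPM (P3) image into a flat list of (tag, value) elements."""
    tokens = []
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # drop comments
        tokens.extend(line.split())
    if not tokens or tokens[0] != "P3":
        raise ValueError("not a plain PPM file")
    width, height, max_value = (int(t) for t in tokens[1:4])
    samples = [int(t) for t in tokens[4:]]
    if len(samples) != width * height * 3:
        raise ValueError("unexpected number of colour samples")
    # Hypothetical LDV layout: header elements first, then one element per pixel.
    ldv = [("width", width), ("height", height), ("max_value", max_value)]
    for i in range(0, len(samples), 3):
        red, green, blue = samples[i:i + 3]
        ldv.append(("pixel", {"red": red, "green": green, "blue": blue}))
    return ldv


# A 2 x 1 test image: one red pixel followed by one blue pixel.
sample = "P3\n# tiny test image\n2 1\n255\n255 0 0  0 0 255\n"
ldv = decode_ppm(sample)
```

The same function works for every file in that format, which is the sense in which one decoder serves a whole format.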

The third UVC component is a Logical Data Schema (LDS), which determines which elements can occur in a particular format and how these are related. For instance, raster-based images are defined by pixels, and each pixel is composed of red, green and blue values. Furthermore, the LDS contains information on the semantics of the different elements. What exactly is the colour blue, and how can this meaning be captured so that future users can see the authentic colours? This type of information is described in an LDS, one of which has to be made for every file format.
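As a rough illustration of what such a schema records, the sketch below models an LDS as a small dictionary and checks one LDV element against it. The structure, the element names and the wording of the "meaning" entries are all assumptions made for this example; the real LDS notation is not shown in this text.

```python
# Hypothetical schema for one raster pixel; names and ranges are assumptions.
IMAGE_LDS = {
    "pixel": {"children": ["red", "green", "blue"]},
    "red": {"range": (0, 255), "meaning": "red intensity, 0 = none, 255 = full"},
    "green": {"range": (0, 255), "meaning": "green intensity, 0 = none, 255 = full"},
    "blue": {"range": (0, 255), "meaning": "blue intensity, 0 = none, 255 = full"},
}


def validate(tag, value, schema=IMAGE_LDS):
    """Return a list of problems found when checking one element against the schema."""
    rule = schema.get(tag)
    if rule is None:
        return [f"unknown element: {tag}"]
    problems = []
    if "children" in rule:
        # Structural rule: which elements may and must occur inside this one.
        for child in rule["children"]:
            if child not in value:
                problems.append(f"{tag} is missing {child}")
        for child_tag, child_value in value.items():
            if child_tag not in rule["children"]:
                problems.append(f"{child_tag} is not allowed inside {tag}")
            else:
                problems.extend(validate(child_tag, child_value, schema))
    else:
        # Leaf rule: the value must be an integer within the declared range.
        low, high = rule["range"]
        if not isinstance(value, int) or not low <= value <= high:
            problems.append(f"{tag}={value!r} is outside {low}..{high}")
    return problems


ok = validate("pixel", {"red": 255, "green": 0, "blue": 0})
bad = validate("pixel", {"red": 300, "green": 0})
```

Here ok is an empty list, while bad reports both the missing blue value and the out-of-range red value.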

Finally, the LDV is translated into an understandable representation by a viewer, the fourth and final component of the UVC-based preservation method. Figure 2 depicts this process schematically. First of all, an unknown digital document is deciphered by a decoder that runs on the UVC, which results in a Logical Data View (LDV). Secondly, using a viewer and the Logical Data Schema (LDS), the LDV is translated into a representation.


Figure 2: the UVC-based preservation method for digital objects
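The second step, turning an LDV into something a person can look at, can be sketched in a few lines. The viewer below is hypothetical: it takes a tiny hand-written LDV and an equally small LDS fragment (both invented for this example) and renders the image as rows of characters, with "#" for dark pixels and "." for light ones.

```python
# Hypothetical stand-ins: a tiny LDS fragment and a 2 x 2 LDV.
LDS = {"channels": ("red", "green", "blue"), "channel_max": 255}
BLACK = {"red": 0, "green": 0, "blue": 0}
WHITE = {"red": 255, "green": 255, "blue": 255}
LDV = [("width", 2), ("height", 2),
       ("pixel", BLACK), ("pixel", WHITE),
       ("pixel", WHITE), ("pixel", BLACK)]


def render(ldv, lds):
    """Translate a flat list of (tag, value) elements into rows of characters."""
    fields = {tag: value for tag, value in ldv if tag != "pixel"}
    pixels = [value for tag, value in ldv if tag == "pixel"]
    width, height = fields["width"], fields["height"]
    channels = lds["channels"]
    threshold = lds["channel_max"] / 2  # the schema says how large a value can be
    rows = []
    for y in range(height):
        row = ""
        for x in range(width):
            pixel = pixels[y * width + x]
            # Brightness is approximated by the plain average of the channels.
            brightness = sum(pixel[c] for c in channels) / len(channels)
            row += "#" if brightness < threshold else "."
        rows.append(row)
    return rows


representation = render(LDV, LDS)
```

The viewer never touches the original file format; it only needs the LDV and the schema that explains it, so it can be rewritten for any future display technology.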

The KB solution

At the moment, publications in Portable Document Format (PDF) make up the greater part of the publications stored in the e-Depot of the KB. Because of this, and based on the outcomes of the ‘proof of concept’ of the Long-Term Preservation (LTP) study by the KB and IBM, the primary goal of this project was to create a safety-net solution for PDF publications and to develop a UVC, LDS and viewer. As PDF is a complex format and the project’s time span was limited, it was decided not to develop a format decoder but to introduce an intermediate step: the preservation processor.

This preservation processor converts a PDF file into a series of JPEG images, each image representing one page of the original PDF at a resolution of 300 dpi (dots per inch). In this manner, every page image can be reconstructed using the UVC and a JPEG decoder. Although some of the original aspects of PDF publications are lost when using this method, it is the only method so far that guarantees long-term accessibility of PDF files. The choice of JPEG as a target format also means that, in principle, this approach can be applied to any other current format that can be rendered as an image. In figure 3 all steps for this PDF-to-representation approach are shown.

Figure 3: from PDF to representation
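To give a feel for what one page image at 300 dpi involves, the following sketch computes the pixel dimensions of a rendered page. The A4 page size (210 x 297 mm) is an assumption chosen for illustration; the page sizes of the stored publications are not stated here.

```python
def page_pixels(width_mm, height_mm, dpi=300):
    """Return (width, height) in pixels for a page rendered at the given resolution."""
    mm_per_inch = 25.4
    return (round(width_mm / mm_per_inch * dpi),
            round(height_mm / mm_per_inch * dpi))


# Assumed page size: A4, 210 x 297 mm.
a4 = page_pixels(210, 297)

# Size of the uncompressed image at three bytes (red, green, blue) per pixel.
raw_bytes = a4[0] * a4[1] * 3
```

Under these assumptions one page becomes an image of 2480 x 3508 pixels, roughly 26 million bytes before JPEG compression is applied.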

Apart from JPEG, a format decoder was also developed for images stored in Graphics Interchange Format (GIF). The next step will be to extend the UVC-based preservation method to Tagged Image File Format (TIFF) and PDF as well. Figures 4 and 5 show a reconstruction of a JPEG image using the UVC-based preservation method.

Figure 4
Figure 5

Findings

Evaluation of the method shows the UVC to be a promising technique. JPEG and GIF images can be reconstructed in the future with the UVC-based preservation method. However, the method needs further development: decoders, LDSs and viewers must be created to make the UVC suitable for a wider range of digital material. In short, the method has the following advantages:

  • Long-term access is guaranteed
  • The original document is preserved
  • The method is hardware and software independent
  • No periodic actions are required (unlike migration)
  • Efficient: a single decoder serves all files of a given format

Future work entails improving performance, developing further decoders, LDSs and viewers, and providing support for developers who write programs for the UVC platform. The KB plans to take up these activities in the years to come, if possible in cooperation with other institutions.

Test it yourself

The UVC-based preservation method can be downloaded free of charge from the IBM Alphaworks website: http://www.alphaworks.ibm.com/tech/uvc

The download package contains a UVC developed in Java; two image format decoders (JPEG and GIF); an image viewer (with a built-in LDS for images); a general LDV viewer (with a built-in LDS for images); and a set of test images in JPEG and GIF format.

The UVC project was carried out by IBM Netherlands and the Koninklijke Bibliotheek, National Library of the Netherlands, with support from IBM Research Center Almaden, USA.

Further reading

Articles


Presentations

Contact

For more information, please contact Hilde van Wijngaarden.