Content
After selection, the material is analysed extensively, and the instructions for the digitization are established on the basis of that analysis. To this end, a production line will be set up within the KB so that large quantities of newspapers can be processed for digitization efficiently. The material will then be digitized: images, text files and metadata will be created for every page. For large projects, the KB always outsources the digitization.
Many uncertain factors play a role in the analysis of the material and the digitization. Some newspapers are not allowed to leave the library; others are spread across several institutions within and outside the Netherlands and are available only in fragments. Still others will require a lot of preparation because of the condition they are in. It is not yet clear how much will be digitized on site and how much externally.
The original material
The quality of the digital files depends not only on the quality of the digitization process but also on the condition of the original newspaper. Original newspapers are often poorly printed (for example, the ink has seeped through the page: ‘bleeding ink’), stained or torn.
Some newspapers in collection binders have been bound so tightly into the fold that it is difficult to digitize the individual pages. These problems occur in digitization both from microfilm and from originals. In the case of microfilm, the quality of the film carrier can also influence the quality of the digital files.
Original versus microfilm
The newspapers are digitized either from the original or from microfilm. Of all Dutch newspapers published between 1618 and 1995, it is estimated - based on data from the ‘Gemeenschappelijk Geautomatiseerd Catalogussysteem’ and Metamorfoze - that about 20% are available on microfilm. The quality of these microfilms is variable. Digitization from microfilm is faster and cheaper, but in general it produces lower-quality digital files than digitization from the original.
The KB is currently studying how suitable different types of microfilm are for digitization and OCR. The results of this research will be made available on this page at a later stage.
Digitization and OCR
Each page requires:
- one master image file;
- one or more derived files;
- several machine-readable text files of the page;
- descriptive metadata such as title and date;
- technical metadata about the creation of the digital files; and
- structural metadata about the arrangement of the newspaper and the page.
Digitization covers not only the scanning of the material but also the conversion of the image files into machine-readable text by means of Optical Character Recognition (OCR) and the addition of metadata. The better the quality of the image files of the newspaper pages, the more successful the OCR. The machine-readable text forms the basis of the project, and a good OCR result increases the accessibility of the collection. Considerable attention must therefore be paid to this process.
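The per-page deliverables listed above can be pictured as a simple record with a completeness check. The following is only a minimal sketch: the class `PageDeliverables`, its field names, the file names and the choice of `title` and `date` as required descriptive fields are assumptions made for illustration, not part of the KB specifications.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PageDeliverables:
    """Hypothetical record of what is delivered for one newspaper page."""
    page_id: str
    master_image: Optional[str] = None                      # one master image file
    derivatives: List[str] = field(default_factory=list)    # one or more derived files
    text_files: List[str] = field(default_factory=list)     # machine-readable text files
    descriptive: Dict[str, str] = field(default_factory=dict)  # e.g. title and date
    technical: Dict[str, str] = field(default_factory=dict)    # creation of the files
    structural: Dict[str, str] = field(default_factory=dict)   # arrangement of paper/page

    def missing_items(self) -> List[str]:
        """Return a list of deliverables that are still absent for this page."""
        missing = []
        if not self.master_image:
            missing.append("master image")
        if not self.derivatives:
            missing.append("derived file")
        if not self.text_files:
            missing.append("text file")
        # Assumption: title and date are the minimum descriptive fields.
        for key in ("title", "date"):
            if key not in self.descriptive:
                missing.append(f"descriptive metadata: {key}")
        if not self.technical:
            missing.append("technical metadata")
        if not self.structural:
            missing.append("structural metadata")
        return missing
```

A page for which `missing_items()` returns an empty list would count as complete under these assumed rules; a real acceptance check would follow the published specifications instead.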
Specifications for the image files
The quality of the image files is determined by the degree to which a scan is a faithful representation of the original. The quality is affected by various factors including bit depth, resolution, storage format and compression. Throughout the project, efforts are being made to attain measurable and 'objective' quality standards. Quality managers check the image files systematically. In consultation with the suppliers, agreements are made on the optimal fine-tuning and benchmarking of equipment and software.
Two types of image file are distinguished in the Databank of Digital Daily newspapers: master files and derivatives. The master files form the basis for all further processing. Derivatives are necessary for presentation on the Internet and as an 'intermediary' for the OCR.
The specifications for the master files and the derivatives are included in the specifications for the European Invitation to Tender for digitization and OCR. These specifications will also be made available on this page.
Layout analysis
The machine-readable text, like the metadata, will be supplied in XML. The relationship between the different files is defined in a concordance table. Newspaper headings, articles and other 'units' on a newspaper page are identified by means of a layout analysis. The separate elements, including text blocks, images and horizontal and vertical lines, are registered. Subsequently, OCR of the separate text blocks and an analysis of the content allow the different segments to be distinguished: articles, advertisements, captions, etcetera. By registering the coordinates of words, and if necessary of the separate symbols, search terms can be marked in the image (‘hit term highlighting’).
Segmenting the newspaper pages makes a collection searchable at article level, while the actual data remain stored at page level. Checking the layout analysis and merging articles that are spread over several pages are important but labour-intensive parts of the process.
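To illustrate how registered word coordinates make it possible to mark search terms in the page image, here is a minimal sketch. The `Word` record, the pixel-based coordinate fields and the function `highlight_boxes` are hypothetical names chosen for this example; the actual XML format and matching rules are not described here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Word:
    """One OCR word with its bounding box on the page image (assumed pixel units)."""
    text: str
    x: int
    y: int
    width: int
    height: int


def highlight_boxes(words: List[Word], term: str) -> List[Tuple[int, int, int, int]]:
    """Return the bounding boxes of all words that match the search term.

    Matching ignores case and surrounding punctuation; this rule is an
    assumption for the sketch, not a documented behaviour.
    """
    needle = term.casefold()
    boxes = []
    for word in words:
        cleaned = word.text.strip(".,;:!?'\"()").casefold()
        if cleaned == needle:
            boxes.append((word.x, word.y, word.width, word.height))
    return boxes
```

A viewer could then draw a translucent rectangle over each returned box on the derivative image shown to the user.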
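The combination of article-level search and page-level storage can be sketched as an index that maps search terms to articles, while each article only points to segments on stored pages. The names `Segment`, `Article`, `build_index` and `pages_for` are assumptions for illustration, as is the simple whitespace tokenization.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Segment:
    """A text block belonging to an article, located on one stored page."""
    page_id: str
    text: str


@dataclass
class Article:
    """An article assembled from one or more segments, possibly across pages."""
    article_id: str
    segments: List[Segment]


def build_index(articles: List[Article]) -> Dict[str, Set[str]]:
    """Map each lower-cased token to the ids of the articles that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for article in articles:
        for segment in article.segments:
            for raw in segment.text.casefold().split():
                token = raw.strip(".,;:!?")
                if token:
                    index[token].add(article.article_id)
    return index


def pages_for(article: Article) -> List[str]:
    """List the pages an article touches, in order of first appearance."""
    pages: List[str] = []
    for segment in article.segments:
        if segment.page_id not in pages:
            pages.append(segment.page_id)
    return pages
```

In this sketch a search hit identifies an article, and `pages_for` then tells the viewer which stored page files to retrieve, including continuation pages for merged articles.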