Search & retrieval

Search and retrieval covers all activities and technologies through which the digitized newspapers are made available and accessible to the user. The wishes of the future end users of the Databank of Digital Daily Newspapers will be inventoried and summarized by Digital Archiving and Networked Services (DANS). The outcome of this inventory will serve as the guiding principle for the search-and-retrieval functionality in the Databank. The report of this inventory will be made available on this page at a later stage.

For search and retrieval, indexes will be built of:

  1. the full text so that each word in the text can be retrieved; and
  2. the descriptive metadata, so that one can search for either the title of the newspaper, the date and/or the heading of the article.
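As an illustration of the two kinds of index listed above, the following is a minimal sketch of an inverted index in Python. It is not the indexing method used in the Databank; the sample articles, their identifiers and the function names are assumptions made for this example only.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(documents, field_name):
    """Map every token found in one field to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for token in tokenize(doc[field_name]):
            index[token].add(doc_id)
    return index


def search(index, word):
    """Return the sorted ids of all documents that contain the word."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical sample articles; all values are invented for illustration.
articles = {
    "a1": {
        "heading": "Harbour opened",
        "text": "The new harbour was opened today.",
    },
    "a2": {
        "heading": "Storm damage",
        "text": "A storm damaged the harbour wall.",
    },
}

full_text_index = build_index(articles, "text")    # index 1: every word in the text
heading_index = build_index(articles, "heading")   # index 2: a descriptive metadata field

print(search(full_text_index, "harbour"))
print(search(heading_index, "storm"))
```

Keeping a separate index per field is one simple way to let a user restrict a query to, for example, article headings only.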

The descriptive metadata contain:

  • Per newspaper title: the title, the year in which the newspaper title was published, the predecessors and successors of the newspaper title and geographic information about the location where the newspaper was published
  • Per issue (number of a newspaper): the date and the edition
  • Per page: the page number
  • Per article: the heading and the type of article
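The four levels of descriptive metadata listed above could be modelled roughly as follows. This is only a sketch: the class and field names are assumptions chosen for this example and do not reflect the actual Databank schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Article:
    heading: str
    article_type: str


@dataclass
class Page:
    page_number: int
    articles: List[Article] = field(default_factory=list)


@dataclass
class Issue:
    date: str      # date of the issue, here as an ISO string
    edition: str
    pages: List[Page] = field(default_factory=list)


@dataclass
class NewspaperTitle:
    title: str
    year_published: str
    place_of_publication: str
    predecessors: List[str] = field(default_factory=list)
    successors: List[str] = field(default_factory=list)
    issues: List[Issue] = field(default_factory=list)


def headings_on(newspaper, date):
    """Collect the headings of all articles in issues published on the given date."""
    return [
        article.heading
        for issue in newspaper.issues if issue.date == date
        for page in issue.pages
        for article in page.articles
    ]


# Hypothetical example data, invented for illustration.
paper = NewspaperTitle(
    title="Example Courant",
    year_published="1910",
    place_of_publication="Example Town",
    issues=[
        Issue(
            date="1910-05-01",
            edition="morning",
            pages=[Page(1, [Article("Harbour opened", "news")])],
        )
    ],
)

print(headings_on(paper, "1910-05-01"))
```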

To open up its collections, the KB works with a fixed infrastructure. This infrastructure makes use of open standards and proven working methods:

  • Dublin Core is used for the descriptive metadata. Custom elements may be necessary where specific metadata do not occur in Dublin Core. The descriptive metadata are stored in XML format.
  • MPEG21-DIDL is used for the structural metadata. These metadata record the hierarchical relationships present within the material: for example, a newspaper edition consists of pages and each page contains articles. MPEG21-DIDL also records which images and text files are related to one another.
  • The page layout (the different zones, such as images, columns and headings) is stored using the ALTO segmentation standard.
  • All files will be made accessible with persistent URLs. This means that the URL does not change if the physical storage location of the file changes. For these persistent URLs, a resolver (pdf) will be used that translates each URL to the physical file location and transmits the file requested to the user.
  • The text files and the descriptive metadata will be indexed by the Verity K2 search engine.
  • For search queries made from a web application, the SRU protocol will be used. This allows search queries to be included in a URL in a standard manner.
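To illustrate how a search query can be carried in a URL with SRU, the sketch below builds a searchRetrieve request in Python. The base URL is a placeholder, not the address of the Databank, and the CQL query and the `dc.title` index are assumptions made for this example. The parameter names (`operation`, `version`, `query`, `startRecord`, `maximumRecords`) are those defined by the SRU standard.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder endpoint; the real address of the SRU service is not given here.
BASE_URL = "http://example.org/sru"


def build_sru_url(cql_query, start_record=1, maximum_records=10):
    """Encode a CQL query and paging parameters into an SRU searchRetrieve URL."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return BASE_URL + "?" + urlencode(params)


# Assumed example: search one newspaper title for the word "harbour".
url = build_sru_url('dc.title = "Example Courant" and harbour')
print(url)

# Decoding the URL again recovers the original query string.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
```

Because the whole request is an ordinary URL, any web application can issue it with a plain HTTP GET and receive the result as XML.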

Improved searchability

OCR technologies for historic newspaper material are developing rapidly. The pilot project ‘Historische kranten in beeld’ achieved a word accuracy of about 60-70% [1]. Results for newspapers from the seventeenth and eighteenth centuries are expected to be worse because of the poor quality of the material, the different typefaces and the historic spelling [2]. The large quantity of files makes manual correction of the text unfeasible, so automatic processing techniques will have to be used.
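How word accuracy was measured in the pilot is not stated here. One common reading, assumed in the sketch below, is the share of words in a manually corrected reference text that the OCR output reproduces exactly and in order. The function name and the sample sentences are invented for this example.

```python
from difflib import SequenceMatcher


def word_accuracy(reference, ocr_output):
    """Fraction of reference words that appear unchanged, in order, in the OCR output."""
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    # Align the two word sequences and count the words in matching blocks.
    matcher = SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


# Invented example: two of the six words contain a recognition error.
reference = "the new harbour was opened today"
ocr_output = "the ncw harbour was opcned today"

print(round(word_accuracy(reference, ocr_output), 2))
```

With this definition, a word accuracy of 60-70% means that roughly one word in three cannot be found by an exact full-text search.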

Together with two external partners, Irion and Tilburg University, the KB is investigating methods and technologies to improve the OCR results for historic texts [3]. Research is being carried out into how historical lexicons can improve the OCR results. Technologies for automatic classification and automatic summarization are also being explored. Thesauri, synonym lists and spelling checkers can improve the searchability of the initial OCR output. As a rule of thumb, however, the possibilities for enhancing access to poor machine-readable text are often quite limited [1].
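The idea of correcting OCR output with a lexicon can be sketched as follows. This is not the method under investigation by the KB and its partners. It is a simple stand-in that replaces each unknown token with the most similar lexicon entry, using the string similarity of Python's standard `difflib` module. The lexicon, the cutoff value and the function names are assumptions made for this example.

```python
from difflib import get_close_matches

# Tiny invented lexicon; a real historical lexicon would hold many thousands of word forms.
LEXICON = ["the", "new", "harbour", "was", "opened", "today"]


def correct_token(token, lexicon, known_words, cutoff=0.8):
    """Return the token unchanged if known, else its closest lexicon entry if similar enough."""
    lowered = token.lower()
    if lowered in known_words:
        return token
    candidates = get_close_matches(lowered, lexicon, n=1, cutoff=cutoff)
    return candidates[0] if candidates else token


def correct_text(text, lexicon, cutoff=0.8):
    """Apply lexicon-based correction to every whitespace-separated token."""
    known_words = set(lexicon)
    return " ".join(
        correct_token(token, lexicon, known_words, cutoff) for token in text.split()
    )


print(correct_text("the harbovr was opcned today", LEXICON))
```

A high cutoff keeps the correction conservative: tokens with no sufficiently similar lexicon entry are left as they are, which avoids replacing names and rare words with unrelated entries.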

1. A. Verheusen and R. Zaat, ‘Tekstretrieval in krantencollecties’ in: Informatie Professional (2004), 11. URL: <http://igitur-archive.library.uu.nl/DARLIN/2005-0526-202104/VerheusenIP112004.pdf>
2. On seventeenth-century newspapers, see: R. Vos, ‘Oudste kranten vind je nu ook in Nederland’ in: Persmuseum Nieuws, 3 June 2003. URL: <http://www.persmuseum.nl/pdf/PM3.pdf>
3. Irion (URL: <http://www.irion.nl/>) and the ILK (Induction of Linguistic Knowledge, Tilburg University, URL: <http://ilk.uvt.nl/>)