Efforts to give end users optimal search and retrieval have so far focused mainly on the quality of the metadata. Descriptive metadata make it possible to search by title, author, date, etc. Searching through huge quantities of pages requires additional solutions. Software can convert the digital images into machine-readable text. The common term for this process is Optical Character Recognition (OCR).
Optical Character Recognition (OCR)
The quality of the OCR is determined by, among other things, the quality of the image files, the condition of the source materials and the spelling of the original text. Especially with older materials, the OCR software will recognize only a certain percentage of the characters correctly, so full-text searching will produce only limited results. Texts with historical spelling variants complicate the OCR as well. There are different ways to improve the usability of the OCR text, for instance by applying solutions for spelling variants, by automatic classification and by summarization of the texts.
A project closely involved in developing methods to improve the OCR of historical texts is IMPACT (IMProving ACcess to Text).
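The share of correctly recognized characters mentioned above can be estimated when a hand-corrected transcription of a page is available. The following is a minimal sketch, assuming character accuracy is defined as one minus the edit distance divided by the length of the correct text; the function names are illustrative and other definitions are in use.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Fraction of the ground truth that the OCR got right (assumed definition)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = levenshtein(ground_truth, ocr_output)
    # Clamp at zero: very noisy output can contain more errors than characters.
    return max(0.0, 1.0 - errors / len(ground_truth))


if __name__ == "__main__":
    print(character_accuracy("abcd", "abxd"))  # one wrong character out of four: 0.75
```

A score like this is only as reliable as the transcription it is compared against, so it is usually computed on a small, carefully corrected sample of pages.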
Techniques to provide access
If historical spelling variants can be identified, a user can search for a word in modern spelling (‘mens’) and retrieve results that include the same word in older spelling variants (‘mensch’). This considerably improves the searchability of historical texts.
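One way to picture this is query expansion: the search term is looked up in a lexicon of known variants and all forms are searched at once. The sketch below is an assumption about how such a lookup could work, not a description of any particular system. Only the pair ‘mens’ and ‘mensch’ comes from the text; the lexicon structure, the function names and the sample documents are invented for illustration.

```python
import re

# Hypothetical variant lexicon: modern spelling -> known historical spellings.
VARIANTS = {
    "mens": {"mensch"},
}

# Invented sample documents, keyed by an illustrative identifier.
DOCUMENTS = {
    "older": "De mensch wikt, maar God beschikt.",
    "modern": "Een mens is nooit te oud om te leren.",
    "other": "Het weer blijft wisselvallig.",
}


def expand_query(term, variants=VARIANTS):
    """Return the search term together with its known spelling variants."""
    key = term.lower()
    return {key} | variants.get(key, set())


def search(term, documents=DOCUMENTS, variants=VARIANTS):
    """Return the ids of documents containing the term or one of its variants."""
    wanted = expand_query(term, variants)
    hits = []
    for doc_id, text in documents.items():
        tokens = set(re.findall(r"\w+", text.lower()))
        if tokens & wanted:
            hits.append(doc_id)
    return sorted(hits)


if __name__ == "__main__":
    print(search("mens"))  # ['modern', 'older']
```

Expanding the query rather than rewriting the stored text leaves the original spelling intact, which matters when the historical form itself is of interest.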
Automatic classification divides a text into specific, predetermined classes (categories) that are content- and subject-based. Newspaper items, for instance, can be classified as belonging to politics, sport or culture, or as a news item, family notice, advertisement, etc.
Automated summaries can also increase the searchability of texts. They let a user quickly evaluate the contents of a text, they make automatic classification easier and they can improve the ranking of search results.
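As a rough illustration of assigning a text to predetermined classes, the sketch below scores a text against a keyword list per class and picks the best match. This is a deliberately simple stand-in: production systems would more likely use a trained statistical model. The class names follow the examples in the text; the keyword lists, the function name and the fallback label are assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical keyword lists per class; real systems would learn these from data.
CLASS_KEYWORDS = {
    "politics": {"minister", "parliament", "election"},
    "sport": {"match", "goal", "team"},
    "culture": {"theatre", "concert", "exhibition"},
}


def classify(text, class_keywords=CLASS_KEYWORDS, default="unclassified"):
    """Return the class whose keywords occur most often in the text."""
    tokens = Counter(re.findall(r"\w+", text.lower()))
    scores = {
        label: sum(tokens[word] for word in words)
        for label, words in class_keywords.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to the default label when no keyword matched at all.
    return best if scores[best] > 0 else default


if __name__ == "__main__":
    print(classify("The team scored a late goal in the match"))  # sport
```

Keyword matching of this kind is sensitive to OCR errors and spelling variants, which is one reason classification benefits from the other techniques described here.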
The techniques for improved access are still very much under development.
Page layout
To retrieve a specific newspaper article that is spread over more than one page, the layout of the page needs to be reconstructed. The XML standard ALTO (Analyzed Layout and Text Object) was developed for this purpose. After specialized software has segmented the text into separate parts, the layout is stored in ALTO. ALTO can then be used to rebuild the layout of a page and to create new derivatives, such as PDF files, at any time. ALTO is widely used in newspaper digitization.
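To give an impression of what is stored, the sketch below reads the text blocks and their positions from a small ALTO fragment using Python's standard library. The element and attribute names (TextBlock, TextLine, String, CONTENT, HPOS, VPOS) are part of the ALTO schema; the sample fragment, its coordinates and the function names are invented for illustration. Tag names are matched without their namespace, since the namespace differs between ALTO versions.

```python
import xml.etree.ElementTree as ET

# Invented ALTO fragment; real files also carry a Description section, styles, etc.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1" HPOS="100" VPOS="200" WIDTH="800" HEIGHT="120">
          <TextLine>
            <String CONTENT="De"/><SP/><String CONTENT="mensch"/>
          </TextLine>
          <TextLine>
            <String CONTENT="wikt"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def read_text_blocks(alto_xml):
    """Return one dict per TextBlock with its id, position and text."""
    root = ET.fromstring(alto_xml)
    blocks = []
    for block in root.iter():
        if local_name(block.tag) != "TextBlock":
            continue
        lines = []
        for line in block.iter():
            if local_name(line.tag) != "TextLine":
                continue
            words = [
                string.get("CONTENT", "")
                for string in line.iter()
                if local_name(string.tag) == "String"
            ]
            lines.append(" ".join(words))
        blocks.append({
            "id": block.get("ID"),
            "hpos": block.get("HPOS"),  # horizontal position, kept as text
            "vpos": block.get("VPOS"),  # vertical position, kept as text
            "text": "\n".join(lines),
        })
    return blocks


if __name__ == "__main__":
    for block in read_text_blocks(SAMPLE):
        print(block["id"], block["hpos"], block["vpos"])
        print(block["text"])
```

Because every block keeps its coordinates, the same data can serve both to display search hits on the page image and to assemble the blocks of one article in reading order.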