Using special software, we can convert scanned texts into computer-readable (or machine-readable) text, which is easier for users to search. This process is known as Optical Character Recognition (OCR). OCR makes scans readable not only for people but also for computers: the software automatically recognises the words on the scan, which enables users to search the text.
What is OCR?
The KB has digitised a large number of historical texts to make them accessible and preserve them for a wide audience. The first step in this process is to create a digital scan, which is essentially a photograph of the text. After scanning the pages of a book, for example, we can then view the pages as images on a computer. While these scans are perfectly readable for humans, the same cannot be said of computers.
Making scans computer-readable
We use a technique called Optical Character Recognition (OCR) to convert the text on scans into text that computers can read. OCR software can recognise:
- the location of text in an image file
- the constituent letters of the text
- whether the text on the page is divided into columns (as it often is in newspapers) or paragraphs
- graphical elements such as images or illustrations
By recognising these elements, the software can convert the text on a scan into computer-readable text.
Limitations of OCR
OCR quality depends on multiple factors, including:
- The quality of the image file (the scan). OCR software can struggle to recognise text on low-quality scans.
- The quality of the source material (e.g. a book). For example, the software may find it harder to recognise text on damaged pages or distinguish letters in older materials.
- The spelling of the original text. Words may be spelt differently in old texts than they are today.
Improving OCR
There are several ways to improve OCR, including software that detects spelling variations and crowdsourcing, in which volunteers manually correct texts. Developers have also created self-learning OCR software.
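One simple form of automatic correction compares each recognised word with a list of known words and replaces near misses. The sketch below is only an illustration of that idea, not the KB's method: the `LEXICON` list, the function names and the `cutoff` value are assumptions, and the matching relies on Python's standard `difflib` module.

```python
import difflib

# Hypothetical mini-lexicon; a real one would hold many thousands of word forms.
LEXICON = ["library", "newspaper", "historical", "town", "text"]


def correct_token(token, lexicon=LEXICON, cutoff=0.8):
    """Return the closest lexicon word, or the token unchanged if none is close enough."""
    lowered = token.lower()
    if lowered in lexicon:
        return token  # already a known word, keep it as written
    # get_close_matches ranks candidates by similarity and drops those below the cutoff.
    matches = difflib.get_close_matches(lowered, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_text(text):
    """Correct a whitespace-separated string token by token."""
    return " ".join(correct_token(t) for t in text.split())


print(correct_text("historical newspapcr"))
```

A higher `cutoff` makes the correction more cautious, which matters for historical texts: an old but legitimate spelling should not be silently replaced by a modern word.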
Why is OCR important?
Computer-readable texts have three major advantages.
1. Fully searchable texts
OCR makes digital texts fully searchable, so readers no longer need to rely solely on metadata or on elements such as newspaper headlines. OCR software can also recognise historical spelling variants, so users can search for a word in modern spelling (e.g. 'town') and find results that use an older variant (e.g. 'toune'). This greatly improves the searchability of historical texts.
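Searching across spelling variants can be pictured as expanding the query before matching. The following is a minimal sketch under stated assumptions: the `VARIANTS` table, the function names and the sample sentences are made up for illustration and do not describe how Delpher or DBNL implement search.

```python
# Hypothetical variant table: modern spelling -> known historical spellings.
VARIANTS = {"town": {"toune", "towne"}}


def expand_query(word):
    """Return the query word together with its known historical variants."""
    w = word.lower()
    return {w} | VARIANTS.get(w, set())


def search(query, documents):
    """Return the indices of documents that contain the query or one of its variants."""
    terms = expand_query(query)
    hits = []
    for i, doc in enumerate(documents):
        # Strip surrounding punctuation so that "toune." still matches "toune".
        tokens = {t.strip(".,;:!?'\"").lower() for t in doc.split()}
        if terms & tokens:
            hits.append(i)
    return hits


docs = ["He rode to the toune.", "A village nearby", "The town hall"]
print(search("town", docs))  # [0, 2]
```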
2. Automatic categorisation and summaries of texts
By analysing the genre and main topic of scanned texts, software can automatically classify texts into one or more predefined categories. Newspaper reports, for example, can be classified as news items, family announcements, ads, etc. or by topics such as politics, sports or culture.
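As a rough illustration of classification into predefined categories, the sketch below assigns topics by counting keyword hits. Production systems typically use trained models instead; the keyword lists and the `classify` function here are assumptions chosen only to make the idea concrete.

```python
# Hypothetical keyword lists per topic; real classifiers are usually trained on labelled data.
CATEGORY_KEYWORDS = {
    "politics": {"minister", "parliament", "election"},
    "sports": {"match", "goal", "team"},
    "culture": {"theatre", "concert", "exhibition"},
}


def classify(text, min_hits=1):
    """Return every category whose keywords occur at least min_hits times in the text."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    scores = {
        category: sum(token in keywords for token in tokens)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    return sorted(category for category, hits in scores.items() if hits >= min_hits)


print(classify("The minister attended the concert."))  # ['culture', 'politics']
```

Because a text can match several keyword lists, the function returns a list, which mirrors the "one or more predefined categories" described above.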
We can also use software to automatically summarise computer-readable texts, which can help users quickly determine whether a text is relevant to them. Summaries also make texts easier to classify, which in turn leads to improved search result sorting.
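One basic approach to automatic summarisation is extractive: pick the sentences whose words occur most often in the text. The sketch below assumes that approach purely for illustration; the stop-word list, the scoring rule and the `summarise` function are hypothetical and say nothing about the software the KB actually uses.

```python
import re
from collections import Counter

# Small illustrative stop-word list; these words are ignored when scoring.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was", "it", "on"}


def content_words(text):
    """Lower-case alphabetic words, with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def summarise(text, max_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        # A sentence scores the summed document frequency of its content words.
        return sum(freq[w] for w in content_words(sentence))

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


sample = "The flu spread in the town. The flu closed schools and the flu closed shops. It rained."
print(summarise(sample))
```

This scoring favours long sentences; dividing the score by sentence length is a common refinement.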
3. Large-scale research
Computer-readable texts allow researchers to conduct large-scale research on historical topics, such as the Spanish flu. Because researchers no longer have to painstakingly comb through all available sources manually, they can include larger amounts of text in their research.
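Once texts are computer-readable, a question such as "how often is a topic mentioned per year?" becomes a few lines of code. The sketch below assumes the corpus is available as `(year, text)` pairs; that format, the function name and the sample articles are placeholders for illustration, not real data.

```python
from collections import Counter


def mentions_per_year(articles, phrase):
    """Count case-insensitive occurrences of a phrase, grouped by publication year."""
    counts = Counter()
    needle = phrase.lower()
    for year, text in articles:
        counts[year] += text.lower().count(needle)
    return dict(sorted(counts.items()))


# Made-up sample articles, for illustration only.
articles = [
    (1918, "Spanish flu reaches the city. The Spanish flu spreads."),
    (1919, "Recovery after the Spanish flu."),
    (1918, "Harvest news"),
]
print(mentions_per_year(articles, "Spanish flu"))  # {1918: 2, 1919: 1}
```

Counts like these depend on OCR quality: a misrecognised word is simply not found, so poor scans can lead to undercounting.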
How does the KB use OCR?
Much of the collection on the Delpher and Digital Library for Dutch Literature (DBNL) websites has already been digitised, but we are still working on digitising more material.
OCR on DBNL
OCR quality on DBNL is very high because we manually check and correct all the texts. As a result, texts published on DBNL are 99.995% correct.
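A common way to express OCR quality is character accuracy: one minus the number of character edits needed to turn the OCR output into the correct text, divided by the length of the correct text. The source does not say how the 99.995% figure is measured, so the sketch below is an assumption about one possible metric, not a description of the KB's procedure. If measured per character, 99.995% would correspond to about one error in every 20,000 characters.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(ocr_text, reference):
    """Share of the reference text that the OCR output reproduces correctly (0.0 to 1.0)."""
    if not reference:
        return 1.0 if not ocr_text else 0.0
    return max(0.0, 1 - edit_distance(ocr_text, reference) / len(reference))


print(character_accuracy("toune", "towne"))  # one wrong character out of five
```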
OCR on Delpher
OCR quality on Delpher is slightly patchier. For example, the scan quality of several historical newspapers is rather poor, which makes producing high-quality digitised texts more challenging. We are currently working on monitoring, checking and possibly improving OCR quality on Delpher.