What will you be doing?
The digitized collection of the KB, National Library of the Netherlands, is mostly available through the Delpher portal (see https://www.delpher.nl). All scanned text images have been processed with the ABBYY optical character recognition (OCR) software. However, this software does not always perform well on historical material, and the OCR quality in Delpher is not as high as its users would like.
Over the past year the KB has produced 2000 manually corrected newspaper pages for the development and testing of a machine learning tool for OCR post-correction. This ground-truth data can also be used to evaluate other existing OCR and/or HTR (handwritten text recognition) engines. In addition, we have around 4500 pages of ground truth from books, parliamentary papers and newspapers from an earlier project (see http://lab.kb.nl/dataset/ground-truth-impact-project).
During this research internship or graduation assignment we would like to examine the possibilities and outcomes of current OCR/HTR engines on the Delpher corpus. This internship could also include participation in the ICT With Industry Workshop in January 2020 where the KB will be working on OCR for Gothic texts.
The assignment consists of two parts:
1. Training and running OCR/HTR engines
Many of the engines require training to adapt to the material type and language. Given that the Delpher corpus is quite large, it may be necessary to train models for different time periods. We would also like to know how time-specific models compare with a more generic model.
2. Evaluating and comparing the output of different engines
This internship results in an evaluation report comparing the engines used and their results. The engines we are interested in are at least:
The intern is welcome to use more engines or combine processes to improve results.
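As an illustration of what the comparison in part 2 could be based on, the sketch below computes a character error rate (CER) and a word error rate (WER) of engine output against a ground-truth transcription. It is only a minimal example, not a prescribed evaluation method: the function names, the sample strings and the engine labels `engine_a` and `engine_b` are hypothetical, and splitting words on whitespace is a simplifying assumption.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance between two sequences (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def cer(ground_truth: str, ocr_output: str) -> float:
    """Character error rate: character edits divided by ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


def wer(ground_truth: str, ocr_output: str) -> float:
    """Word error rate, using whitespace tokenisation (a simplifying assumption)."""
    reference_words = ground_truth.split()
    if not reference_words:
        raise ValueError("ground truth must contain at least one word")
    return levenshtein(reference_words, ocr_output.split()) / len(reference_words)


# Hypothetical sample: one ground-truth line and the output of two made-up engines.
truth = "de koning"
outputs = {"engine_a": "dc koning", "engine_b": "de koning"}
for name, text in outputs.items():
    print(f"{name}: CER={cer(truth, text):.3f} WER={wer(truth, text):.3f}")
```

Averaging such scores per engine, and per time period or material type, would give a simple basis for the comparison in the evaluation report.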
We are looking for a student who
- is in the final stage of a degree in Software Engineering, Computer Science, Artificial Intelligence, Data Science, (Digital) Humanities or a related field
- can work technically independently, but with substantive support of our Data Science and Digitisation Team
- can work with existing tooling, or knows how to get up to speed with it
- has expertise in the field of NLP, machine learning and statistical models, for example for the evaluation of output
- has basic knowledge of the Dutch language; this is not a requirement, but some understanding will be helpful
What do we offer?
- A research internship at the Data Science Team of the Research Department of the National Library of the Netherlands (18 fte) for a maximum of 6 months, tailored to your needs and the requirements of your university and supervisor
- A workplace at the KB offices in downtown The Hague, only a 3-minute walk from Central Station
- Substantive support from both our Data Science Team and the Digitisation Team
- Access to all ground-truth data of the KB, the tooling we developed earlier, and the hardware we have available to run your own experiments
- Reimbursement of travel costs and compensation in line with our regular internship compensation
Who are we?
The KB is a nationally and internationally renowned institution: with more than 500 employees, we are one of the major Dutch heritage and science institutions and have an important coordinating role in the network of public libraries. Our tasks include preserving, collecting and making available all publications published in or about the Netherlands, and building the national digital library. We also think it is important to train young colleagues.
We regularly offer internships to students from various courses and disciplines, both academic and higher professional (e.g. book studies, literary studies, Artificial Intelligence, Data Science, Software Engineering, (Digital) Humanities, IT, HRM, finance, facility management, communication, etc.). For example, we supervise HBO students during their work placement, as well as university students who want to carry out their (graduation) research or graduation project at the KB.