The peculiarities of early printing technology make it difficult to convert early modern texts (before 1800) into digital text using Optical Character Recognition (OCR) software. Over the next two years, the OCR’ing Early Modern Texts (eMOP) project, in which the KB participates, will work on improving this situation.
Within eMOP, two open source OCR engines (Tesseract and Gamera) will be adapted to recognise historic fonts, improve line segmentation and support evaluation. In addition, tools will be developed for the adequate transcription of early modern texts through cutting-edge crowdsourcing technologies. Through the 18thConnect portal, eMOP will offer access to a large number of documents with improved OCR from the collections Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO), multiple crowdsourcing tools for OCR post-correction, and an integrated workflow of all tools developed within the project.
The project started on 1 October 2012 and has a duration of two years. It is funded by the Andrew W. Mellon Foundation and led by the Initiative for Digital Humanities, Media and Culture (IDHMC) at Texas A&M University. The KB National Library of the Netherlands is one of the six partners in the project and will, over the coming years, support the evaluation process and work on integrating the tools into the Taverna workflow system.
More information is available through the website: http://emop.tamu.edu/
- Texas A&M University, Initiative for Digital Humanities, Media, and Culture (IDHMC) – Project lead
- Performant Software Solutions
- University of Illinois Urbana-Champaign, Software Environment for the Advancement of Scholarly Research (SEASR)
- University of Massachusetts Amherst, Center for Intelligent Information Retrieval (CIIR)
- Koninklijke Bibliotheek
- University of Salford Manchester, Pattern Recognition & Image Analysis Research (PRIMA)