Pitt community: Write to Digital Scholarship Services or use our AskUs form
Pitt health sciences researchers: Contact Data Services, Health Sciences Library System
"Optical Character Recognition (OCR) @ Pitt" by University of Pittsburgh Library System is licensed for reuse under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Alpert-Abrams, Hannah. Transcribing Multilingual and Historical Documents. University of Texas at Austin, 2018. [PDF]
Workshop materials covering manual and automatic transcription.
Mandell, Laura. Early Modern OCR Project. Texas A&M University.
The Early Modern OCR Project (eMOP) is an effort, on the one hand, to make access to texts more transparent and, on the other, to preserve a literary cultural heritage. The printing process in the hand-press period (roughly 1475-1800), while systematized to a certain extent, nonetheless produced texts with fluctuating baselines, mixed fonts, and varied concentrations of ink (among many other variables). Combining these factors with the poor quality of the images in which many of these books have been preserved (in EEBO and, to a lesser extent, ECCO) creates a problem for Optical Character Recognition (OCR) software that is trying to translate the images of these pages into archiveable, mineable texts. By using innovative applications of OCR technology and crowd-sourced corrections, eMOP will solve this OCR problem.
Reading the First Books: Multilingual, Early-Modern OCR for Primeros Libros. University of Texas at Austin, December 2017.
Reading the First Books: Multilingual, Early-Modern OCR for Primeros Libros is a two-year, multi-university effort to develop tools for the automatic transcription of early modern printed books. It is a collaboration between students, faculty, and staff at the University of Texas at Austin and Texas A&M University. The goal of the “Reading the First Books” project is to design and implement tools for the transcription of the Primeros Libros collection of books printed in New Spain in the sixteenth century.
Smith, David A., and Ryan Cordell. A Research Agenda for Historical and Multilingual Optical Character Recognition. Northeastern University, 2018.
In this report, David A. Smith and Ryan Cordell of Northeastern University’s NULab for Texts, Maps, and Networks survey the current state of OCR for historical documents and recommend concrete steps that researchers, implementors, and funders can take to make progress over the next five to ten years. Advances in artificial intelligence for image recognition, natural language processing, and machine learning will drive significant progress. More importantly, sharing goals, techniques, and data among researchers in computer science, in book and manuscript studies, and in library and information sciences will open up exciting new problems and allow the community to allocate resources and measure progress.
National Library of Luxembourg has an Open Source Optical Character Recognition viewer. National Library of Luxembourg, 2021.
The National Library of Luxembourg generates transcriptions for its newspaper collections with Kraken, achieving 96% accuracy on its test set, and then uses Kraken to build a new OCR tool called Nautilus-OCR.
Doughman, Jad, et al. Time-Aware Word Embedding for Three Lebanese News Archives. American University of Beirut, 2020. [PDF]
Scholars at the American University of Beirut used Kraken to OCR Arabic script in Lebanese newspaper archives.
Wick, Christopher, et al. Comparison of OCR Accuracy on Early Printed Books using the Open Source Engines Calamari and OCRopus. Journal for Language Technology and Computational Linguistics, 2018. [PDF]
In this report, Wick and others use Calamari OCR, which adds convolutional (CNN) and pooling layers to the network, to train models for transcribing early printed books, resulting in a lower character error rate (CER) than OCRopus.
Gabay, Simon, et al. OCR17: Ground Truth and Models for 17th c. French Prints (and hopefully more). Journal of Data Mining and Digital Humanities, 2020. [PDF]
Gabay and others compare OCR accuracy for 17th c. French prints using Calamari and Kraken OCR.
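Several of the studies listed above report results as a character error rate (CER) or as an accuracy percentage. As a rough illustration only, CER is commonly computed as the edit (Levenshtein) distance between a ground-truth transcription and the OCR output, divided by the length of the ground truth. The sketch below is a minimal, standard-library-only version of that calculation; the function names are hypothetical and are not taken from Kraken, Calamari, or any of the cited projects, each of which may normalize text or count errors differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference text.
    (Hypothetical helper; assumes a non-empty ground-truth string.)"""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Under this definition, a CER of 0.04 corresponds roughly to the "96% accuracy" style of reporting, although the two are not always strictly interchangeable because insertions can push CER above 1.0.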