Pitt community: Write to Digital Scholarship Services or use our AskUs form
Pitt health sciences researchers: Contact Data Services, Health Sciences Library System
"Optical Character Recognition (OCR) @ Pitt" by University of Pittsburgh Library System is licensed for reuse under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Programmatic tools require at least some programming knowledge. Depending on the tool and your coding proficiency, you may be able to customize the OCR functionality more than with out-of-the-box tools. The following recommended tools vary by type (e.g., JavaScript scripts, Python modules, Python scripts, Python wrappers) and may or may not be compatible with your platform (operating system). All tools are freely available.
Tesseract is open-source OCR software that extracts printed text from images. It can be used directly from the command line or, for programmers, through an API. Tesseract doesn’t have a built-in GUI (graphical user interface), but several are available from the 3rdParty page. Its engines include a neural network (LSTM) based OCR engine, which is focused on line recognition, as well as an engine that works by recognizing character patterns. Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages "out of the box". Tesseract supports various output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, and TSV.
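As a rough illustration of command-line use, the following Python sketch assembles a tesseract invocation for the output formats listed above. The helper names (build_tesseract_command, run_tesseract) and the file names are hypothetical and chosen only for this example; it assumes a Tesseract 4 or later install on your PATH with the relevant language data available.

```python
import shutil
import subprocess

# Output formats named above, mapped to Tesseract's config-file names.
# Plain text is Tesseract's default, so it needs no config file.
OUTPUT_CONFIGS = {"text": None, "hocr": "hocr", "pdf": "pdf", "tsv": "tsv"}


def build_tesseract_command(image_path, output_base, lang="eng",
                            fmt="text", text_only_pdf=False):
    """Return the argument list for one tesseract run (hypothetical helper)."""
    if fmt not in OUTPUT_CONFIGS:
        raise ValueError(f"unsupported output format: {fmt}")
    if text_only_pdf and fmt != "pdf":
        raise ValueError("text_only_pdf only applies to PDF output")

    # Basic shape: tesseract <image> <output base name> -l <language>
    cmd = ["tesseract", image_path, output_base, "-l", lang]
    if text_only_pdf:
        # Produce a PDF that holds only the invisible text layer.
        cmd += ["-c", "textonly_pdf=1"]
    if OUTPUT_CONFIGS[fmt] is not None:
        cmd.append(OUTPUT_CONFIGS[fmt])
    return cmd


def run_tesseract(cmd):
    """Run the command, failing early if tesseract is not installed."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("tesseract was not found on PATH")
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "page.png" is a placeholder file name.
    print(build_tesseract_command("page.png", "page", fmt="hocr"))
```

Building the command separately from running it makes it easy to inspect what would be executed before processing a large batch of images.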
Kraken OCR is a command-line Python package that generates transcriptions for historical documents in a variety of languages. Kraken can train models to generate transcriptions for Latin scripts and non-Latin scripts (e.g., Naskh, Aramaic, Devanagari), as well as texts written right-to-left and top-to-bottom. The Kraken package provides free public models that users can run on their documents.
Calamari is a Python-based OCR package built upon OCRopy and Kraken. Calamari specializes in generating transcriptions for early printed texts but performs well on contemporary texts, too. Calamari uses confidence voting and model pretraining, resulting in a low character error rate (CER). Calamari does not perform layout analysis or line segmentation; those tasks have to be performed separately. Its focus is on transcribing line images to text. Calamari can, however, be integrated with other Python-based OCR programs (Kraken, pyocr, OCRopy, etc.) to complete all stages of the OCR process, from image preprocessing, through training, to transcription.
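To make the CER figure concrete, here is a minimal Python sketch of one common way to compute it: the edit (Levenshtein) distance between the OCR output and a hand-corrected reference, divided by the length of the reference. This is not Calamari's own evaluation code; the function names and sample strings are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One wrong character ("1" for "i") in an 11-character reference.
    print(character_error_rate("recognition", "recogn1tion"))
```

A lower CER means fewer corrections are needed after OCR; comparing CER on a small hand-transcribed sample is a simple way to choose between models or tools.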
Nautilus-OCR is an open-source, Python-based OCR engine developed at the National Library of Luxembourg. Nautilus-OCR works with the METS/ALTO schemas, with the ability to take in a METS/ALTO dataset and produce an improved METS/ALTO dataset. The National Library of Luxembourg used Nautilus-OCR on their historical newspaper collection and published the OCR models they produced on that project for public use with Nautilus-OCR.
PyOCR is a Python wrapper for Tesseract and Cuneiform, which simplifies the use of these OCR tools.
Neural Network OCR trains a multi-layer perceptron (MLP) neural network to perform OCR. The training set is automatically generated using a heavily modified version of the captcha generator node-captcha. It also supports the MNIST handwritten digit database.
doc2text was created to help researchers fix common errors in poorly scanned PDFs and extract the highest-quality text possible from them. It can detect text blocks and OCR poorly scanned PDFs in bulk.
tesserocr is a Python wrapper for the Tesseract API.