Guides: Optical Character Recognition (OCR) @ Pitt: Programmatic OCR Tools

Programmatic Tools

Programmatic tools require at least some programming knowledge. Depending on the tool and your coding proficiency, you may be able to customize the OCR functionality more than with out-of-the-box tools. The following recommended tools vary by type (e.g., command-line program, JavaScript scripts, Python module, Python package, Python wrapper) and may or may not be compatible with your platform (operating system). All tools are freely available.

Tesseract

Tesseract is open-source OCR software that can be used directly from the command line or (for programmers) through an API to extract printed text from images. Tesseract doesn’t have a built-in GUI (graphical user interface), but several are available from the 3rdParty page. Its engines include a neural network (LSTM) based OCR engine, which focuses on line recognition, as well as a legacy engine that works by recognizing character patterns. Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages "out of the box". Tesseract supports various output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, and TSV.

Type: Command-line program
Batch Processing: Yes
Helpful Resource(s):
- tessdoc. “Tesseract User Manual.”
- Wolf, Nick. “Research Guides: Tesseract OCR Software Tutorial: Home.”
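
As a minimal sketch of batch processing, the Python snippet below builds the tesseract command line for each image in a folder and runs it only if Tesseract is installed. The function names, the .png filter, and the file names are illustrative assumptions, not part of Tesseract; the trailing arguments (txt, hocr, pdf, tsv) are the output configs described in the Tesseract User Manual.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="eng", formats=("txt",)):
    """Return the argument list for one Tesseract command-line run.

    The trailing names (txt, hocr, pdf, tsv) are Tesseract config files
    that select the output format(s).
    """
    return ["tesseract", str(image), str(output_base), "-l", lang, *formats]


def ocr_folder(folder, lang="eng", formats=("txt",)):
    """Build one command per PNG in a folder and run it if Tesseract exists.

    Returns the list of commands that were built (hypothetical helper).
    """
    commands = []
    for image in sorted(Path(folder).glob("*.png")):
        # Tesseract appends the file extension itself, so pass the base name only.
        command = build_tesseract_command(image, image.with_suffix(""), lang, formats)
        if shutil.which("tesseract"):  # skip quietly if Tesseract is not installed
            subprocess.run(command, check=True)
        commands.append(command)
    return commands


if __name__ == "__main__":
    # Illustrative file names only.
    print(build_tesseract_command("page_001.png", "page_001", formats=("txt", "hocr")))
```

For example, ocr_folder("scans", lang="deu", formats=("txt", "pdf")) would write a .txt and a searchable .pdf next to each PNG, provided the German language data is installed.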

Kraken OCR

Kraken OCR is a command-line Python package that generates transcriptions of historical documents in a variety of languages. Kraken can train models to transcribe Latin and non-Latin scripts (e.g., Naskh, Aramaic, Devanagari), as well as texts written right-to-left and top-to-bottom. Free public models are also available for users to run on their documents.

Type: Python package
Tested/Compatible Platform(s): Linux, macOS

Calamari OCR

Calamari is a Python-based OCR package built upon OCRopy and Kraken. Calamari specializes in generating transcriptions of early printed texts but performs well on contemporary texts, too. Calamari uses confidence voting and model pretraining, resulting in a low character error rate (CER). Its focus is on transcribing line images to text: Calamari does not perform layout analysis or line segmentation, so those tasks have to be performed separately. Calamari can, however, be integrated with other Python-based OCR programs (Kraken, pyocr, OCRopy, etc.) to complete all stages of the OCR process, from image preprocessing, through training, to transcription.

Type: Python package
Tested/Compatible Platform(s): Windows, macOS, Linux
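
To make the CER figure concrete: the character error rate is conventionally the edit (Levenshtein) distance between the OCR output and a ground-truth transcription, divided by the length of the ground truth. The Python sketch below is a generic illustration of that definition, not Calamari's own evaluation code, and the function names are hypothetical.

```python
def levenshtein(reference, hypothesis):
    """Count the character insertions, deletions, and substitutions
    needed to turn the hypothesis into the reference."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the ground-truth text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One wrong character out of ten gives a CER of 0.1 (10%).
    print(character_error_rate("historical", "histor1cal"))
```

A lower CER means fewer character-level corrections are needed after OCR.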

Nautilus-OCR

Nautilus-OCR is an open-source, Python-based OCR engine developed at the National Library of Luxembourg. Nautilus-OCR works with the METS/ALTO schemas, with the ability to take in a METS/ALTO dataset and produce an improved METS/ALTO dataset. The National Library of Luxembourg used Nautilus-OCR on their historical newspaper collection and published the OCR models they produced on that project for public use with Nautilus-OCR.

Type: Python package
Tested/Compatible Platform(s): Linux, macOS

PyOCR

PyOCR is a Python wrapper for Tesseract and Cuneiform, which simplifies the use of these OCR tools.

Type: Python wrapper
Tested/Compatible Platform(s): GNU/Linux, *BSD (probably), macOS (maybe), Windows (maybe)
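
A minimal sketch of how PyOCR might be called, assuming the pyocr and Pillow packages plus at least one OCR engine (such as Tesseract) are installed. The function name ocr_with_pyocr and the choice to return None when nothing is available are illustrative assumptions.

```python
from pathlib import Path


def ocr_with_pyocr(image_path, lang="eng"):
    """Return recognized text, or None if PyOCR or an OCR engine is missing."""
    path = Path(image_path)
    if not path.is_file():
        raise FileNotFoundError(path)

    try:
        import pyocr
        import pyocr.builders
        from PIL import Image
    except ImportError:
        return None  # PyOCR or Pillow is not installed

    tools = pyocr.get_available_tools()  # OCR engines PyOCR found on this machine
    if not tools:
        return None

    with Image.open(path) as image:
        # TextBuilder asks for plain text; the first detected engine is used.
        return tools[0].image_to_string(
            image, lang=lang, builder=pyocr.builders.TextBuilder()
        )
```

The same call works whichever supported engine PyOCR detects, which is the main convenience of using the wrapper.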

Neural Network OCR

Neural Network OCR trains a multi-layer perceptron (MLP) neural network to perform OCR. The training set is generated automatically using a heavily modified version of the captcha generator node-captcha. It also supports the MNIST handwritten digit database.

Type: JavaScript scripts
Tested/Compatible Platform(s): macOS

doc2text

doc2text was created to help researchers fix common errors in poorly scanned PDFs and extract the highest-quality text possible from them. It can detect text blocks and OCR poorly scanned PDFs in bulk.

Type: Python module
Tested/Compatible Platform(s): Ubuntu

tesserocr

tesserocr is a Python wrapper for the Tesseract API.

Type: Python wrapper
Tested/Compatible Platform(s): *BSD, Debian, Linux, macOS, Ubuntu, Windows
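
As a rough illustration of calling the Tesseract API through tesserocr, the sketch below reads one image and returns its text along with per-word confidence scores. It assumes tesserocr and the relevant Tesseract language data are installed; the function name ocr_with_tesserocr and its None fallback are hypothetical.

```python
from pathlib import Path


def ocr_with_tesserocr(image_path, lang="eng"):
    """Return (text, word_confidences), or None if tesserocr is not installed."""
    path = Path(image_path)
    if not path.is_file():
        raise FileNotFoundError(path)

    try:
        from tesserocr import PyTessBaseAPI
    except ImportError:
        return None

    # The context manager releases the underlying Tesseract API object on exit.
    with PyTessBaseAPI(lang=lang) as api:
        api.SetImageFile(str(path))
        text = api.GetUTF8Text()
        confidences = api.AllWordConfidences()  # one score per recognized word
    return text, confidences
```

Because tesserocr talks to the Tesseract library directly rather than launching the command-line program, it can be convenient when many images are processed in one Python session.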

University of Pittsburgh Library System
