Optical Character Recognition (OCR) @ Pitt

This LibGuide introduces users to optical character recognition (OCR), outlines OCR best practices, provides information and resources for OCR tools, and links to example OCR projects.

Programmatic Tools

Programmatic tools require at least some programming knowledge. Depending on the tool and your coding proficiency, you may be able to customize the OCR functionality more than with out-of-the-box tools. The following recommended tools vary by type (e.g., JavaScript scripts, Python module, Python package, Python wrapper) and may or may not be compatible with your platform (operating system). All tools are freely available.

 

Tesseract

 

Tesseract is an open-source OCR engine that extracts printed text from images. It can be used directly from the command line or, for programmers, through an API. Tesseract doesn’t have a built-in GUI (graphical user interface), but several are available from the 3rdParty page. It includes a neural network (LSTM) based OCR engine, which is focused on line recognition, as well as an engine that works by recognizing character patterns. Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages "out of the box". It supports several output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, and TSV.
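The command-line usage described above can be sketched from Python with the standard library. This is a minimal sketch, assuming the `tesseract` executable is installed and on your PATH; the helper names (`build_tesseract_command`, `run_tesseract`) and the file name `page.png` are hypothetical and only serve as illustration.

```python
import shutil
import subprocess
from pathlib import Path

# Output formats mentioned in the guide, by the config name passed to the CLI.
OUTPUT_CONFIGS = {"txt", "hocr", "pdf", "tsv"}


def build_tesseract_command(image_path, output_base, lang="eng", output_format="txt"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG FORMAT."""
    if output_format not in OUTPUT_CONFIGS:
        raise ValueError(f"unsupported output format: {output_format}")
    return ["tesseract", str(image_path), str(output_base), "-l", lang, output_format]


def run_tesseract(image_path, output_base, lang="eng", output_format="txt"):
    """Run Tesseract and return the path of the file it is expected to write."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("the tesseract executable was not found on PATH")
    command = build_tesseract_command(image_path, output_base, lang, output_format)
    subprocess.run(command, check=True)
    # Tesseract appends the format's extension to the output base name.
    return Path(f"{output_base}.{output_format}")


if __name__ == "__main__":
    # Hypothetical input file; replace with a real scanned page.
    print(build_tesseract_command("page.png", "page", lang="eng", output_format="hocr"))
```

Keeping the command construction separate from the call that runs it makes it easy to inspect or log the exact command before processing a large batch of images.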

Kraken OCR

 

Kraken OCR is a command-line Python package that generates transcriptions for historical documents in a variety of languages. Kraken can train models to generate transcriptions for Latin and non-Latin scripts (e.g., Naskh, Aramaic, Devanagari), as well as texts written right-to-left and top-to-bottom. The Kraken package provides free public models that users can run on their documents.

  • Type: Python package
  • Tested/Compatible Platform(s): Linux, macOS

 

Calamari OCR

 

Calamari is a Python-based OCR package built upon OCRopy and Kraken. Calamari specializes in generating transcriptions for early printed texts but performs well on contemporary texts, too. Calamari uses confidence voting and model pretraining, resulting in a low character error rate (CER). Calamari does not perform layout analysis or line segmentation; those tasks have to be performed separately. Its focus is on transcribing line images to text. Calamari can, however, be integrated with other Python-based OCR programs (Kraken, pyocr, OCRopy, etc.) to complete all stages of the OCR process, from image preprocessing, through training, to transcription.

  • Type: Python package
  • Tested/Compatible Platform(s): Windows, macOS, Linux
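The character error rate (CER) mentioned above is commonly computed as the edit distance between a reference transcription and the OCR output, divided by the length of the reference. The following is a minimal sketch of that common definition in plain Python; it is not Calamari's own evaluation code, and the function names are hypothetical.

```python
def levenshtein(reference, hypothesis):
    """Count the single-character insertions, deletions, and substitutions
    needed to turn `hypothesis` into `reference`."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Illustrative strings only: "ni" misread as "m" is a typical OCR confusion.
    print(character_error_rate("recognition", "recogmtion"))
```

A lower value means the OCR output is closer to the reference. Because the count is normalized by the reference length, the rate can exceed 1.0 when the output contains many spurious characters.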

 

Nautilus-OCR

 

Nautilus-OCR is an open-source, Python-based OCR engine developed at the National Library of Luxembourg. Nautilus-OCR works with the METS/ALTO schemas, with the ability to take in a METS/ALTO dataset and produce an improved METS/ALTO dataset. The National Library of Luxembourg used Nautilus-OCR on their historical newspaper collection and published the OCR models they produced on that project for public use with Nautilus-OCR.

  • Type: Python package
  • Tested/Compatible Platform(s): Linux, macOS

 

PyOCR

 

PyOCR is a Python wrapper for Tesseract and Cuneiform, which simplifies the use of these OCR tools.

  • Type: Python wrapper
  • Tested/Compatible Platform(s): GNU/Linux, *BSD (probably), macOS (maybe), Windows (maybe)
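As an illustration of how a wrapper simplifies things, the sketch below asks PyOCR for whichever OCR tools it can find and uses the first one to read an image. It assumes the `pyocr` and `Pillow` packages are installed along with at least one supported OCR engine; the function name `ocr_with_pyocr` is hypothetical, and the function returns `None` when those assumptions do not hold.

```python
from typing import Optional


def ocr_with_pyocr(image_path: str, lang: str = "eng") -> Optional[str]:
    """Return the text PyOCR extracts from an image, or None if PyOCR,
    Pillow, or an OCR engine is not available on this machine."""
    try:
        import pyocr
        import pyocr.builders
        from PIL import Image
    except ImportError:
        return None

    # PyOCR looks for installed engines such as Tesseract or Cuneiform.
    tools = pyocr.get_available_tools()
    if not tools:
        return None
    tool = tools[0]

    with Image.open(image_path) as image:
        return tool.image_to_string(
            image,
            lang=lang,
            builder=pyocr.builders.TextBuilder(),
        )
```

The same call works regardless of which engine PyOCR finds, which is the main convenience of using the wrapper.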

 

Neural Network OCR

 

Neural Network OCR trains a multi-layer perceptron (MLP) neural network to perform OCR. The training set is automatically generated using a heavily modified version of the captcha-generator node-captcha. It also supports the MNIST handwritten digit database.

  • Type: JavaScript scripts
  • Tested/Compatible Platform(s): macOS
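To make the idea of a multi-layer perceptron more concrete, the sketch below runs one forward pass through a small, untrained MLP in plain Python. It is not the Neural Network OCR project's code (which is written in JavaScript). The hidden layer size, the random weights, and all function names are assumptions chosen for illustration; only the 28 × 28 input size and the 10 digit classes follow the MNIST format.

```python
import math
import random


def init_layer(n_inputs, n_outputs, rng):
    """Create a fully connected layer with small random weights and zero biases."""
    return {
        "weights": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                    for _ in range(n_outputs)],
        "biases": [0.0] * n_outputs,
    }


def dense(layer, inputs):
    """Weighted sum of the inputs plus bias, for every unit in the layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(layer["weights"], layer["biases"])]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(pixels, hidden_layer, output_layer):
    """Forward pass: pixel values in [0, 1] -> hidden layer -> class probabilities."""
    hidden = [sigmoid(z) for z in dense(hidden_layer, pixels)]
    return softmax(dense(output_layer, hidden))


rng = random.Random(0)
hidden_layer = init_layer(28 * 28, 16, rng)  # 16 hidden units: an arbitrary choice
output_layer = init_layer(16, 10, rng)       # one output per digit, 0 through 9

blank_image = [0.0] * (28 * 28)              # placeholder input, not real data
probabilities = mlp_forward(blank_image, hidden_layer, output_layer)
```

Training would adjust the weights so that the highest probability lands on the correct character; this sketch only shows how an image flows through the network.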

 

doc2text

 

doc2text was created to help researchers fix common errors in poorly scanned PDFs and extract the highest-quality text possible from them. It can detect text blocks and OCR poorly scanned PDFs in bulk.

  • Type: Python module
  • Tested/Compatible Platform(s): Ubuntu

 

tesserocr

 

tesserocr is a Python wrapper for the Tesseract API.

  • Type: Python wrapper
  • Tested/Compatible Platform(s): *BSD, Debian, Linux, macOS, Ubuntu, Windows
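A minimal sketch of calling Tesseract through tesserocr is shown below. It assumes the `tesserocr` package and the Tesseract language data are installed; the function name `ocr_image_file` is hypothetical, and it returns `None` when tesserocr cannot be imported.

```python
from typing import Optional


def ocr_image_file(image_path: str, lang: str = "eng") -> Optional[str]:
    """Return the text Tesseract recognizes in an image file,
    or None if tesserocr is not installed."""
    try:
        import tesserocr
    except ImportError:
        return None
    # file_to_text is tesserocr's helper for OCRing an image file in one call.
    return tesserocr.file_to_text(image_path, lang=lang)
```

Because tesserocr talks to the Tesseract API directly rather than launching the command-line program, it can be a good fit when OCR is one step inside a larger Python script.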