Optical Character Recognition (OCR) @ Pitt

This LibGuide introduces users to optical character recognition (OCR), outlines OCR best practices, provides information and resources for OCR tools, and links to example OCR projects.

Introduction

Optical Character Recognition (OCR) is the electronic conversion of images of text into digitally encoded text using specialized software. OCR software enables a computer to convert a scanned document, a digital photo of text, or any other digital image of text into machine-readable, searchable, retrievable, and editable data. OCR data can then be used for a variety of applications, including data extraction, data/text mining, and text-to-speech technology.

 

OCR Workflow

 

The OCR process typically involves at least three steps:

  1. Scanning and/or opening a document in the OCR software,
  2. Recognizing the text in the document using the OCR software, and 
  3. Saving the new OCR-processed document in the file format of your choosing.

Depending on the quality of your document, you may also have to edit or "preprocess" the image to improve its quality and, thus, enable the OCR software to recognize the text more accurately. If you're working with text that the OCR software isn't equipped to recognize (for example, handwriting or atypical typography), you might need to use language packages, patterns, and training data to supplement the software's default pattern recognition settings. And, finally, depending on the accuracy of the OCR, you may have to verify and correct ("post-process") the OCR-generated text. These steps can require a considerable amount of time and effort, depending on the quality and extent of your documents, so account for them when planning your project.
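One way to estimate how much post-processing a batch of documents will need is to hand-correct a short sample and compare it with the raw OCR output. The sketch below is illustrative only and is not part of ABBYY FineReader or Adobe Acrobat: it computes a character error rate (CER) using a plain Levenshtein edit distance, and the function names `levenshtein` and `character_error_rate` are hypothetical helpers introduced for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the fewest single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, reference_text: str) -> float:
    """Edit distance divided by the length of the hand-corrected reference."""
    if not reference_text:
        raise ValueError("reference_text must not be empty")
    return levenshtein(ocr_text, reference_text) / len(reference_text)


if __name__ == "__main__":
    # Hypothetical sample: a hand-corrected line and its raw OCR output.
    reference = "Optical Character Recognition"
    ocr_output = "0ptical Charaeter Recognit1on"
    print(f"CER: {character_error_rate(ocr_output, reference):.3f}")
```

A low rate on a representative sample suggests light proofreading may be enough, while a high rate suggests revisiting scanning or preprocessing before correcting text by hand. What counts as "low" depends on how the text will be used, so treat any threshold as a project-specific assumption.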

OCR in the Library and at Pitt

Pitt and the Library provide access to two of the most sophisticated OCR software programs available: ABBYY FineReader and Adobe Acrobat Pro DC.

 

OCR Workstation

 

An OCR workstation with ABBYY FineReader 14 is located in the Digital Scholarship Commons, just outside the Digital Stewardship Lab (Ground Floor of Hillman Library). This workstation is available to any member of the University of Pittsburgh community at any time Hillman Library is open. The device can also be reserved online.

 

Adobe Acrobat Pro DC

 

Adobe Acrobat Pro DC is installed on most computers in the Student Computing Labs across campus, as well as in the specialized section of the Pitt IT Virtual Computing Lab, and can be used to recognize text in PDF documents. Acrobat, among other Adobe products, is also available for Pitt students and faculty to install for free. For instructions on getting access, see https://www.technology.pitt.edu/software/adobe-software.

This Guide

This guide provides information and resources for the following: