Pitt community: Write to Digital Scholarship Services or use our AskUs form
Pitt health sciences researchers: Contact Data Services, Health Sciences Library System
Optical Character Recognition (OCR) is the electronic conversion of images of text into digitally encoded text using specialized software. OCR software enables a computer to convert a scanned document, a digital photo of text, or any other digital image of text into machine-readable, searchable, retrievable, and editable data. OCR data can then be used for a variety of applications, including data extraction, data/text mining, and text-to-speech technology.
The OCR process typically involves at least three steps:
Depending on the quality of your document, you may also have to edit or "preprocess" the image to improve its quality and thus enable the OCR software to recognize the text more accurately. If you're working with text that the OCR software isn't equipped to recognize (for example, handwriting or atypical typography), you might need to use language packages, patterns, and training data to supplement the software's default pattern recognition settings. Finally, depending on the accuracy of the OCR, you may have to verify and correct ("post-process") the OCR-generated text. These steps can require considerable time and effort, depending on the quality and extent of your documents, so account for them when planning your project.
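To make the preprocessing and post-processing ideas more concrete, here is a minimal Python sketch that uses only the standard library. It is an illustration under stated assumptions, not the method of any particular OCR package: the function names `binarize` and `correct_tokens`, the threshold of 128, the similarity cutoff of 0.8, and the tiny vocabulary are all hypothetical choices made for this example. The image is represented as a plain list of rows of grayscale values (0 to 255), and the recognition step itself is left out because it depends on the OCR software you choose.

```python
import difflib


def binarize(gray_rows, threshold=128):
    """Preprocess: map each grayscale pixel (0-255) to pure black (0) or white (255).

    Pixels darker than the threshold become black; all others become white.
    The threshold value is an assumption and usually needs tuning per document.
    """
    return [[0 if px < threshold else 255 for px in row] for row in gray_rows]


def correct_tokens(text, vocabulary, cutoff=0.8):
    """Post-process: replace unknown tokens with the closest vocabulary word.

    Tokens already in the vocabulary are kept. Other tokens are swapped for
    the single best match at or above the similarity cutoff, or left
    unchanged when no candidate is close enough.
    """
    corrected = []
    for token in text.split():
        lowered = token.lower()
        if lowered in vocabulary:
            corrected.append(token)
            continue
        matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)


if __name__ == "__main__":
    # A 2x2 "image": dark pixels become 0, light pixels become 255.
    print(binarize([[10, 200], [128, 127]]))

    # Simulated OCR output in which the letter "o" was misread as zero.
    vocab = ["optical", "character", "recognition"]
    print(correct_tokens("0ptical character rec0gnition", vocab))
```

This sketch splits on whitespace only, so punctuation attached to a word would prevent a match. Automatic correction can also introduce new errors, which is one reason manual verification of OCR output still matters.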
For more on OCR, check out our Optical Character Recognition (OCR) @ Pitt guide, which provides information and resources for the following: