Optical Character Recognition (OCR) @ Pitt

This LibGuide introduces users to optical character recognition (OCR), outlines OCR best practices, provides information and resources for OCR tools, and links to example OCR projects.

Acronym Glossary

When reading about OCR projects, one may come across a variety of acronyms for tools, objects, or processes pertaining to OCR. This glossary provides definitions for common acronyms used in the OCR domain.

CER: character error rate

CER is the rate, in percent, at which OCR software incorrectly transcribes characters (letters, numbers, punctuation marks, etc.) in a text. A 10% CER means that, on average, the OCR incorrectly transcribed one out of every ten characters.

Helpful Resource: "Character Error Rate." READ-COOP SCE.
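
To make the definition above concrete, here is a minimal sketch of how CER is commonly computed: the edit distance (substitutions, insertions, and deletions) between a ground-truth transcription and the OCR output, divided by the length of the ground truth. The function names `edit_distance` and `cer` and the sample strings are illustrative assumptions, not part of any particular OCR tool.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate, in percent, relative to the reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# One wrong character out of ten gives a 10% CER.
print(cer("characters", "charactcrs"))  # 10.0
```

Because insertions count as errors, a CER computed this way can exceed 100% when the OCR output is much longer than the ground truth.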

HTD: handwritten text documents

HTD are documents that contain text handwritten by a person.

HTR: handwritten text recognition

HTR, also referred to as HWR (handwriting recognition), is software that uses machine learning to analyze handwritten documents, identify individual characters, and match them to their digital text equivalents, producing a machine-readable transcription of the handwritten text.

Helpful Resource: "Evaluating Offline Handwritten Text Recognition: Which Machine Learning Model is the Winner?" Annoi-Ai.

ICR: intelligent character recognition

Similar to OCR, ICR is machine learning software that can transcribe handwritten text as well as typed text. ICR can also transcribe a wider range of typed fonts than conventional OCR.

Helpful Resource: "What is Intelligent Character Recognition?" Docsumo.

IWR: intelligent word recognition

IWR uses AI to recognize words in a transcribed text by matching the characters identified via OCR, ICR, or HTR against a user-defined dictionary. IWR is particularly useful when generating transcriptions of handwritten documents, as pairing IWR with HTR reduces the character error rate.

Helpful Resource: "What is IWR? (Intelligent Word Recognition) How is it Related to Document Management?" eFileCabinet.
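
The dictionary-matching idea behind IWR can be illustrated with a much simpler stand-in. The sketch below is not an IWR engine and uses no machine learning. It only snaps each transcribed word to the closest entry in a user-defined word list, using fuzzy string matching from Python's standard `difflib` module. The function name `snap_to_lexicon`, the sample lexicon, and the 0.8 similarity cutoff are all assumptions chosen for illustration.

```python
import difflib


def snap_to_lexicon(words, lexicon, cutoff=0.8):
    """Replace each word with its closest lexicon entry, if one is similar enough."""
    corrected = []
    for word in words:
        # get_close_matches returns the best matches at or above the cutoff,
        # best first; the comparison is case-sensitive.
        matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return corrected


lexicon = ["Pittsburgh", "library", "archive"]   # hypothetical user dictionary
ocr_words = ["Pittsburqh", "libary", "the"]      # hypothetical OCR/HTR output
print(snap_to_lexicon(ocr_words, lexicon))
```

Words with no sufficiently similar dictionary entry, such as "the" here, are left unchanged, which is why the cutoff matters: set too low, it would overwrite correct words that simply are not in the dictionary.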

hOCR: HTML optical character recognition

A file format for representing OCR output, hOCR uses HTML (hypertext markup language) to embed layout information, character confidences, bounding boxes, and style information.

Helpful Resource: “HOCR.” Wikipedia.
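
As a rough illustration of how the information embedded in hOCR can be read back out, the sketch below uses Python's standard `html.parser` module to collect each word's text, bounding box, and confidence. It assumes the common convention in which a word is a `span` with class `ocrx_word` whose `title` attribute holds `bbox x0 y0 x1 y1` and a word-level confidence `x_wconf`. It also assumes word spans contain no nested spans. The class name `HocrWordParser`, the helper `parse_hocr_words`, and the sample markup are made up for this example.

```python
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collects text, bounding box, and confidence for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # the word span being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            # The title attribute holds ';'-separated properties, e.g.
            # "bbox 36 92 96 116; x_wconf 95".
            props = {}
            for part in (attrs.get("title") or "").split(";"):
                fields = part.split()
                if fields:
                    props[fields[0]] = fields[1:]
            conf = props.get("x_wconf")
            self._current = {
                "text": "",
                "bbox": tuple(int(v) for v in props.get("bbox", [])),
                "confidence": float(conf[0]) if conf else None,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


def parse_hocr_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    return parser.words


SAMPLE = """
<span class='ocr_line' title='bbox 36 92 580 122'>
  <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Optical</span>
  <span class='ocrx_word' title='bbox 109 92 213 116; x_wconf 91'>Character</span>
</span>
"""

for word in parse_hocr_words(SAMPLE):
    print(word["text"], word["bbox"], word["confidence"])
```

Real hOCR files vary by the tool that produced them, so the class names and properties present should be checked against the actual output before relying on a parser like this.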

ML: machine learning

Machine learning is an application of artificial intelligence that gives systems the ability to learn and improve automatically from data. Machine learning models are what allow OCR programs to ‘learn’ to identify characters in a text more accurately from a training set of documents.

Helpful Resource: "Character Recognition: the Basics." Data Driven Investor.

NER: named entity recognition

NER is a form of analysis in which a program identifies entities in a text and sorts them into categories. An entity is a word or group of words that (most commonly) functions as a proper noun, naming a person, location, organization, product, or event. Typical categories include names of people, geographic locations, organizations, months and days, and holidays, among many other types of named entities.

Helpful Resource: "What is named entity recognition (NER) and how can I use it?" super.AI.

OLR: optical layout recognition

OLR is a type of document layout analysis in which a program separates text zones from non-text zones to prepare a document for OCR. OLR is particularly useful for generating OCR transcriptions of digitized newspapers, as it can detect columns and illustrations. OLR is also useful for scanning traditional print books, where it can detect headers, footers, titles, and other features.

Helpful Resource: "Computer Vision Based Optical Document Layout Analysis: A Compatible Survey." ResearchGate.

TLE: text-line extraction

TLE is the process of segmenting a document page into lines of text for processing.

Helpful Resource: "Text-line Extraction from Handwritten Document Images using GAN." Expert Systems with Applications.
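
One classical way to segment a page into lines, far simpler than the GAN-based approach in the resource above, is a horizontal projection profile: count the ink pixels in each pixel row and treat runs of rows that contain ink as text lines. The sketch below assumes a clean, unskewed page already binarized into a list of rows where 1 means ink and 0 means background. The function name `extract_text_lines` and the `min_ink` parameter are illustrative assumptions.

```python
def extract_text_lines(image, min_ink=1):
    """Return (top, bottom) row ranges of text lines; bottom is exclusive."""
    lines = []
    start = None  # first row of the line currently being read, if any
    for y, row in enumerate(image):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                   # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y))    # a blank row ends the line
            start = None
    if start is not None:               # line runs to the bottom edge
        lines.append((start, len(image)))
    return lines


page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]
print(extract_text_lines(page))  # [(1, 3), (5, 6)]
```

This approach breaks down on skewed scans, touching lines, and most handwriting, which is part of why learned methods are used for harder material.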

WER: word error rate

WER is the rate, in percent, at which OCR software incorrectly transcribes or misspells words in a text. A 10% WER means that, on average, one out of every ten words was incorrectly transcribed.

Helpful Resource: "Evaluate OCR Output Quality with Character Error Rate (CER) and Word Error Rate (WER)." Towards Data Science.
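
WER is usually computed the same way as CER, but over words instead of characters: the word-level edit distance between the ground truth and the OCR output, divided by the number of words in the ground truth. The sketch below assumes words are separated by whitespace. The function names and sample sentences are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (here, lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate, in percent, relative to the reference text."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hyp_words) / len(ref_words)


truth = "the quick brown fox jumps over the lazy dog today"
ocr = "the quick brovvn fox jumps over the lazy dog today"
print(wer(truth, ocr))  # 10.0
```

A single wrong character makes the whole word count as an error, so WER is typically higher than CER for the same transcription.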