"Optical Character Recognition (OCR) @ Pitt" by University of Pittsburgh Library System is licensed for reuse under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
When reading about OCR projects, one may come across a variety of acronyms for tools, objects, or processes pertaining to OCR. This glossary provides definitions for common acronyms used in the OCR domain.
CER (character error rate) is the percentage of characters (letters, numbers, punctuation marks, etc.) that OCR software transcribes incorrectly from a text. A 10% CER means that the OCR software incorrectly transcribed one out of every ten characters.
Helpful Resource: "Character Error Rate." READ-COOP SCE.
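As a rough illustration of the CER definition above, the sketch below compares an OCR transcription against a known-correct reference text. It assumes the common convention of counting errors with the Levenshtein edit distance (substitutions, insertions, and deletions) and dividing by the length of the reference; the function names and sample strings are made up for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # character missing from OCR output
                            curr[j - 1] + 1,      # extra character in OCR output
                            prev[j - 1] + cost))  # substitution, or a match if cost is 0
        prev = curr
    return prev[-1]


def character_error_rate(reference, ocr_output):
    """CER as a percentage of the number of characters in the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 100.0 * edit_distance(reference, ocr_output) / len(reference)


reference = "characters"   # hypothetical ground truth, 10 characters
ocr_output = "charactcrs"  # hypothetical OCR output: one "e" was read as "c"
print(f"CER: {character_error_rate(reference, ocr_output):.1f}%")  # CER: 10.0%
```

One wrong character out of ten gives the 10% CER described in the definition. Because insertions count as errors, a CER computed this way can exceed 100% on very poor output.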
HTD (handwritten text documents) are documents that contain text handwritten by a person.
HTR (handwritten text recognition), also referred to as HWR (handwriting recognition), is software that uses machine learning to analyze handwritten documents, identify individual characters, and match them to their digital text equivalents, producing a typed transcription of the handwritten text.
Helpful Resource: "Evaluating Offline Handwritten Text Recognition: Which Machine Learning Model is the Winner?" Annoi-Ai.
Similar to OCR, ICR (intelligent character recognition) is machine learning software that can transcribe handwritten text as well as typed text. ICR can also successfully transcribe a wider range of typed fonts than OCR.
Helpful Resource: "What is Intelligent Character Recognition?" Docusomo.
IWR (intelligent word recognition) uses AI to recognize words from a user-defined dictionary in transcribed texts, working from the characters identified and transcribed via OCR, ICR, or HTR. IWR is particularly useful when generating transcriptions of handwritten documents with HTR, as pairing IWR with HTR reduces the character error rate.
Helpful Resource: "What is IWR? (Intelligent Word Recognition) How is it Related to Document Management?" eFileCabinet.
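The idea behind pairing IWR with HTR can be sketched with a much simpler stand-in: snapping each transcribed word to the closest entry in a user-defined dictionary. The example below uses fuzzy string matching from Python's standard `difflib` module rather than AI, so it is only an approximation of what IWR software does. The dictionary, the sample HTR output, and the `cutoff` value are all assumptions chosen for illustration.

```python
import difflib


def correct_with_dictionary(words, dictionary, cutoff=0.8):
    """Replace each word with its closest dictionary entry, if one is similar enough."""
    corrected = []
    for word in words:
        if word in dictionary:
            corrected.append(word)  # already a known word
            continue
        # get_close_matches returns at most n entries scoring at least `cutoff`
        matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return corrected


# Hypothetical user-defined dictionary and HTR output containing character errors
dictionary = ["pittsburgh", "library", "manuscript", "collection"]
htr_output = ["pittsburqh", "libnary", "manuscript", "of", "colection"]

print(correct_with_dictionary(htr_output, dictionary))
```

Words with no sufficiently similar dictionary entry (here, "of") are left unchanged, which is why the size of the dictionary and the similarity cutoff both matter in practice.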
A file format for representing OCR output, hOCR uses HTML (hypertext markup language) to embed layout information, character confidences, bounding boxes, and style information.
Helpful Resource: "HOCR." Wikipedia.
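To make the hOCR description more concrete, the sketch below reads a small hand-written hOCR fragment and pulls out each word's text, bounding box, and confidence. It relies on two hOCR conventions: words are marked with the class `ocrx_word`, and properties such as `bbox` and `x_wconf` are stored in the element's `title` attribute. The sample markup and its values are invented, and the parser assumes that word elements contain no nested `span` elements.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Extract bbox and word confidence from an hOCR title attribute."""
    props = {}
    for prop in title.split(";"):
        parts = prop.split()
        if not parts:
            continue
        if parts[0] == "bbox" and len(parts) == 5:
            props["bbox"] = tuple(int(v) for v in parts[1:])  # x0, y0, x1, y1
        elif parts[0] == "x_wconf" and len(parts) == 2:
            props["confidence"] = float(parts[1])
    return props


class HocrWordParser(HTMLParser):
    """Collects the words found in an hOCR document."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None when outside a word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._current = {"text": "", **parse_title(attrs.get("title") or "")}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if self._current is not None and tag == "span":
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


# Hypothetical hOCR fragment: one line containing two words
SAMPLE = """
<span class='ocr_line' title='bbox 36 92 580 122'>
  <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Optical</span>
  <span class='ocrx_word' title='bbox 109 92 205 116; x_wconf 71'>Charactcr</span>
</span>
"""

parser = HocrWordParser()
parser.feed(SAMPLE)
for word in parser.words:
    print(word["text"], word["bbox"], word["confidence"])
```

A lower confidence value, as on the second word here, is one way to flag words that may need manual review.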
Machine learning is an application of artificial intelligence that gives systems the ability to learn and improve automatically by analyzing data. Machine learning models are what allow OCR programs to 'learn' to identify characters in a text more accurately from a training set of documents.
Helpful Resource: "Character Recognition: the Basics." Data Driven Investor.
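The notion of 'learning' from a training set can be illustrated with a toy nearest-neighbor classifier: it stores labeled example images of characters and labels a new image with the label of the most similar stored example. This is a deliberately simplified sketch, not how production OCR engines work (those typically rely on neural networks). The 3x3 bitmaps, written as strings of 0s and 1s, and the class name are invented for this example.

```python
def hamming(a, b):
    """Number of pixel positions at which two equally sized bitmaps differ."""
    return sum(x != y for x, y in zip(a, b))


class NearestNeighborOCR:
    """Toy character classifier that memorizes labeled training bitmaps."""

    def __init__(self):
        self.examples = []  # list of (bitmap, label) pairs

    def train(self, bitmap, label):
        self.examples.append((bitmap, label))

    def predict(self, bitmap):
        # Choose the label of the training example with the fewest differing pixels
        _, label = min(self.examples, key=lambda ex: hamming(ex[0], bitmap))
        return label


# Hypothetical training set: 3x3 bitmaps flattened row by row
model = NearestNeighborOCR()
model.train("010010010", "I")
model.train("100100111", "L")
model.train("111010010", "T")

noisy_l = "100100011"  # an "L" with one pixel missing
print(model.predict(noisy_l))  # L
```

Adding more training examples per character, including noisy or unusual ones, gives the classifier more chances to find a close match, which mirrors how larger training sets tend to improve OCR accuracy.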
NER (named entity recognition) is a form of analysis in which a program identifies entities in a text. An entity is a word or group of words, most commonly a proper noun, that refers to a person, location, organization, product, event, etc. NER sorts entities into categories, including names of people, geographic locations, organizations, months and days, holidays, and many other types of named entities.
Helpful Resource: "What is named entity recognition (NER) and how can I use it?" super.AI.
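As a minimal sketch of what NER output looks like, the example below tags entities by looking them up in a small hand-made list of names (sometimes called a gazetteer). Real NER tools generally use trained statistical or neural models rather than fixed lists, so this is only an illustration of the input and output. The category labels, the name lists, and the sample sentence are assumptions made for this example.

```python
import re

# Hypothetical gazetteer mapping entity categories to known names
GAZETTEER = {
    "PERSON": ["Andrew Carnegie"],
    "ORG": ["University of Pittsburgh"],
    "LOC": ["Pittsburgh", "Pennsylvania"],
}


def find_entities(text, gazetteer):
    """Return (entity, category) pairs in the order they appear in the text."""
    candidates = []
    for label, names in gazetteer.items():
        for name in names:
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                candidates.append((m.start(), m.end(), name, label))

    # Prefer longer matches, so "University of Pittsburgh" is kept as one
    # organization instead of also tagging the "Pittsburgh" inside it.
    candidates.sort(key=lambda c: (-(c[1] - c[0]), c[0]))
    chosen = []
    for start, end, name, label in candidates:
        if all(end <= s or start >= e for s, e, _, _ in chosen):
            chosen.append((start, end, name, label))

    return [(name, label) for _, _, name, label in sorted(chosen)]


text = "Andrew Carnegie funded a library in Pittsburgh near the University of Pittsburgh."
for entity, category in find_entities(text, GAZETTEER):
    print(f"{entity}: {category}")
```

The same word can belong to different entities depending on context, as "Pittsburgh" does here, which is one reason model-based NER usually outperforms simple lookups.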
OLR (optical layout recognition) is a type of document layout analysis in which a program separates text zones from non-text zones to prepare a document for OCR. OLR is particularly useful for generating OCR transcriptions of digitized newspapers, as it can detect columns and illustrations. It is also useful for scanning traditional print books and for detecting headers, footers, titles, and other features.
Helpful Resource: "Computer Vision Based Optical Document Layout Analysis: A Compatible Survey." ResearchGate.
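One small piece of layout analysis, finding the columns on a page, can be sketched with a vertical projection profile: count the ink in each pixel column and treat sufficiently wide empty runs as gutters between text columns. This is a classic simplification and not necessarily the method used by any particular OLR tool. The tiny binary "page" (1 = ink, 0 = background), the function name, and the `min_gap` parameter are assumptions for illustration; the function expects a non-empty image.

```python
def find_columns(image, min_gap=2):
    """Return (left, right) pixel ranges of text columns; right is exclusive."""
    width = len(image[0])
    # Vertical projection profile: amount of ink in each pixel column
    profile = [sum(row[x] for row in image) for x in range(width)]

    columns = []
    start = None  # left edge of the text column currently being tracked
    gap = 0       # length of the current run of empty pixel columns
    for x, ink in enumerate(profile):
        if ink:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:  # gap is wide enough to count as a gutter
                columns.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        columns.append((start, width - gap))
    return columns


# Hypothetical 3-row page: two text columns separated by a 3-pixel gutter
page = [
    [1, 1, 0, 1, 0, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 0, 1, 0, 1],
    [1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
]
print(find_columns(page))  # [(0, 4), (7, 10)]
```

The single empty pixel column at position 2 is narrower than `min_gap`, so it is treated as spacing within a column rather than as a gutter.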
TLE (text line extraction) is the process of segmenting a document page into lines of text for processing.
Helpful Resource: "Text-line Extraction from Handwritten Document Images using GAN." Expert Systems with Applications.
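A simple way to picture TLE is the horizontal projection profile: count the ink in each pixel row and split the page wherever rows are empty. The sketch below uses that classic approach, which works for clean, straight lines; it is not the GAN-based method from the resource above, and handwritten pages often need more robust techniques. The small binary image (1 = ink, 0 = background) and the `min_ink` parameter are assumptions for illustration.

```python
def extract_text_lines(image, min_ink=1):
    """Return (top, bottom) row ranges of text lines; bottom is exclusive."""
    # Horizontal projection profile: amount of ink in each pixel row
    profile = [sum(row) for row in image]

    lines = []
    start = None  # top row of the line currently being tracked
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                 # a new text line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y))  # the line ended on the previous row
            start = None
    if start is not None:
        lines.append((start, len(image)))
    return lines


# Hypothetical page with two text lines separated by a blank row
page = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 1],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
]
print(extract_text_lines(page))  # [(1, 3), (4, 6)]
```

Each returned range can then be cropped from the page image and passed to OCR or HTR one line at a time.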
WER (word error rate) is the percentage of words that OCR software transcribes incorrectly or misspells from a text. A 10% WER means that the OCR software incorrectly transcribed one out of every ten words.
Helpful Resource: "Evaluate OCR Output Quality with Character Error Rate (CER) and Word Error Rate (WER)." Towards Data Science.
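The WER definition above can be sketched in the same way as a character-level error rate, except that whole words are compared instead of individual characters. The example assumes the common convention of counting word substitutions, insertions, and deletions with the Levenshtein edit distance and dividing by the number of words in the reference. The function name and sample sentences are made up for illustration.

```python
def word_error_rate(reference, ocr_output):
    """WER as a percentage of the number of words in the reference."""
    ref = reference.split()
    hyp = ocr_output.split()
    if not ref:
        raise ValueError("reference text must contain at least one word")

    # Levenshtein distance computed over words rather than characters
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from OCR output
                            curr[j - 1] + 1,      # extra word in OCR output
                            prev[j - 1] + cost))  # substitution, or a match if cost is 0
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Hypothetical ground truth (10 words) and OCR output with one misread word
reference = "the quick brown fox jumps over the lazy sleeping dog"
ocr_output = "the quick brovvn fox jumps over the lazy sleeping dog"
print(f"WER: {word_error_rate(reference, ocr_output):.1f}%")  # WER: 10.0%
```

A single wrong character makes the whole word count as an error, so WER is usually higher than CER for the same transcription.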