Following these OCR best practices will help you to effectively and efficiently recognize text in your documents. The following sections outline steps you should take in your OCR project and provide resources that can help you along the way:
Successful OCR projects require at least some planning at the outset. If you are collaborating with others on the project, it's also important that everyone is on the same page in terms of requirements and expectations for the final product. Here are some things to consider when planning your project:
Think through your intentions for the final product. You should consider the level of precision you wish to have in your final text. Should it be a facsimile-style representation of the full text? Or, are certain standards required for the repository you may be sending the text to in the future?
Consider the issues of ethics, copyright, intellectual property, and/or licensing conditions that may impact your OCR project. If you plan to make your OCR'd text available to others, ensure that it doesn't disclose confidential information or breach anyone's privacy. You may need to redact sensitive information, which some OCR software supports. You should also make sure that publishing/sharing the OCR'd text does not infringe on copyright laws, intellectual property policies, or licensing conditions. This may require negotiating and acquiring permissions to use content, which can take some time. If you need assistance with navigating copyright/intellectual property/licensing issues and concerns, check out the Library's Copyright and Intellectual Property Toolkit or contact us.
An OCR project can take a considerable amount of time and effort, depending on the number of documents you need to OCR, whether or not you'll be digitizing the documents, the quality of the documents and image preprocessing needs, the quality of the OCR results, and whether you'll need to edit/correct the initial OCR results.
Consider the best file format for your project, based on your research needs, who you want to have access to your text, and/or how you want to make it accessible to your intended audience. For example, if you (or your audience) need to view the digitized document as it appears in its original form as well as search, copy, and paste text, then a searchable PDF file format may be suitable for your project. If you (or your audience) will use the OCR'd text for text mining/analysis, then a plain text file format (.txt) may be best.
If you have a fairly large OCR project on your hands, drafting a project charter can help you to determine and document many of the considerations above, break the project into smaller segments, explicitly account for all required pieces/resources, and stay on track.
OCR software's ability to accurately recognize text in your document depends on the condition of the original documents and/or the quality of the digital scans. Whether you will be scanning your own documents or using documents that have already been digitized, it's best to assess the quality of your documents before attempting to OCR them. This will give you a good idea of what you'll need to do to help the OCR software produce the best possible results, such as editing your scanned images in Photoshop before OCRing. It will also help you to have realistic expectations of the quality of the OCR and plan accordingly, such as allotting time for correcting the initial OCR results.
Consider the following factors when assessing your documents:
The structural elements of your document (e.g., headings, images, tables, captions) may complicate the OCR process. You may need to de-skew or crop your image before OCRing. Some OCR software does this automatically or can be configured by the user to do so.
Handwritten text, special fonts, very small fonts (e.g., 6pt), and low contrast text can all decrease the accuracy of the OCR software.
Texts published before 1850 may not be the most compatible with OCR software due to the fonts used in printing and lack of definition/clarity of the characters.
Typescript results in poorer OCR than printed type, and inconsistent use of font faces and sizes can lower OCR accuracy.
Getting a quality image is the first step toward accurate OCR results. Consider things like resolution, brightness, straightness, and discoloration before you digitize your document(s). You may be able to scan the original document using your OCR program as your scanning software, which should have the best scanning settings for its OCR processing.
If you're scanning your document, follow these best practices:
The recommended resolution for scanning documents for optimal OCR accuracy is 300 dots per inch (dpi). However, if the text font size is particularly small (less than 10pt), a dpi of 400-600 may be best. If the resolution is too high, loading and processing the image will take more time, without improving the quality of the recognition.
Brightness settings that are too high or too low can cause defects in the text (overexposure or obscuring, respectively) and reduce the accuracy of the recognition. A brightness of 50% is usually best.
The straighter the initial scan, the better the OCR quality. Skewed pages can lead to inaccurate recognition.
Make sure that the document page fits entirely within the frame. Otherwise, content will be cut off and excluded from the scan.
If possible, exclude anything outside of the page of the document, such as the outlines of other pages. If this is not possible, you may have to remove any borders in your digital scans, as they can be erroneously recognized as characters.
Older and discolored documents must be scanned in RGB mode in order to capture all of the image data. Grayscale mode and, especially, black and white mode can cause loss in detail and proper contrast.
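The resolution guidance above can be sketched as a small helper. This is a minimal illustration in Python, not a setting taken from any particular scanner or OCR program: the function names `recommended_dpi` and `pixel_dimensions` are hypothetical, and the US Letter page size in the usage example is an assumption chosen only to show how quickly image size grows with resolution.

```python
def recommended_dpi(font_size_pt: float) -> tuple[int, int]:
    """Return a (minimum, maximum) scanning resolution in dpi.

    Encodes the rule of thumb above: 300 dpi for ordinary text,
    400-600 dpi when the font is smaller than 10pt.
    """
    if font_size_pt <= 0:
        raise ValueError("font size must be positive")
    if font_size_pt < 10:
        return (400, 600)
    return (300, 300)


def pixel_dimensions(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Return the scanned image size in pixels for a page measured in inches."""
    return (round(width_in * dpi), round(height_in * dpi))


if __name__ == "__main__":
    for size in (6, 10, 12):
        low, high = recommended_dpi(size)
        label = f"{low} dpi" if low == high else f"{low}-{high} dpi"
        print(f"{size}pt text: scan at {label}")

    # Assumed page size for illustration: US Letter, 8.5 x 11 inches.
    for dpi in (300, 600):
        width, height = pixel_dimensions(8.5, 11, dpi)
        print(f"{dpi} dpi: {width} x {height} pixels")
```

Doubling the resolution from 300 to 600 dpi quadruples the number of pixels per page, which is why a higher setting than the text requires mainly costs loading and processing time.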
If you're photographing your documents, apply the principles above and follow these best practices:
If possible, use a tripod to avoid shaking.
Shoot in color.
Shoot in an uncompressed format.
Shoot in ambient light (preferably daylight) that maximizes the contrast between the text and the background.
Shoot your document against a white board or piece of paper to set the white balance.
Use a white balance setting that picks up on areas that the camera thinks are white, then adjusts the color balance.
Turn off the flash to avoid glare.
Avoid using any sharpening or other contrast/clarity-boosting filters to prevent graininess.
Position the lens parallel to the plane of the document and point it toward the center of the text. The distance between the camera and the document should, usually, be 50-60 cm.
If you need access to high-quality scanning equipment, visit our Digital Stewardship Lab. For support, training, instruction, and consultation for digitization projects, contact us to get support from our Digital Creation Specialist. To learn more about digitization requests, please see our Specialized Digitization policy.
Depending on your project needs and the OCR software you choose, the text recognition step may be fairly straightforward or more complex. Here are some recommended practices for actually working with OCR software to recognize text in your documents:
Make sure that your choice of OCR software supports the needs of your project, such as the appropriate language(s), functionality, file formats, etc. Some programs may allow you to use additional language packages to supplement their language sets.
It's always best to familiarize yourself with the functions, features, and settings of the OCR software you're using. Reading through the manual and exploring the interface will help you to make the most of the software and configure the settings so that they're the most suitable for your project.
It may be worth using specialized training data, models/patterns, and/or dictionaries to increase recognition of text in your particular set of documents. Some OCR software enables you to modify or disable default patterns and dictionaries, if they're not appropriate for your documents, and to create/import your own.
In order to achieve or approximate 100% text accuracy, you may need to check and correct the text after the initial recognition process is complete. The OCR software cannot do this post-recognition verification itself. Keep in mind that the editing/correcting process can be labor-intensive and time-consuming, especially if the quality of the original document was poor and/or if you have a large amount of text to correct.
Review at least a sample of your OCR'd text to confirm that the text was recognized correctly. If there are a significant number of errors to correct, you'll want to take note of any patterns in errors so that you can correct them efficiently/consistently and document your process. This proofreading process can be done either in the OCR program you're using or in a text editor, preferably one with spelling and grammar checking.
To save time and effort, automate the checking and correction process by using text preprocessing tools for removing unwanted characters and white spaces, spelling correction, etc.
When hand-correcting the text, it's best to save the corrected text as a separate file, rather than overwriting the original output. This will enable you to return to the original output file in case something goes awry during the correction process. Relatedly, make sure to save as you go and, if possible, use a tool that supports file versioning.
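The automated cleanup and separate-file practices above might look something like the following Python sketch. It uses only the standard library. The function names `clean_ocr_text` and `save_corrected`, the specific cleanup rules, and the `.corrected` file naming are all assumptions made for illustration; the right rules depend on the error patterns you actually observe in your own output.

```python
import re
import tempfile
import unicodedata
from pathlib import Path


def clean_ocr_text(text: str) -> str:
    """Apply a few conservative, repeatable cleanup rules to OCR output."""
    # Normalize to a composed Unicode form so accented characters are encoded consistently.
    text = unicodedata.normalize("NFC", text)
    # Drop control characters (e.g. form feeds, carriage returns) except newline and tab.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces and tabs, then remove spaces next to line breaks.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    # Rejoin words split across a line break: "recog-\nnition" -> "recognition".
    # Caveat: this also joins genuinely hyphenated words that happen to break
    # at the hyphen, so review the results on a sample first.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def save_corrected(original: Path, corrected_text: str) -> Path:
    """Write corrected text next to the original file without overwriting anything."""
    target = original.with_name(f"{original.stem}.corrected{original.suffix}")
    # Mode "x" raises FileExistsError if the target already exists,
    # so an earlier round of corrections is never silently replaced.
    with target.open("x", encoding="utf-8") as handle:
        handle.write(corrected_text)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw_path = Path(tmp) / "page_001.txt"
        raw_path.write_text("The  recog-\nnition   step\n\n\n\nis done. \n", encoding="utf-8")
        cleaned = clean_ocr_text(raw_path.read_text(encoding="utf-8"))
        out_path = save_corrected(raw_path, cleaned)
        print(out_path.name)
        print(cleaned)
```

Keeping the rules in one function means the same corrections are applied identically to every file, and the function itself documents what was changed. The original OCR output stays untouched, so a rule that turns out to be too aggressive can be revised and rerun.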