Guides: Text Mining & Analysis @ Pitt: Preparing Text Data

Preparing Text Data

Most text is created and stored for humans to read and use, but text analysis and mining require text data that is machine readable (i.e., in a form that a computer can process), structured, and clean. Hence, after gathering your text data, the next step usually entails optical character recognition (OCR), if you're working with document scans or images, and/or text preprocessing (e.g., parsing, cleaning, transforming).
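
As a rough illustration of what cleaning and transforming can involve, here is a minimal Python sketch that uses only the standard library. The function name `preprocess` and the specific steps chosen (Unicode normalization, rejoining words hyphenated across line breaks, lowercasing, simple tokenization) are assumptions made for this example, not a method prescribed by this guide; real projects often need different or additional steps.

```python
import re
import unicodedata


def preprocess(raw: str) -> list[str]:
    """Illustrative cleanup: normalize, de-hyphenate, lowercase, tokenize."""
    # Normalize Unicode so compatibility forms (e.g., ligatures such as "ﬁ")
    # are replaced by their plain equivalents.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words split by a hyphen at a line break, which is common in
    # scanned or OCR'd text. Note: this also merges genuine hyphenated
    # compounds that happen to break across lines.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Lowercase so "Text" and "text" count as the same token.
    text = text.lower()
    # Tokenize: runs of letters or digits, allowing internal apostrophes
    # (e.g., "isn't"). Punctuation and whitespace are discarded.
    return re.findall(r"[^\W_]+(?:'[^\W_]+)*", text)


if __name__ == "__main__":
    sample = "Text min-\ning is FUN!  Isn't it?"
    print(preprocess(sample))
```

Whether to lowercase, strip punctuation, or rejoin hyphenated words depends on the research question, so treat each step here as optional.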

This guide provides tools and helpful resources for the following text data preparation tasks:

Optical Character Recognition (OCR)
Preprocessing text data

University of Pittsburgh Library System
