Pitt community: Write to Digital Scholarship Services or use our AskUs form
Pitt health sciences researchers: Contact Data Services, Health Sciences Library System
"Text Mining & Analysis @ Pitt" by University of Pittsburgh Library System is licensed for reuse under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Most text is created and stored for humans to read and use, but text analysis/mining requires your text data to be machine readable (i.e., in a form that a computer can process), structured, and clean. Hence, after gathering your text data, the next step usually entails optical character recognition (OCR), if you're working with document scans or images, and/or text preprocessing (e.g., parsing, cleaning, transforming).
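To make "cleaning" and "transforming" concrete, here is a minimal sketch of a preprocessing step in Python, using only the standard library. The function names (`clean_text`, `tokenize`) and the sample string are hypothetical, chosen for illustration, and the specific steps (stripping markup, normalizing Unicode, lowercasing) are assumptions about what a typical project might need, not a prescribed workflow. Your own data may call for different or additional steps.

```python
import html
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Turn a scraped snippet into plain, normalized, lowercase text."""
    text = html.unescape(raw)                    # decode entities such as &amp; and &nbsp;
    text = re.sub(r"<[^>]+>", " ", text)         # drop simple HTML tags
    text = unicodedata.normalize("NFKC", text)   # unify compatibility characters (e.g., non-breaking spaces)
    text = re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace
    return text.lower()


def tokenize(text: str) -> list[str]:
    """Split cleaned text into word tokens, keeping internal apostrophes."""
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)?", text)


# Hypothetical sample input, as it might come from a web page.
sample = "<p>Text&nbsp;Mining   &amp; Analysis\n@ Pitt</p>"

cleaned = clean_text(sample)   # "text mining & analysis @ pitt"
tokens = tokenize(cleaned)     # ["text", "mining", "analysis", "pitt"]
```

Lowercasing and dropping punctuation discard information (case, symbols such as "@"), so whether to apply them depends on your research question.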
This guide provides tools and helpful resources for the following text data preparation tasks: