Text Mining & Analysis @ Pitt

An introduction to text mining/analysis, with resources for finding text data, preparing text data for analysis, methods and tools for analyzing text data, and further readings on text mining and its various methods.

Preprocessing Text

Whether you’re working with digitized or born-digital text, you will likely have to preprocess your text data before you can properly analyze it. The algorithms used in natural language processing work best when the text data is structured, with at least some regular, identifiable patterns. To identify the preprocessing steps your project requires, you’ll need to know which data structure/format best suits the analysis methods and tools you plan to use.

 

Common Issues

 

  • Noisy data: corrupted, distorted, meaningless, or irrelevant data that impede machine reading and/or adversely affect the results of any data mining analysis. 

    • Irrelevant text, such as stop words (e.g., “the”, “a”, “an”, “in”, “she”), numbers, punctuation, symbols, and markup language tags (e.g., HTML and XML)

    • Images, tables, and figures may complicate data extraction from documents (e.g., causing OCR software to misrecognize characters and garble or introduce noise into the text data).

    • Low-quality OCR’d text, whether due to the age or quality of the document, the font type, or the sophistication of the OCR algorithm, may contain typos, garbled text, and other errors (e.g., the letters 'rn' misread as the letter 'm').

    • Formatting elements such as headers, footers, and column breaks can create noise in your text data (e.g., journal titles and page numbers may not be relevant to your analysis).

  • Unstructured data: data that does not have a predefined data model or format. Often, specific data will need to be extracted, categorized, formatted, and/or otherwise organized so that it is usable for a specific text mining task or set of tasks.

 

Common Techniques

 

  • Removing stop words: filter out commonly used and auxiliary words (e.g., “the”, “a”, “an”, “in”, “she”); see the first sketch following this list

  • Removing irrelevant characters: ignore numbers, punctuation, symbols, etc.

  • Removing markup language tags (e.g., HTML, SGML, XML)

  • Normalizing case: remove redundancies by ignoring case (e.g., "key" and "Key" will not be considered different words if case is ignored)

  • Correcting errors: fix typos, garbled text (e.g., unwanted symbols in place of letters), and other errors (especially common in OCR’d text)

  • Tokenization: split a sequence of strings into tokens (e.g., words, keywords, phrases, sentences, and other elements); enables analysis at the chosen level of segmentation; used in models like bag-of-words for term frequency counting, text clustering, and document matching tasks; see the second sketch following this list

  • Stemming: reduce inflected words to their root forms (e.g., trouble, troubled, troubling → troubl-); improves the performance of text clustering tasks by reducing dimensions (i.e., the number of terms to be processed)

  • Lemmatization: reduce inflected words to their lemma, or linguistic root word, the canonical/dictionary form of the word (e.g., swims, swimming, swam → swim); improves the performance of text clustering tasks by reducing dimensions (i.e., the number of terms to be processed)

  • Part-of-Speech (PoS) tagging: assign a tag to each token in a document to denote its part of speech (e.g., noun, verb, adjective); enables semantic analysis on unstructured text

  • Text classification/categorization: assign tags or categories to text according to predefined topics or categories
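
To make these techniques concrete, here is a minimal Python sketch of the cleaning steps (removing markup, irrelevant characters, and stop words, and normalizing case), using NLTK’s English stop word list. The sample string and the naive tag-stripping pattern are illustrative assumptions, not a production recipe:

    import re
    from nltk.corpus import stopwords  # one-time setup: nltk.download('stopwords')

    raw = "<p>The 3 swimmers swam; she swims daily!</p>"  # illustrative input

    text = re.sub(r'<[^>]+>', ' ', raw)       # strip markup tags (naive; see Beautiful Soup below)
    text = re.sub(r'[^A-Za-z\s]', ' ', text)  # remove numbers, punctuation, and symbols
    text = text.lower()                       # normalize case

    stop_words = set(stopwords.words('english'))
    tokens = [w for w in text.split() if w not in stop_words]  # drop stop words
    print(tokens)  # ['swimmers', 'swam', 'swims', 'daily']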
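
And a second minimal sketch of the token-level steps (tokenization, stemming, lemmatization, and PoS tagging), again with NLTK. The sample sentence is an illustrative assumption, and each tagger or corpus needs a one-time nltk.download():

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer
    # one-time setup: nltk.download('punkt'), nltk.download('wordnet'),
    # and nltk.download('averaged_perceptron_tagger')

    tokens = nltk.word_tokenize("She swims; the troubled swimmer swam.")  # tokenization

    stemmer = PorterStemmer()
    print([stemmer.stem(t) for t in tokens])  # stemming, e.g., 'troubled' -> 'troubl'

    lemmatizer = WordNetLemmatizer()
    print([lemmatizer.lemmatize(t.lower(), pos='v') for t in tokens])  # lemmatization, e.g., 'swam' -> 'swim'

    print(nltk.pos_tag(tokens))  # PoS tagging, e.g., ('swims', 'VBZ')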

Tools

 

Out-of-the-Box
  • OpenRefine
    For fetching, exploring, cleaning, transforming, reconciling and matching data

  • Factorie
    For natural language processing and information integration such as segmentation, tokenization, part-of-speech tagging, named entity recognition, dependency parsing, mention finding, coreference, lexicon-matching, and latent Dirichlet allocation

  • VARD 2
    For cleaning historical texts by normalizing spelling variation (particularly in Early Modern English)

  • TextFixer
    For changing case, removing whitespace and line breaks, sorting and converting text

  • Porter stemmer online
    For stemming text

  • Lexos
    For removing characters, whitespace, and stop words, and for lemmatizing text

  • Text editors with regular-expression functionality: Atom, Notepad++, Sublime Text, etc.
    For find and replace operations and input validation
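
The same kind of find-and-replace cleanup can also be scripted. Here is a minimal sketch with Python’s built-in re module; the running-header and page-number patterns are illustrative assumptions:

    import re

    text = "Page 12\nTHE JOURNAL OF EXAMPLES\nActual content here."  # illustrative input

    text = re.sub(r'^Page \d+$', '', text, flags=re.MULTILINE)                 # drop page numbers
    text = re.sub(r'^THE JOURNAL OF EXAMPLES$', '', text, flags=re.MULTILINE)  # drop running header
    print(text.strip())  # 'Actual content here.'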

 
Programmatic

Python

  • NLTK (Natural Language Toolkit)
    For accessing corpora and lexicons, tokenization, stemming, (part-of-speech) tagging, parsing, transformations, translation, chunking, collocations, classification, clustering, topic segmentation, concordancing, frequency distributions, sentiment analysis, named entity recognition, probability distributions, semantic reasoning, evaluation metrics, manipulating linguistic data (in SIL Toolbox format), language modeling, and other NLP tasks
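
    For instance, a quick frequency distribution; a minimal sketch, assuming nltk.download('punkt') has been run and using an illustrative sample string:

        import nltk

        tokens = nltk.word_tokenize("the cat sat on the mat")
        fdist = nltk.FreqDist(tokens)           # count token frequencies
        print(fdist.most_common(3))             # e.g., [('the', 2), ('cat', 1), ('sat', 1)]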

  • spaCy
    For tokenization, named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more
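
    A minimal sketch, assuming the small English model has been installed (python -m spacy download en_core_web_sm); the sample sentence is illustrative:

        import spacy

        nlp = spacy.load("en_core_web_sm")
        doc = nlp("Jonas Salk developed the polio vaccine in Pittsburgh.")

        for token in doc:
            print(token.text, token.lemma_, token.pos_)   # token, lemma, part of speech
        print([(ent.text, ent.label_) for ent in doc.ents])  # named entities, e.g., ('Pittsburgh', 'GPE')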

  • scikit-learn
    For classification, regression, clustering, dimensionality reduction, model selection, and preprocessing
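
    For example, turning a toy, illustrative corpus into a sparse document-term matrix, a common preprocessing step for clustering and classification:

        from sklearn.feature_extraction.text import CountVectorizer

        corpus = ["the cat sat", "the dog sat", "the dog barked"]  # illustrative documents
        vectorizer = CountVectorizer(stop_words='english')  # drops built-in English stop words
        dtm = vectorizer.fit_transform(corpus)              # sparse document-term matrix
        print(vectorizer.get_feature_names_out())           # ['barked' 'cat' 'dog' 'sat']
        print(dtm.toarray())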

  • NLP Architect
    For word chunking, named entity recognition, dependency parsing, intent extraction, sentiment classification, language models, transformations, Aspect Based Sentiment Analysis (ABSA), joint intent detection and slot tagging, noun phrase embedding representation (NP2Vec), most common word sense detection, relation identification, cross document coreference, noun phrase semantic segmentation, term set expansion, topics and trend analysis, optimizing NLP/NLU models

  • flair
    For part-of-speech tagging (PoS), named entity recognition (NER), classification, sense disambiguation, word and document embeddings
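
    A minimal NER sketch, assuming the pretrained English NER model (downloaded on first use); the sentence is illustrative:

        from flair.data import Sentence
        from flair.models import SequenceTagger

        tagger = SequenceTagger.load("ner")   # fetches a pretrained NER model on first use
        sentence = Sentence("Jonas Salk worked in Pittsburgh.")
        tagger.predict(sentence)
        for entity in sentence.get_spans("ner"):
            print(entity.text, entity.tag)    # e.g., 'Jonas Salk' PER, 'Pittsburgh' LOC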

  • HuggingFace Transformers
    For classification, information extraction, question answering, summarization, translation, text generation, masked language prediction, and other NLP, NLU (Natural Language Understanding), and NLG (Natural Language Generation) tasks
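
    A minimal sketch using the high-level pipeline API (a default pretrained model is downloaded on first run; the sample sentence is illustrative):

        from transformers import pipeline

        classifier = pipeline("sentiment-analysis")  # default pretrained English sentiment model
        print(classifier("Preprocessing this corpus was surprisingly painless."))
        # e.g., [{'label': 'POSITIVE', 'score': 0.99}]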

  • TextBlob
    For processing textual data via a simple API for common natural language processing (NLP) tasks such as noun phrase extraction, part-of-speech tagging, classification (Naive Bayes, Decision Tree), translation, tokenization (splitting text into words and sentences), parsing, n-grams, word inflection (pluralization and singularization) and lemmatization, spelling correction, adding new models or languages through extensions, and WordNet integration
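
    For instance, a minimal sketch (TextBlob relies on NLTK corpora, installable via python -m textblob.download_corpora; the sample sentence is illustrative):

        from textblob import TextBlob

        blob = TextBlob("The quick study analyzed two corpora.")
        print(blob.words)      # tokenized words
        print(blob.tags)       # part-of-speech tags
        print(blob.sentiment)  # polarity and subjectivity scores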

  • Spark NLP
    For tokenization, word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, spell checking, multi-class text classification, transformer-based embeddings (BERT, XLNet, ELMo, ALBERT, and Universal Sentence Encoder), multi-class sentiment analysis, machine translation (180+ languages), summarization and question answering (Google T5), and many more NLP tasks

  • PyNLPI
    For tokenization, n-gram extraction, building simple language models, Levenshtein distance calculation, and parsing and processing file formats common in NLP (e.g., FoLiA, GIZA++, Moses++, ARPA, TiMBL)

  • Polyglot
    For tokenization (165 languages), language detection (196 languages), named entity recognition (40 languages), part-of-speech tagging (16 languages), sentiment analysis (136 languages), word embeddings (137 languages), morphological analysis (135 languages), transliteration (69 languages)

  • Stanza
    For tokenizing (words and sentences), multi-word token expansion, lemmatization, part-of-speech and morphology tagging, dependency parsing, and named entity recognition in a pipeline
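
    A minimal pipeline sketch, assuming the English models have been fetched with stanza.download('en'); the sentence is illustrative:

        import stanza

        nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma')
        doc = nlp("The swimmers swam quickly.")
        for sentence in doc.sentences:
            for word in sentence.words:
                print(word.text, word.lemma, word.upos)  # e.g., swam swim VERB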

  • tmtoolkit
    For tokenization, part-of-speech (POS) tagging (via spaCy), lemmatization and term normalization, pattern matching (exact matching, regular expressions or “glob” patterns) for various methods (e.g., for filtering on token, document or document label level, or for keywords-in-context), adding and managing custom token metadata, accessing word vectors (word embeddings), generating n-grams, generating sparse document-term matrices, expanding compound words and “gluing” of specified subsequent tokens

  • Pattern
    For web scraping (Google, Wikipedia, Twitter, Facebook, generic RSS, etc.), web crawling, HTML DOM parsing, part-of-speech tagging, n-gram search, sentiment analysis, vector space modeling, clustering, classification (KNN, SVM, Perceptron), and graph centrality and visualization

  • Beautiful Soup
    For parsing and extracting data from HTML and XML documents
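
    A more robust alternative to the naive regex tag-stripping shown earlier; the HTML snippet is illustrative:

        from bs4 import BeautifulSoup

        html = "<html><body><h1>Title</h1><p>Body &amp; text.</p></body></html>"  # illustrative input
        soup = BeautifulSoup(html, "html.parser")
        print(soup.get_text(separator=" ", strip=True))  # 'Title Body & text.'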

R

  • openNLP
    For NLP tasks such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection and coreference resolution

  • tm
    For importing and handling corpus data, metadata management, stemming, stop word deletion, removal of white space, string processing, count-based analysis methods, text clustering, text classification, and string kernels

  • quanteda
    For corpus management, creating and manipulating tokens and ngrams, exploring keywords in context, forming and manipulating sparse matrices of documents by features and feature co-occurrences, analyzing keywords, computing feature similarities and distances, applying content dictionaries, applying supervised and unsupervised machine learning, sentiment analysis, visually representing text, and more.

  • RcmdrPlugin.temis
    For performing a series of text mining tasks, such as importing and cleaning a corpus, and analyses such as term and document counts, vocabulary tables, term co-occurrences, document similarity measures, time series analysis, correspondence analysis, and hierarchical clustering

  • RWeka
    For stemming, data transformation, distribution-based balancing of datasets, replacing missing numerical values, dataset resampling, anonymization, normalization, classification, regression, clustering, association rules, and visualization

  • boilerpipeR
    For the extraction and sanitizing of text content from HTML files: removal of ads, sidebars, and headers using the boilerpipe Java library

  • stringi
    For conversion of text encodings, string searching and collation in any locale, Unicode normalization of text, handling texts with mixed reading direction (e.g., left to right and right to left), and text boundary analysis (for tokenizing on different aggregation levels or to identify suitable line wrapping locations); provides R language wrappers to the International Components for Unicode (ICU) library

  • Rstem
    For stemming; (available from Omegahat) alternative interface to a C version of Porter's word stemming algorithm

  • ore
    For handling regular expressions, based on the Onigmo Regular Expression Library; offers first-class compiled regex objects, partial matching and function-based substitutions, amongst other features

  • hunspell
    For stemming and spell-checking languages with rich morphology and complex word compounding or character encoding. The package can check and analyze individual words as well as search for incorrect words within a text, LaTeX, or (R package) manual document

  • sentencepiece
    For unsupervised tokenizing, producing Byte Pair Encoding (BPE), Unigram, Char, or Word models

  • tokenizers
    For splitting text into tokens, supporting shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, lines, and regular expressions

  • tokenizers.bpe
    For splitting text into subword tokens, implemented using Byte Pair Encoding and the YouTokenToMe library

Java

  • CoreNLP
    For deriving linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, coreference, sentiment analysis, quote attributions, and relations

  • Weka
    For data preprocessing (e.g., stemming, data resampling, transformation), classification, regression, clustering, latent semantic analysis (LSA, LSI), association rules, visualization, filtering, and anonymization

Helpful Resources