Text Mining & Analysis @ Pitt

An introduction to text mining/analysis, with resources for finding text data, preparing text data for analysis, methods and tools for analyzing text data, and further readings on text mining and its various methods.

Preprocessing Text

Whether you’re working with digitized or born-digital text, you will likely have to preprocess your text data before you can properly analyze it. The algorithms used in natural language processing work best when the text data is structured, with at least some regular, identifiable patterns. To identify the preprocessing steps your project requires, you’ll need to know which data structure/format best suits the analysis methods and tools you plan to use.

 

Common Issues

 

  • Noisy data: corrupted, distorted, meaningless, or irrelevant data that impede machine reading and/or adversely affect the results of any data mining analysis. 

    • Irrelevant text, such as stop words (e.g., “the”, “a”, “an”, “in”, “she”), numbers, punctuation, symbols, and markup language tags (e.g., HTML and XML)

    • Images, tables, and figures may complicate data extraction from documents (e.g., causing OCR software to misrecognize characters and garble or introduce noise into the text data).

    • Low-quality OCR’d text, whether due to the age or quality of the document, the font type, or the sophistication of the OCR algorithm, may contain typos, garbled text, and other errors (e.g., the letters 'rn' misread as the letter 'm').

    • Formatting elements such as headers, footers, and column breaks can create noise in your text data (e.g., journal titles and page numbers may not be relevant to your analysis).

  • Unstructured data: data that does not have a predefined data model or format. Often, specific data will need to be extracted, categorized, formatted, and/or otherwise organized so that it is usable for a specific text mining task or set of tasks.

 

Common Techniques

 

  • Removing stop words: filter out commonly used and auxiliary words (e.g., “the”, “a”, “an”, “in”, “she”); see the first sketch following this list

  • Removing irrelevant characters: ignore numbers, punctuation, symbols, etc.

  • Removing markup language tags (e.g., HTML, SGML, XML)

  • Normalizing case: remove redundancies by ignoring case (e.g., "key" and "Key" will not be considered different words if case is ignored)

  • Correcting errors: fix typos, garbled text (e.g., unwanted symbols in place of letters), and other errors (especially common in OCR’d text)

  • Tokenization: split a sequence of strings into tokens (e.g., words, keywords, phrases, sentences, and other elements); enables analysis at the chosen level of segmentation; used in models like bag-of-words for term frequency counting, text clustering, and document matching tasks; see the second sketch following this list

  • Stemming: reduce inflected words to their root forms (e.g., trouble, troubled, troubling → troubl-); improves the performance of text clustering tasks by reducing dimensions (i.e., the number of terms to be processed)

  • Lemmatization: reduce inflected words to their lemma, or linguistic root word, the canonical/dictionary form of the word (e.g., swims, swimming, swam → swim); improves the performance of text clustering tasks by reducing dimensions (i.e., the number of terms to be processed)

  • Part-of-Speech (PoS) tagging: assign a tag to each token in a document to denote its part of speech (e.g., noun, verb, adjective); enables semantic analysis on unstructured text

  • Text classification/categorization: assign tags or categories to text according to predefined topics or categories
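
To make these techniques concrete, here is a minimal Python sketch of the cleaning steps (removing markup, irrelevant characters, and stop words, and normalizing case), using NLTK’s English stop word list. The sample string and the naive tag-stripping pattern are illustrative assumptions, not a production recipe:

    import re
    from nltk.corpus import stopwords  # one-time setup: nltk.download('stopwords')

    raw = "<p>The 3 swimmers swam; she swims daily!</p>"  # illustrative input

    text = re.sub(r'<[^>]+>', ' ', raw)       # strip markup tags (naive; see Beautiful Soup below)
    text = re.sub(r'[^A-Za-z\s]', ' ', text)  # remove numbers, punctuation, and symbols
    text = text.lower()                       # normalize case

    stop_words = set(stopwords.words('english'))
    tokens = [w for w in text.split() if w not in stop_words]  # drop stop words
    print(tokens)  # ['swimmers', 'swam', 'swims', 'daily']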
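
And a second minimal sketch of the token-level steps (tokenization, stemming, lemmatization, and PoS tagging), again with NLTK. The sample sentence is an illustrative assumption, and each tagger or corpus needs a one-time nltk.download():

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer
    # one-time setup: nltk.download('punkt'), nltk.download('wordnet'),
    # and nltk.download('averaged_perceptron_tagger')

    tokens = nltk.word_tokenize("She swims; the troubled swimmer swam.")  # tokenization

    stemmer = PorterStemmer()
    print([stemmer.stem(t) for t in tokens])  # stemming, e.g., 'troubled' -> 'troubl'

    lemmatizer = WordNetLemmatizer()
    print([lemmatizer.lemmatize(t.lower(), pos='v') for t in tokens])  # lemmatization, e.g., 'swam' -> 'swim'

    print(nltk.pos_tag(tokens))  # PoS tagging, e.g., ('swims', 'VBZ')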

Tools

 

Out-of-the-Box
  • OpenRefine
    For fetching, exploring, cleaning, transforming, reconciling and matching data

  • Factorie
    For natural language processing and information integration such as segmentation, tokenization, part-of-speech tagging, named entity recognition, dependency parsing, mention finding, coreference, lexicon-matching, and latent Dirichlet allocation

  • VARD 2
    For cleaning historical texts by normalizing spelling variation (particularly in Early Modern English)

  • TextFixer
    For changing case, removing whitespace and line breaks, sorting and converting text

  • Porter stemmer online
    For stemming text

  • Lexos
    For removing characters, whitespace, and stop words, and for lemmatizing text

  • Text editors with regular-expression functionality: Atom, Notepad++, Sublime Text, etc.
    For find and replace operations and input validation
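
The same kind of find-and-replace cleanup can also be scripted. Here is a minimal sketch with Python’s built-in re module; the running-header and page-number patterns are illustrative assumptions:

    import re

    text = "Page 12\nTHE JOURNAL OF EXAMPLES\nActual content here."  # illustrative input

    text = re.sub(r'^Page \d+$', '', text, flags=re.MULTILINE)                 # drop page numbers
    text = re.sub(r'^THE JOURNAL OF EXAMPLES$', '', text, flags=re.MULTILINE)  # drop running header
    print(text.strip())  # 'Actual content here.'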

 
Programmatic

Python

  • NLTK (Natural Language Toolkit)
    For accessing corpora and lexicons, tokenization, stemming, (part-of-speech) tagging, parsing, transformations, translation, chunking, collocations, classification, clustering, topic segmentation, concordancing, frequency distributions, sentiment analysis, named entity recognition, probability distributions, semantic reasoning, evaluation metrics, manipulating linguistic data (in SIL Toolbox format), language modeling, and other NLP tasks
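
    For instance, a quick frequency distribution; a minimal sketch, assuming nltk.download('punkt') has been run and using an illustrative sample string:

        import nltk

        tokens = nltk.word_tokenize("the cat sat on the mat")
        fdist = nltk.FreqDist(tokens)           # count token frequencies
        print(fdist.most_common(3))             # e.g., [('the', 2), ('cat', 1), ('sat', 1)]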

  • spaCy
    For tokenization, named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more
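
    A minimal sketch, assuming the small English model has been installed (python -m spacy download en_core_web_sm); the sample sentence is illustrative:

        import spacy

        nlp = spacy.load("en_core_web_sm")
        doc = nlp("Jonas Salk developed the polio vaccine in Pittsburgh.")

        for token in doc:
            print(token.text, token.lemma_, token.pos_)   # token, lemma, part of speech
        print([(ent.text, ent.label_) for ent in doc.ents])  # named entities, e.g., ('Pittsburgh', 'GPE')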

  • scikit-learn
    For classification, regression, clustering, dimensionality reduction, model selection, and preprocessing
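
    For example, turning a toy, illustrative corpus into a sparse document-term matrix, a common preprocessing step for clustering and classification:

        from sklearn.feature_extraction.text import CountVectorizer

        corpus = ["the cat sat", "the dog sat", "the dog barked"]  # illustrative documents
        vectorizer = CountVectorizer(stop_words='english')  # drops built-in English stop words
        dtm = vectorizer.fit_transform(corpus)              # sparse document-term matrix
        print(vectorizer.get_feature_names_out())           # ['barked' 'cat' 'dog' 'sat']
        print(dtm.toarray())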

  • NLP Architect
    For word chunking, named entity recognition, dependency parsing, intent extraction, sentiment classification, language models, transformations, Aspect Based Sentiment Analysis (ABSA), joint intent detection and slot tagging, noun phrase embedding representation (NP2Vec), most common word sense detection, relation identification, cross document coreference, noun phrase semantic segmentation, term set expansion, topics and trend analysis, optimizing NLP/NLU models

  • flair
    For part-of-speech tagging (PoS), named entity recognition (NER), classification, sense disambiguation, word and document embeddings
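
    A minimal NER sketch, assuming the pretrained English NER model (downloaded on first use); the sentence is illustrative:

        from flair.data import Sentence
        from flair.models import SequenceTagger

        tagger = SequenceTagger.load("ner")   # fetches a pretrained NER model on first use
        sentence = Sentence("Jonas Salk worked in Pittsburgh.")
        tagger.predict(sentence)
        for entity in sentence.get_spans("ner"):
            print(entity.text, entity.tag)    # e.g., 'Jonas Salk' PER, 'Pittsburgh' LOC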

  • HuggingFace Transformers
    For classification, information extraction, question answering, summarization, translation, text generation, masked language prediction, and other NLP, NLU (Natural Language Understanding), and NLG (Natural Language Generation) tasks
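
    A minimal sketch using the high-level pipeline API (a default pretrained model is downloaded on first run; the sample sentence is illustrative):

        from transformers import pipeline

        classifier = pipeline("sentiment-analysis")  # default pretrained English sentiment model
        print(classifier("Preprocessing this corpus was surprisingly painless."))
        # e.g., [{'label': 'POSITIVE', 'score': 0.99}]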

  • TextBlob
    For processing textual data via a simple API for common natural language processing (NLP) tasks such as noun phrase extraction, part-of-speech tagging, classification (Naive Bayes, Decision Tree), translation, tokenization (splitting text into words and sentences), parsing, n-grams, word inflection (pluralization and singularization) and lemmatization, spelling correction, adding new models or languages through extensions, and WordNet integration
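
    For instance, a minimal sketch (TextBlob relies on NLTK corpora, installable via python -m textblob.download_corpora; the sample sentence is illustrative):

        from textblob import TextBlob

        blob = TextBlob("The quick study analyzed two corpora.")
        print(blob.words)      # tokenized words
        print(blob.tags)       # part-of-speech tags
        print(blob.sentiment)  # polarity and subjectivity scores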

  • Spark NLP
    For tokenization, word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, spell checking, multi-class text classification, transformer-based embeddings (BERT, XLNet, ELMo, ALBERT, and Universal Sentence Encoder), multi-class sentiment analysis, machine translation (180+ languages), summarization and question answering (Google T5), and many more NLP tasks

  • PyNLPI
    For tokenization, n-gram extraction, building simple language models, Levenshtein distance calculation, and parsing and processing file formats common in NLP (e.g., FoLiA, GIZA++, Moses++, ARPA, TiMBL)

  • Polyglot
    For tokenization (165 languages), language detection (196 languages), named entity recognition (40 languages), part-of-speech tagging (16 languages), sentiment analysis (136 languages), word embeddings (137 languages), morphological analysis (135 languages), transliteration (69 languages)

  • Stanza
    For tokenizing (words and sentences), multi-word token expansion, lemmatization, part-of-speech and morphology tagging, dependency parsing, and named entity recognition in a pipeline
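
    A minimal pipeline sketch, assuming the English models have been fetched with stanza.download('en'); the sentence is illustrative:

        import stanza

        nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma')
        doc = nlp("The swimmers swam quickly.")
        for sentence in doc.sentences:
            for word in sentence.words:
                print(word.text, word.lemma, word.upos)  # e.g., swam swim VERB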

  • tmtoolkit
    For tokenization, part-of-speech (POS) tagging (via spaCy), lemmatization and term normalization, pattern matching (exact matching, regular expressions or “glob” patterns) for various methods (e.g., for filtering on token, document or document label level, or for keywords-in-context), adding and managing custom token metadata, accessing word vectors (word embeddings), generating n-grams, generating sparse document-term matrices, expanding compound words and “gluing” of specified subsequent tokens

  • Pattern
    For web scraping (Google, Wikipedia, Twitter, Facebook, generic RSS, etc.), web crawling, HTML DOM parsing, part-of-speech tagging, n-gram search, sentiment analysis, vector space modeling, clustering, classification (KNN, SVM, Perceptron), and graph centrality and visualization

  • Beautiful Soup
    For parsing and extracting data from HTML and XML documents
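
    A more robust alternative to the naive regex tag-stripping shown earlier; the HTML snippet is illustrative:

        from bs4 import BeautifulSoup

        html = "<html><body><h1>Title</h1><p>Body &amp; text.</p></body></html>"  # illustrative input
        soup = BeautifulSoup(html, "html.parser")
        print(soup.get_text(separator=" ", strip=True))  # 'Title Body & text.'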

R

  • openNLP
    For NLP tasks such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection and coreference resolution

  • tm
    For importing and handling corpus data, metadata management, stemming, stop word deletion, removal of white space, string processing, count-based analysis methods, text clustering, text classification, and string kernels

  • quanteda
    For corpus management, creating and manipulating tokens and ngrams, exploring keywords in context, forming and manipulating sparse matrices of documents by features and feature co-occurrences, analyzing keywords, computing feature similarities and distances, applying content dictionaries, applying supervised and unsupervised machine learning, sentiment analysis, visually representing text, and more.

  • RcmdrPlugin.temis
    For performing a series of text mining tasks, such as importing and cleaning a corpus, and analyses such as term and document counts, vocabulary tables, term co-occurrences, document similarity measures, time series analysis, correspondence analysis, and hierarchical clustering

  • RWeka
    For stemming, data transformation, distribution-based balancing of datasets, replacing missing numerical values, dataset resampling, anonymization, normalization, classification, regression, clustering, association rules, and visualization

  • boilerpipeR
    For the extraction and sanitizing of text content from HTML files: removal of ads, sidebars, and headers using the boilerpipe Java library

  • stringi
    For conversion of text encodings, string searching and collation in any locale, Unicode normalization of text, handling texts with mixed reading direction (e.g., left to right and right to left), and text boundary analysis (for tokenizing on different aggregation levels or to identify suitable line wrapping locations); provides R language wrappers to the International Components for Unicode (ICU) library

  • Rstem
    For stemming; (available from Omegahat) alternative interface to a C version of Porter's word stemming algorithm

  • ore
    For handling regular expressions, based on the Onigmo Regular Expression Library; offers first-class compiled regex objects, partial matching and function-based substitutions, amongst other features

  • hunspell
    For stemming and spell-checking languages with rich morphology and complex word compounding or character encoding. The package can check and analyze individual words as well as search for incorrect words within a text, LaTeX, or (R package) manual document

  • sentencepiece
    For unsupervised tokenizing, producing Byte Pair Encoding (BPE), Unigram, Char, or Word models

  • tokenizers
    For splitting text into tokens, supporting shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, lines, and regular expressions

  • tokenizers.bpe
    For splitting text into subword tokens, implemented using Byte Pair Encoding and the YouTokenToMe library

Java

  • CoreNLP
    For deriving linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, coreference, sentiment analysis, quote attributions, and relations

  • Weka
    For data preprocessing (e.g., stemming, data resampling, transformation), classification, regression, clustering, latent semantic analysis (LSA, LSI), association rules, visualization, filtering, and anonymization

Helpful Resources