Text Mining & Analysis @ Pitt

An introduction to text mining and analysis, with resources for finding text data, preparing it for analysis, choosing methods and tools for analyzing it, and further reading on text mining and its various methods.

Collocation

Collocation analysis identifies words that commonly appear next to each other in a text or series of texts, in order to analyze the importance of those words.
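
As an illustration of the idea, the sketch below counts adjacent word pairs in a short sample and ranks them by pointwise mutual information (PMI), one common association measure for collocations. It uses only the Python standard library. The function names `tokenize` and `bigram_pmi`, the `min_count` threshold, and the sample sentence are assumptions made for this example and do not come from any of the tools listed on this page.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    PMI(x, y) = log2(P(x, y) / (P(x) * P(y))), with probabilities
    estimated from raw counts. Pairs seen fewer than min_count times
    are skipped, because PMI tends to overrate rare pairs.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(len(tokens) - 1, 1)

    scores = {}
    for (first, second), count in pair_counts.items():
        if count < min_count:
            continue
        p_pair = count / n_pairs
        p_first = word_counts[first] / n_words
        p_second = word_counts[second] / n_words
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))

    # Highest-scoring (most strongly associated) pairs first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample text, used only to exercise the functions above.
sample = (
    "Text mining finds patterns in text. "
    "Text mining tools count words, and text mining tools rank word pairs."
)
ranked = bigram_pmi(tokenize(sample))
for pair, score in ranked:
    print(pair, round(score, 2))
```

On a real corpus a higher `min_count` and a stop word filter would likely be needed. The tools listed below offer more robust tokenization and additional association measures.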

Tools

Out-of-the-Box
  • Voyant Tools
    Web-based reading and analysis environment for digital texts, supporting tasks such as word frequency counts, collocation, concordance, and visualization (graphs, grids, word clouds, etc.)

  • AntConc
    Freeware, multi-platform, multi-purpose corpus analysis toolkit that hosts a comprehensive set of tools, including a powerful concordancer, word and keyword frequency generators, tools for cluster and lexical bundle analysis, and a word distribution plot

  • WordHoard
    Application for the close reading and scholarly analysis of deeply tagged texts, including word frequencies, concordances, collocations, and scripting

  • CasualConc
    Concordance program for macOS, designed for exploratory-type text analysis and visualization of frequency data, including keyword in context (KWIC) concordance lines, word clusters, collocation analysis, and word count

Programmatic

Python

  • NLTK (Natural Language Toolkit)
    For accessing corpora and lexicons, tokenization, stemming, (part-of-speech) tagging, parsing, transformations, translation, chunking, collocations, classification, clustering, topic segmentation, concordancing, frequency distributions, sentiment analysis, named entity recognition, probability distributions, semantic reasoning, evaluation metrics, manipulating linguistic data (in SIL Toolbox format), language modeling, and other NLP tasks

  • spaCy
    For tokenization, named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more

  • scikit-learn
    For classification, regression, clustering, dimensionality reduction, model selection, and preprocessing

R

  • tidytext
    For converting to and from non-tidy formats, word and document frequency analysis (tf-idf), n-grams and correlations, sentiment analysis with tidy data, and topic modeling

  • openNLP
    For NLP tasks such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection and coreference resolution

  • RcmdrPlugin.temis
    For performing a series of text mining tasks such as importing and cleaning a corpus, and analyses like term and document counts, vocabulary tables, term co-occurrences and document similarity measures, time series analysis, correspondence analysis, and hierarchical clustering

  • RWeka
    For stemming, data transformation, distribution-based balancing of datasets, replacing missing numerical values, dataset resampling, anonymization, normalization, classification, regression, clustering, association rules, and visualization

  • tm
    For importing and handling corpus data, metadata management, stemming, stop word deletion, removal of white space, string processing, count-based analysis methods, text clustering, text classification, and string kernels