Guides: Text Mining & Analysis @ Pitt: Collocation

Collocation analysis identifies words that commonly appear next to each other in a text or a series of texts, in order to analyze those words' importance.
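As a rough illustration of the idea, the sketch below counts adjacent word pairs in a short sample and scores them with pointwise mutual information (PMI), one common association measure. The tokenizer, the function names (tokenize, bigram_pmi), the min_count threshold, and the sample sentence are all assumptions made for this example; the tools listed in this guide offer more measures and more careful tokenization.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Pairs seen fewer than min_count times are skipped, because PMI
    tends to overrate pairs built from very rare words.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(n_words - 1, 1)

    scores = {}
    for (first, second), count in pair_counts.items():
        if count < min_count:
            continue
        p_pair = count / n_pairs
        p_first = word_counts[first] / n_words
        p_second = word_counts[second] / n_words
        # PMI compares how often the pair occurs with how often it
        # would occur if the two words were independent.
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample text, used only for illustration.
sample = (
    "The text mining workshop covers text mining basics. "
    "Text mining tools help scholars with text analysis."
)
ranked = bigram_pmi(tokenize(sample))
for pair, score in ranked:
    print(pair, round(score, 2))
```

In this sample only the pair ("text", "mining") occurs often enough to pass the min_count filter. On a real corpus the threshold and the choice of association measure would need tuning.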

Tools

Out-of-the-Box

Voyant Tools
Web-based reading and analysis environment for digital texts, supporting word frequencies, collocations, concordances, and visualizations (graphs, grids, word clouds, etc.)
AntConc
Freeware, multi-platform, multi-purpose corpus analysis toolkit that hosts a comprehensive set of tools, including a powerful concordancer, word and keyword frequency generators, tools for cluster and lexical bundle analysis, and a word distribution plot
WordHoard
Application for the close reading and scholarly analysis of deeply tagged texts, including word frequencies, concordances, collocations, and scripting
CasualConc
Concordance program for macOS, designed for exploratory-type text analysis and visualization of frequency data, including keyword in context (KWIC) concordance lines, word clusters, collocation analysis, and word count
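
Several of the tools above produce keyword in context (KWIC) concordance lines, which show each occurrence of a word with its neighboring words. The sketch below is a minimal approximation of such a view, not the method any of these tools uses. The function name kwic, the width parameter, and the sample sentence are hypothetical.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, hit, right context) for each match of keyword."""
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            # Take up to `width` tokens on each side of the hit.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Hypothetical sample sentence, used only for illustration.
sample = "Strong tea is common, but strong coffee and strong opinions are too."
for left, hit, right in kwic(sample, "strong", width=2):
    print(f"{left:>20} [{hit}] {right}")
```

Reading down the right-hand context column is one informal way to spot candidate collocates of the keyword before measuring them statistically.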

Programmatic

Python

NLTK (Natural Language Toolkit)
For accessing corpora and lexicons, tokenization, stemming, (part-of-speech) tagging, parsing, transformations, translation, chunking, collocations, classification, clustering, topic segmentation, concordancing, frequency distributions, sentiment analysis, named entity recognition, probability distributions, semantic reasoning, evaluation metrics, manipulating linguistic data (in SIL Toolbox format), language modeling, and other NLP tasks
spaCy
For tokenization, named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more
scikit-learn
For classification, regression, clustering, dimensionality reduction, model selection, and preprocessing

R

tidytext
For converting to and from non-tidy formats, word and document frequency analysis (tf-idf), n-grams and correlations, sentiment analysis with tidy data, and topic modeling
openNLP
For NLP tasks such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection and coreference resolution
RcmdrPlugin.temis
For performing a series of text mining tasks such as importing and cleaning a corpus, and analyses such as term and document counts, vocabulary tables, term co-occurrences and document similarity measures, time series analysis, correspondence analysis, and hierarchical clustering
RWeka
For stemming, data transformation, distribution-based balancing of datasets, replacing missing numerical values, dataset resampling, anonymization, normalization, classification, regression, clustering, association rules, and visualization
tm
For importing and handling corpus data, metadata management, stemming, stop word deletion, removal of white space, string processing, count-based analysis methods, text clustering, text classification, and string kernels

Helpful Resources

Example Projects

Merriam, Thomas. 2019. “‘Six-Word Collocations in Shakespeare and Sir Thomas More’—Revisited.” Notes and Queries 66 (3): 415–16.
