Guides: Text Mining & Analysis @ Pitt: Term Frequency

Term frequency analysis gauges the importance of words in a text or set of texts by measuring how often they appear. Measures include raw counts and relative frequencies (a word's count expressed as a proportion or percentage of all words in the text).
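As a rough illustration of the difference between raw and relative frequency, the following Python sketch counts words using only the standard library. The function name term_frequencies, the sample sentence, and the simple tokenization rule (lowercase, runs of letters and apostrophes) are assumptions made for this example; the tools listed below handle tokenization far more carefully.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Return (raw counts, relative frequencies) for the words in text."""
    # Assumption: lowercase everything and treat runs of letters or
    # apostrophes as words. Real projects usually need a better tokenizer.
    tokens = re.findall(r"[a-z']+", text.lower())
    raw = Counter(tokens)
    total = sum(raw.values())
    # Relative frequency = a word's count divided by the total word count.
    relative = {word: count / total for word, count in raw.items()} if total else {}
    return raw, relative


sample = "The whale, the white whale, is the whale."
raw, relative = term_frequencies(sample)
print(raw.most_common(2))          # the two most frequent words with raw counts
print(f"{relative['whale']:.1%}")  # share of all words that are "whale"
```

Raw counts suit a single text; relative frequencies make texts of different lengths comparable.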

Tools

Out-of-the-Box

HathiTrust+Bookworm
Interactive line graph showing word use trends in 13.7 million HathiTrust volumes
Voyant Tools
Web-based reading and analysis environment for digital texts, supporting word frequencies, collocations, concordances, and visualizations (graphs, grids, word clouds, etc.); be sure to check out the tool documentation
Google Ngram Viewer
Online search engine that charts the frequencies of any set of search strings using a yearly count of n-grams found in sources printed between 1500 and 2019 in Google's text corpora in English, Chinese, French, German, Hebrew, Italian, Russian, or Spanish
AntConc
Freeware, multi-platform, multi-purpose corpus analysis toolkit that hosts a comprehensive set of tools, including a powerful concordancer, word and keyword frequency generators, tools for cluster and lexical bundle analysis, and a word distribution plot
AntWord Profiler
Vocabulary level and complexity analysis, word frequencies
WordHoard
Application for the close reading and scholarly analysis of deeply tagged texts, including word frequencies, concordances, collocations, and scripting
CasualConc
Concordance program for macOS, designed for exploratory-type text analysis and visualization of frequency data, including keyword in context (KWIC) concordance lines, word clusters, collocation analysis, and word count
Overview
Open-source visualization and analysis tool designed for sets of documents; includes built-in OCR, a sophisticated search engine for full text search, document annotation, topic-based document clustering, entity detection, word clouds and other visualizations
TextSTAT
Text analysis and concordance program for KWIC searches, word frequencies, and concordances

Programmatic

Python

NLTK (Natural Language Toolkit)
For accessing corpora and lexicons, tokenization, stemming, (part-of-speech) tagging, parsing, transformations, translation, chunking, collocations, classification, clustering, topic segmentation, concordancing, frequency distributions, sentiment analysis, named entity recognition, probability distributions, semantic reasoning, evaluation metrics, manipulating linguistic data (in SIL Toolbox format), language modeling, and other NLP tasks
spaCy
For tokenization, named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more
TextBlob
For processing textual data, providing a simple API for common natural language processing (NLP) tasks such as noun phrase extraction, part-of-speech tagging, sentiment analysis, classification (Naive Bayes, Decision Tree), translation, tokenization (splitting text into words and sentences), word and phrase frequencies, parsing n-grams, word inflection (pluralization and singularization) and lemmatization, spelling correction, adding new models or languages through extensions, and WordNet integration
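Several of the libraries above report phrase (n-gram) frequencies as well as single-word frequencies. The sketch below shows the underlying idea with the Python standard library only; it does not use the API of NLTK, spaCy, or TextBlob, and the helper name ngram_counts and the sample phrase are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n=2):
    """Count every run of n consecutive tokens."""
    # Zipping n staggered copies of the token list yields each n-gram
    # as a tuple, e.g. ("to", "be") for n=2.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


# Assumption: whitespace splitting is good enough for this toy input.
tokens = "to be or not to be".split()
bigrams = ngram_counts(tokens, n=2)
print(bigrams.most_common(1))  # the most frequent two-word sequence
```

The same function with n=1 reproduces ordinary word counts, which is one way to see term frequency as the simplest case of n-gram frequency.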

R

tidytext
For converting to and from non-tidy formats, word and document frequency analysis (tf-idf), n-grams and correlations, sentiment analysis with tidy data, and topic modeling
openNLP
For NLP tasks such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection and coreference resolution
RcmdrPlugin.temis
For performing a series of text mining tasks such as importing and cleaning a corpus, and analyses like term and document counts, vocabulary tables, term co-occurrences and document similarity measures, time series analysis, correspondence analysis, and hierarchical clustering
RWeka
For stemming, data transformation, distribution-based balancing of datasets, replacing missing numerical values, dataset resampling, anonymization, normalization, classification, regression, clustering, association rules, and visualization
tm
For importing and handling corpus data, metadata management, stemming, stop word deletion, removal of white space, string processing, count-based analysis methods, text clustering, text classification, and string kernels
wordcloud
For word cloud visualizations similar to the well-known Wordle ones: words are laid out horizontally and vertically, with font size scaled by frequency
zipfR
For statistical modeling of word frequency distributions, with utilities for loading, manipulating, and visualizing word frequency data and vocabulary growth curves. (The name of the package derives from the most famous word frequency distribution, Zipf's law.)
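Zipf's law, mentioned above, says that a word's frequency is roughly inversely proportional to its rank in the frequency table. The following Python sketch builds a rank-frequency table to make that relationship visible. It is not the zipfR package or its API; the function name rank_frequency and the toy token list (constructed so the counts fall off exactly as 1/rank) are assumptions for illustration.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) rows, most frequent word first."""
    counts = Counter(tokens)
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]


# Toy data: counts of 6, 3, and 2 follow an idealized 1/rank pattern.
tokens = ["the"] * 6 + ["of"] * 3 + ["and"] * 2
for rank, word, count in rank_frequency(tokens):
    # Under an idealized Zipf distribution, rank * count stays constant.
    print(rank, word, count, rank * count)
```

Real corpora only approximate this pattern, which is why dedicated statistical models are useful for fitting the distribution.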

Helpful Resources

Example Projects

Ohge, Christopher. "Digital Text Analysis of Herman Melville’s Marginalia in Shakespeare [A Progress Report]." Christopher Ohge, September 13, 2018.

University of Pittsburgh Library System
