Text Mining & Analysis @ Pitt

An introduction to text mining/analysis and resources for finding text data, preparing text data for analysis, methods and tools for analyzing text data, and further readings regarding text mining and its various methods.

Term Frequency

Term frequency analysis gauges the importance of words in a text or set of texts by measuring how often they appear. Measures include raw counts and relative frequencies (a word's count as a share of all words in the text, often expressed as a percentage).
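As a rough illustration of the difference between raw and relative frequency, here is a minimal Python sketch that uses only the standard library. The function name `term_frequencies`, the sample sentence, and the simple tokenization rule (lowercase runs of letters and apostrophes) are assumptions made for this example; real projects usually need more careful tokenization and may want to remove stop words.

```python
import re
from collections import Counter

def term_frequencies(text):
    """Return (raw counts, relative frequencies) for the words in `text`."""
    # Simplifying assumption: a token is a lowercase run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    raw = Counter(tokens)
    total = sum(raw.values())
    # Relative frequency = a word's count divided by the total number of tokens.
    relative = {word: count / total for word, count in raw.items()} if total else {}
    return raw, relative

sample = "The cat sat on the mat. The mat was flat."
raw, relative = term_frequencies(sample)

# Print the three most frequent words with raw count and percentage.
for word, count in raw.most_common(3):
    print(f"{word}: {count} ({relative[word]:.1%})")
```

Relative frequencies make texts of different lengths comparable, which raw counts alone do not.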

 

Tools

Out-of-the-Box
  • HathiTrust+Bookworm
    Interactive line graph showing word use trends in 13.7 million HathiTrust volumes

  • Voyant Tools
    Web-based reading and analysis environment for digital texts, for performing tasks such as word frequencies, collocations, concordance, visualization (graphs, grids, word clouds, etc.); make sure to check out the tool documentation!

  • Google Ngram Viewer
    Online search engine that charts the frequencies of any set of search strings using a yearly count of n-grams found in sources printed between 1500 and 2019 in Google's text corpora in English, Chinese, French, German, Hebrew, Italian, Russian, or Spanish

  • AntConc
    Freeware, multi-platform, multi-purpose corpus analysis toolkit that hosts a comprehensive set of tools, including a powerful concordancer, word and keyword frequency generators, tools for cluster and lexical bundle analysis, and a word distribution plot

  • AntWord Profiler
    Vocabulary level and complexity analysis, word frequencies

  • WordHoard
    Application for the close reading and scholarly analysis of deeply tagged texts, including word frequencies, concordances, collocations, and scripting

  • CasualConc
    Concordance program for macOS, designed for exploratory-type text analysis and visualization of frequency data, including keyword in context (KWIC) concordance lines, word clusters, collocation analysis, and word count

  • Overview
    Open-source visualization and analysis tool designed for sets of documents; includes built-in OCR, a sophisticated search engine for full text search, document annotation, topic-based document clustering, entity detection, word clouds and other visualizations

  • TextSTAT
    Text analysis and concordance program for KWIC searches, word frequencies, and concordances

 
Programmatic

Python

  • NLTK (Natural Language Toolkit)
    For accessing corpora and lexicons, tokenization, stemming, (part-of-speech) tagging, parsing, transformations, translation, chunking, collocations, classification, clustering, topic segmentation, concordancing, frequency distributions, sentiment analysis, named entity recognition, probability distributions, semantic reasoning, evaluation metrics, manipulating linguistic data (in SIL Toolbox format), language modeling, and other NLP tasks

  • spaCy
    For tokenization, named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more

  • TextBlob
    For processing textual data through a simple API for common natural language processing (NLP) tasks such as noun phrase extraction, part-of-speech tagging, sentiment analysis, classification (Naive Bayes, Decision Tree), translation, tokenization (splitting text into words and sentences), word and phrase frequencies, parsing, n-grams, word inflection (pluralization and singularization) and lemmatization, spelling correction, adding new models or languages through extensions, and WordNet integration

R

  • tidytext
    For converting to and from non-tidy formats, word and document frequency analysis (tf-idf), n-grams and correlations, sentiment analysis with tidy data, and topic modeling

  • openNLP
    For NLP tasks such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection and coreference resolution

  • RcmdrPlugin.temis
    For performing a series of text mining tasks such as importing and cleaning a corpus, and analyses like terms and documents counts, vocabulary tables, terms co-occurrences and documents similarity measures, time series analysis, correspondence analysis and hierarchical clustering

  • RWeka
    For stemming, data transformation, distribution-based balancing of datasets, replacing missing numerical values, dataset resampling, anonymization, normalization, classification, regression, clustering, association rules, and visualization

  • tm
    For importing and handling corpus data, metadata management, stemming, stop word deletion, removal of white space, string processing, count-based analysis methods, text clustering, text classification, and string kernels

  • wordcloud
    For producing word clouds similar to those made famous by Wordle: features are laid out horizontally and vertically, with font size scaled by frequency

  • zipfR
    For loading, manipulating, and visualizing word frequency data and vocabulary growth curves, and for fitting several statistical models of the distribution of word frequencies in a population (the package name derives from the best-known word frequency distribution, Zipf's law)
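Several of the tools above go beyond plain counts to tf-idf, which weights a term's frequency in one document by how rare the term is across the whole set of documents. The following is a minimal Python sketch of one common formulation (relative term frequency multiplied by the natural log of the number of documents divided by the number of documents containing the term). The function names `tokenize` and `tf_idf` and the toy documents are assumptions made for this example; individual libraries may use different weighting or smoothing.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Simplifying assumption: a token is a lowercase run of letters or apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def tf_idf(documents):
    """documents: dict mapping a document name to its text.
    Returns a dict mapping each name to {term: tf-idf score}."""
    counts = {name: Counter(tokenize(text)) for name, text in documents.items()}
    n_docs = len(counts)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())

    scores = {}
    for name, term_counts in counts.items():
        total = sum(term_counts.values())
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        }
    return scores

docs = {
    "a": "apple banana apple",
    "b": "banana cherry",
}
scores = tf_idf(docs)
for name, term_scores in scores.items():
    print(name, {term: round(score, 3) for term, score in term_scores.items()})
```

Under this formulation a term that appears in every document scores zero, so the terms that stand out are those frequent in one document but uncommon elsewhere.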