Text Mining & Analysis @ Pitt

An introduction to text mining and analysis, with resources for finding text data, preparing text data for analysis, and choosing methods and tools for analyzing it, plus further readings on text mining and its various methods.

Topic Modeling

Topic modeling identifies "topics," clusters of words that tend to co-occur, in a text or series of texts, often with the aim of understanding recurring themes.
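
To make the idea of topics as clusters of co-occurring words concrete, here is a minimal sketch of Latent Dirichlet Allocation (LDA) fitted with a toy collapsed Gibbs sampler, using only the Python standard library. It is an illustration under stated assumptions, not the implementation used by any tool in this guide: the four-document corpus, the function names fit_lda and topic_proportions, and the values of num_topics, alpha, beta, and iterations are all hypothetical choices made for this example.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per topic, per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # tokens per topic overall
    assignments = []                                     # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        assignments.append([])
        for word in doc:
            k = rng.randrange(num_topics)
            assignments[d].append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][word] -= 1
                topic_total[old] -= 1
                # Weight = (how much this document uses topic t)
                #        * (how much topic t uses this word).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][word] += 1
                topic_total[new] += 1
    return doc_topic, topic_word


def topic_proportions(doc_topic, alpha=0.1):
    """Smoothed share of each topic in each document."""
    return [
        [(count + alpha) / (sum(row) + alpha * len(row)) for count in row]
        for row in doc_topic
    ]


# Hypothetical toy corpus: two documents about literature, two about biology.
docs = [
    "novel poem author novel verse".split(),
    "poem verse author poem".split(),
    "gene protein cell gene".split(),
    "cell protein gene cell enzyme".split(),
]

doc_topic, topic_word = fit_lda(docs)
proportions = topic_proportions(doc_topic)

for k, counts in enumerate(topic_word):
    print(f"Topic {k}:", [word for word, _ in counts.most_common(3)])
for d, shares in enumerate(proportions):
    print(f"Document {d}:", [round(share, 2) for share in shares])
```

Each topic is reported as its most frequent words, and each document as a mix of topics. On a corpus this small the result depends heavily on the random seed and the hyperparameters, so treat the output as a demonstration of the mechanics only; for real corpora, the tools listed below are the better choice.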

Tools

Out-of-the-Box
  • MALLET
    For statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text

  • Topic Modeling Tool
    For Latent Dirichlet Allocation (LDA) topic modeling

  • Factorie
    For natural language processing and information integration such as segmentation, tokenization, part-of-speech tagging, named entity recognition, dependency parsing, mention finding, coreference, lexicon-matching, and latent Dirichlet allocation

  • jsLDA
    For in-browser topic modeling

Programmatic

Python

  • Gensim
    For latent semantic analysis (LSA, LSI, SVD), unsupervised topic modeling (Latent Dirichlet allocation; LDA), embeddings (fastText, word2vec, doc2vec), non-negative matrix factorization (NMF), and term frequency–inverse document frequency (tf-idf)

  • NLTK (Natural Language Toolkit)
    For accessing corpora and lexicons, tokenization, stemming, (part-of-speech) tagging, parsing, transformations, translation, chunking, collocations, classification, clustering, topic segmentation, concordancing, frequency distributions, sentiment analysis, named entity recognition, probability distributions, semantic reasoning, evaluation metrics, manipulating linguistic data (in SIL Toolbox format), language modeling, and other NLP tasks

  • spaCy
    For tokenization, named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more

  • scikit-learn
    For classification, regression, clustering, dimensionality reduction, model selection, and preprocessing

  • NLP Architect
    For word chunking, named entity recognition, dependency parsing, intent extraction, sentiment classification, language models, transformations, Aspect Based Sentiment Analysis (ABSA), joint intent detection and slot tagging, noun phrase embedding representation (NP2Vec), most common word sense detection, relation identification, cross document coreference, noun phrase semantic segmentation, term set expansion, topics and trend analysis, and optimizing NLP/NLU models

  • Top2Vec
    For topic modeling, semantic search, and word and document embeddings

R

  • tidytext
    For converting to and from non-tidy formats, word and document frequency analysis (tf-idf), n-grams and correlations, sentiment analysis with tidy data, and topic modeling

  • topicmodels
    For fitting Latent Dirichlet Allocation (LDA) models and Correlated Topic Models (CTM); provides an interface to the C code for LDA and CTM by David M. Blei and co-authors and to the C++ code for fitting LDA models using Gibbs sampling by Xuan-Hieu Phan and co-authors

  • BTM
    For identifying topics in texts from term-term co-occurrences (hence 'biterm' topic model, BTM)

  • topicdoc
    For topic-specific diagnostics that assist in evaluating the quality of topics in LDA and CTM topic models

  • lda
    For Latent Dirichlet Allocation (LDA) and related topic models

  • stm (Structural Topic Model)
    For implementing a topic model variant that can include document-level metadata; also includes tools for model selection, visualization, and estimation of topic-covariate regressions

  • text2vec
    For text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), and similarities

  • mscstexta4r
    For sentiment analysis, topic detection, language detection, and key phrase extraction; provides an interface to the Microsoft Cognitive Services Text Analytics API

Java

  • Weka
    For data preprocessing (e.g., stemming, data resampling, transformation), classification, regression, clustering, latent semantic analysis (LSA, LSI), association rules, visualization, filtering, and anonymization