Pitt community: Write to Digital Scholarship Services or use our AskUs form
Pitt health sciences researchers: Contact Data Services, Health Sciences Library System
Topic modeling identifies "topics," or clusters of co-occurring words, in a text or series of texts, often with the aim of understanding recurring themes.
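To make "co-occurring words" concrete, here is a minimal sketch in plain Python that counts how often pairs of words appear in the same document. It is not a topic model, only an illustration of the raw co-occurrence signal that topic models build on. The toy corpus, the stopword set, and the function name `cooccurrence_counts` are all hypothetical and chosen for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; any list of short texts would do.
docs = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "stocks fell as the market closed",
    "the market rallied and stocks rose",
]

# Illustrative stopwords only, not a standard list.
STOPWORDS = {"the", "as", "and"}


def cooccurrence_counts(texts):
    """Count how many documents each unordered word pair appears in together."""
    counts = Counter()
    for text in texts:
        # Unique, sorted content words, so each pair is counted once per document
        # and always appears in the same (alphabetical) order.
        words = sorted(set(text.lower().split()) - STOPWORDS)
        counts.update(combinations(words, 2))
    return counts


pair_counts = cooccurrence_counts(docs)

# Show only the pairs that co-occur in more than one document.
for pair, n in sorted(pair_counts.items()):
    if n > 1:
        print(pair, n)
```

In this toy corpus, the pairs that recur fall into an "animals" group and a "finance" group. The tools listed below replace this simple counting with statistical models such as LDA.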
MALLET
For statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text
Topic Modeling Tool
For Latent Dirichlet Allocation (LDA) topic modeling
Factorie
For natural language processing and information integration such as segmentation, tokenization, part-of-speech tagging, named entity recognition, dependency parsing, mention finding, coreference, lexicon-matching, and latent Dirichlet allocation
jsLDA
For in-browser topic modeling
Gensim
For latent semantic analysis (LSA, LSI, SVD), unsupervised topic modeling (Latent Dirichlet allocation; LDA), embeddings (fastText, word2vec, doc2vec), non-negative matrix factorization (NMF), and term frequency–inverse document frequency (tf-idf)
NLTK (Natural Language Toolkit)
For accessing corpora and lexicons, tokenization, stemming, (part-of-speech) tagging, parsing, transformations, translation, chunking, collocations, classification, clustering, topic segmentation, concordancing, frequency distributions, sentiment analysis, named entity recognition, probability distributions, semantic reasoning, evaluation metrics, manipulating linguistic data (in SIL Toolbox format), language modeling, and other NLP tasks
spaCy
For tokenization, named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more
scikit-learn
For classification, regression, clustering, dimensionality reduction, model selection, and preprocessing
NLP Architect
For word chunking, named entity recognition, dependency parsing, intent extraction, sentiment classification, language models, transformations, Aspect Based Sentiment Analysis (ABSA), joint intent detection and slot tagging, noun phrase embedding representation (NP2Vec), most common word sense detection, relation identification, cross document coreference, noun phrase semantic segmentation, term set expansion, topics and trend analysis, and optimizing NLP/NLU models
Top2Vec
For topic modeling, semantic search, and word and document embeddings
tidytext
For converting to and from non-tidy formats, word and document frequency analysis (tf-idf), n-grams and correlations, sentiment analysis with tidy data, and topic modeling
topicmodels
For fitting Latent Dirichlet Allocation (LDA) models and Correlated Topic Models (CTM); provides an interface to the C code for LDA and CTM by David M. Blei and co-authors and to the C++ code for fitting LDA models using Gibbs sampling by Xuan-Hieu Phan and co-authors
BTM
For identifying topics in texts from term-term co-occurrences (hence "biterm" topic model, BTM)
topicdoc
For evaluating the quality of topics in LDA and CTM topic models; provides topic-specific diagnostics
lda
For Latent Dirichlet Allocation and related models similar to LSA and topic models
stm (Structural Topic Model)
For implementing a topic model derivative that can include document-level metadata; also includes tools for model selection, visualization, and estimation of topic-covariate regressions
text2vec
For text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), and similarities
mscstexta4r
For sentiment analysis, topic detection, language detection, and key phrase extraction; provides an interface to the Microsoft Cognitive Services Text Analytics API
Weka
For data preprocessing (e.g., stemming, data resampling, transformation), classification, regression, clustering, latent semantic analysis (LSA, LSI), association rules, visualization, filtering, and anonymization
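Several of the tools listed above mention term frequency–inverse document frequency (tf-idf). As a rough illustration of what that weighting does, the sketch below computes it in plain Python using one common variant, tf × log(N / df). This is an assumption made for illustration: real libraries differ in smoothing and normalization, so their numbers will not necessarily match. The corpus and the function name `tf_idf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased and split on whitespace below.
docs = [
    "topic models find themes",
    "topic models need text",
    "themes recur in text",
]


def tf_idf(texts):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [text.split() for text in texts]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {
                # Relative term frequency times inverse document frequency.
                term: (count / len(tokens)) * math.log(n_docs / df[term])
                for term, count in tf.items()
            }
        )
    return weights


weights = tf_idf(docs)

# Terms of the first document, highest weight first.
for term, weight in sorted(weights[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

Terms that occur in fewer documents receive higher weights than terms shared across documents, which is why tf-idf is often used to surface a document's distinctive vocabulary before or alongside topic modeling.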