Skip to Main Content

Course & Subject Guides

Text Mining & Analysis @ Pitt

An introduction to text mining/analysis and resources for finding text data, preparing text data for analysis, methods and tools for analyzing text data, and further readings regarding text mining and its various methods.

Text Classification

Text classification is the task of classifying a text or series of texts into one or more categories through natural language processing.  

 

Tools

 

Out-of-the-Box
  • MALLET
    For statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text
 
Programmatic

Python

  • NLTK (Natural Language Toolkit)
    For accessing corpora and lexicons, tokenization, stemming, (part-of-speech) tagging, parsing, transformations, translation, chunking, collocations, classification, clustering, topic segmentation, concordancing, frequency distributions, sentiment analysis, named entity recognition, probability distributions, semantic reasoning, evaluation metrics, manipulating linguistic data (in SIL Toolbox format), language modeling, and other NLP tasks

  • spaCy
    For tokenization, named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more

  • scikit-learn
    For classification, regression, clustering, dimensionality reduction, model selection, and preprocessing

  • NLP Architect
    For word chunking, named entity recognition, dependency parsing, intent extraction, sentiment classification, language models, transformations, Aspect Based Sentiment Analysis (ABSA), joint intent detection and slot tagging, noun phrase embedding representation (NP2Vec), most common word sense detection, relation identification, cross document coreference, noun phrase semantic segmentation, term set expansion, topics and trend analysis, optimizing NLP/NLU models

  • flair
    For part-of-speech tagging (PoS), named entity recognition (NER), classification, sense disambiguation, word and document embeddings

  • HuggingFace Transformers
    For classification, information extraction, question answering, summarization, translation, text generation, masked language prediction, and other NLP, NLU (Natural Language Understanding), and NLG (Natural Language Generation) tasks

  • TextBlob
    For processing textual data, providing a simple API for diving into common natural language processing (NLP) tasks such as noun phrase extraction, part-of-speech tagging, classification (Naive Bayes, Decision Tree), translation, tokenization (splitting text into words and sentences), parsing n-grams, word inflection (pluralization and singularization) and lemmatization, spelling correction, adding new models or languages through extensions, and wordNet integration

  • Spark NLP
    For tokenization, word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, spell checking, multi-class text classification, transformation (BERT, XLNet, ELMO, ALBERT, and Universal Sentence Encoder), multi-class sentiment analysis, machine translation (+180 languages), summarization and question Answering (Google T5), and many more NLP tasks

  • Pattern
    For webscraping (Google, Wikipedia, Twitter, Facebook, generic RSS, etc.), web crawling, HTML DOM parsing, part-of-speech tagging, n-gram search, sentiment analysis, vector space modeling, clustering, classification (KNN, SVM, Perceptron), graph centrality and visualization

  • fastText
    For text classification and representation learning

R

  • openNLP
    For NLP tasks such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection and coreference resolution

  • RcmdrPlugin.temis
    For performing a series of text mining tasks such as importing and cleaning a corpus, and analyses like terms and documents counts, vocabulary tables, terms co-occurrences and documents similarity measures, time series analysis, correspondence analysis and hierarchical clustering

  • RWEKA
    For stemming, data transformation, distribution-based balancing of datasets, replacing missing numerical values, dataset resampling, anonymization, normalization, classification, regression, clustering, association rules, and visualization

  • tm
    For importing and handling corpus data, metadata management, stemming, stop word deletion, removal of white space, string processing, count-based analysis methods, text clustering, text classification, and string kernels

Java

  • CoreNLP
    For deriving linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, coreference, sentiment analysis, quote attributions, and relations

  • Weka
    For data preprocessing (e.g., stemming, data resampling, transformation), classification, regression, clustering, latent semantic analysis (LSA, LSI), association rules, visualization, filtering, and anonymization