Pitt community: Write to Digital Scholarship Services or use our AskUs form
Pitt health sciences researchers: Contact Data Services, Health Sciences Library System
Text classification is the task of assigning a text or series of texts to one or more categories using natural language processing.
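To make the definition concrete, here is a minimal sketch of a text classifier written with only the Python standard library. It assumes a multinomial Naive Bayes model with add-one smoothing, which is one common approach among many. The class name `NaiveBayesTextClassifier`, the `tokenize` helper, and the toy training sentences are all hypothetical, invented for this illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesTextClassifier:
    """Illustrative multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # number of documents per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()                  # all words seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Start from the log prior, then add one log likelihood per token.
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            scores[label] = score
        # Return the label with the highest log score.
        return max(scores, key=scores.get)


# Hypothetical toy training data, invented for this illustration.
train_texts = [
    "the team won the match in overtime",
    "the striker scored two goals",
    "the senate passed the budget bill",
    "the president vetoed the new law",
]
train_labels = ["sports", "sports", "politics", "politics"]

classifier = NaiveBayesTextClassifier().fit(train_texts, train_labels)
prediction = classifier.predict("the striker scored in the match")
print(prediction)
```

A sketch like this is only meant to show the shape of the task (labeled texts in, a predicted category out). The tools listed below provide more robust tokenization, feature extraction, and models.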
NLTK (Natural Language Toolkit)
For accessing corpora and lexicons, tokenization, stemming, (part-of-speech) tagging, parsing, transformations, translation, chunking, collocations, classification, clustering, topic segmentation, concordancing, frequency distributions, sentiment analysis, named entity recognition, probability distributions, semantic reasoning, evaluation metrics, manipulating linguistic data (in SIL Toolbox format), language modeling, and other NLP tasks
spaCy
For tokenization, named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more
scikit-learn
For classification, regression, clustering, dimensionality reduction, model selection, and preprocessing
NLP Architect
For word chunking, named entity recognition, dependency parsing, intent extraction, sentiment classification, language models, transformations, aspect-based sentiment analysis (ABSA), joint intent detection and slot tagging, noun phrase embedding representation (NP2Vec), most common word sense detection, relation identification, cross-document coreference, noun phrase semantic segmentation, term set expansion, topic and trend analysis, and optimizing NLP/NLU models
flair
For part-of-speech tagging (PoS), named entity recognition (NER), classification, sense disambiguation, word and document embeddings
HuggingFace Transformers
For classification, information extraction, question answering, summarization, translation, text generation, masked language prediction, and other NLP, NLU (Natural Language Understanding), and NLG (Natural Language Generation) tasks
TextBlob
For processing textual data through a simple API for common natural language processing (NLP) tasks such as noun phrase extraction, part-of-speech tagging, classification (Naive Bayes, Decision Tree), translation, tokenization (splitting text into words and sentences), parsing, n-grams, word inflection (pluralization and singularization) and lemmatization, spelling correction, adding new models or languages through extensions, and WordNet integration
Spark NLP
For tokenization, word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, spell checking, multi-class text classification, transformation (BERT, XLNet, ELMo, ALBERT, and Universal Sentence Encoder), multi-class sentiment analysis, machine translation (180+ languages), summarization and question answering (Google T5), and many more NLP tasks
Pattern
For web scraping (Google, Wikipedia, Twitter, Facebook, generic RSS, etc.), web crawling, HTML DOM parsing, part-of-speech tagging, n-gram search, sentiment analysis, vector space modeling, clustering, classification (KNN, SVM, Perceptron), graph centrality and visualization
fastText
For text classification and representation learning
OpenNLP
For NLP tasks such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection and coreference resolution
RcmdrPlugin.temis
For performing a series of text mining tasks such as importing and cleaning a corpus, and analyses such as term and document counts, vocabulary tables, term co-occurrences and document similarity measures, time series analysis, correspondence analysis and hierarchical clustering
RWeka
For stemming, data transformation, distribution-based balancing of datasets, replacing missing numerical values, dataset resampling, anonymization, normalization, classification, regression, clustering, association rules, and visualization
tm
For importing and handling corpus data, metadata management, stemming, stop word deletion, removal of white space, string processing, count-based analysis methods, text clustering, text classification, and string kernels
CoreNLP
For deriving linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, coreference, sentiment analysis, quote attributions, and relations
Weka
For data preprocessing (e.g., stemming, data resampling, transformation), classification, regression, clustering, latent semantic analysis (LSA, LSI), association rules, visualization, filtering, and anonymization
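As an illustration of how one of the Python libraries listed above might be applied to text classification, the following sketch uses scikit-learn to chain a TF-IDF vectorizer with a Naive Bayes classifier. The choice of `TfidfVectorizer` and `MultinomialNB` is an assumption made for this example rather than a recommendation from this guide, and the toy texts and labels are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data, invented for this illustration.
texts = [
    "the team won the match in overtime",
    "the striker scored two goals",
    "the senate passed the budget bill",
    "the president vetoed the new law",
]
labels = ["sports", "sports", "politics", "politics"]

# The pipeline first converts raw text into TF-IDF feature vectors,
# then fits a multinomial Naive Bayes classifier on those vectors.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Classify a new, unseen sentence.
predicted = model.predict(["the striker scored in the match"])
print(predicted[0])
```

Other estimators from the same library could be swapped in for `MultinomialNB` without changing the rest of the pipeline. A real project would also hold out labeled data to evaluate the model.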