Text Mining & Analysis @ Pitt

An introduction to text mining/analysis and resources for finding text data, preparing text data for analysis, methods and tools for analyzing text data, and further readings regarding text mining and its various methods.

Library Databases

Most of the Library's databases do not allow text mining due to license agreements with the publishers. However, the databases below can be accessed for text mining for the exclusive purpose of academic research under Fair Use. Each publisher has its own usage terms, conditions, and copyright provisions, which should be followed at all times. Some databases are free to mine, while others charge a fee for dataset requests. Please contact us if you have any questions or concerns about using Library databases as a data source for your research.

When mining these resources, please keep in mind the following tips (tips source: Univ. of Queensland Library):

  • Some publishers will require you to use tools they provide to mine their content, or will conduct the process for you. In this way they can manage the quantity of data being accessed and the impact on their servers.

  • Downloading large amounts of data can trigger automatic lockouts and prevent access to resources by other users. In some instances the publisher may apply a fee for the additional usage that sits outside of our existing agreement.
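The second tip can be illustrated with a short Python sketch that spaces requests out instead of sending them in a burst. This is a minimal example under stated assumptions, not a publisher-endorsed method: the `throttled` helper, its parameter names, and the default one-second interval are all hypothetical, and the appropriate request rate is whatever the publisher's terms specify.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def throttled(
    items: Iterable[T],
    fetch: Callable[[T], R],
    min_interval: float = 1.0,  # assumed value; check the publisher's terms
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[R]:
    """Call fetch(item) for each item, starting calls at least min_interval seconds apart."""
    last_start: Optional[float] = None
    for item in items:
        if last_start is not None:
            # Wait out whatever remains of the interval since the previous call began.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        yield fetch(item)
```

Here `fetch` stands for whatever function retrieves a single document. The `sleep` and `clock` parameters exist only so the pacing can be checked without real waiting.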

 

Gale Primary Sources

Gale Primary Sources comprises 24 individual databases, including an extensive newspaper collection and documents from different eras. It offers a Term Frequency search option and a Topic Finder viewer. Researchers can also obtain datasets through a librarian by paying a fee that varies according to the dataset and its size. See the Gale Data Mining page for more information.

 

HathiTrust

HathiTrust Research Center Analytics provides free computational access to HathiTrust, through which you can access nearly 14 million books for analysis. Please see the HathiTrust Research Center documentation for more information. For general information about HathiTrust, please also see our HathiTrust LibGuide.

 

JSTOR Data for Research

JSTOR Data for Research offers text mining tools for selecting and analyzing the content in JSTOR, which includes more than 12 million academic journals, books, and primary sources from all disciplines. Researchers may create a dataset of up to 25,000 documents (metadata and/or n-grams) using the self-service option, or may obtain larger datasets and full-text datasets by special request.

 

ProQuest

ProQuest's many primary source databases can be mined for a fee.

 

Science Direct

Science Direct, which provides access to a wide variety of scientific and medical research articles and books, can be mined through Elsevier's Full Text API. Please see the Elsevier Text and Data Mining Policy statement, the FAQs, and the developer's portal for more information.

 

SpringerLink

Researchers can mine the scientific journals and books found in the SpringerLink database by using the SpringerNature API Portal, which provides access and further instructions. See SpringerNature's text mining policy for details. The database may also be mined directly: content can be downloaded via a web browser or through an HTTP GET request using a scripting tool such as Python's urllib.
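As a rough illustration of the HTTP GET approach mentioned above, the following sketch uses Python's standard `urllib.request` module. It is not SpringerNature's documented method: the URL is a placeholder, and the `build_request` and `download` helpers and the User-Agent format are assumptions made for this example. Consult the publisher's policy for the actual endpoints and any required identification.

```python
import urllib.request

# Hypothetical placeholder; substitute a URL you are licensed to access.
EXAMPLE_URL = "https://example.org/path/to/article"


def build_request(url: str, contact: str) -> urllib.request.Request:
    """Build a GET request that identifies the researcher in the User-Agent header."""
    return urllib.request.Request(
        url,
        headers={"User-Agent": f"research-text-mining ({contact})"},
        method="GET",
    )


def download(url: str, contact: str, timeout: float = 30.0,
             opener=urllib.request.urlopen) -> bytes:
    """Send the GET request and return the raw response body as bytes."""
    request = build_request(url, contact)
    with opener(request, timeout=timeout) as response:
        return response.read()
```

A call such as `download(EXAMPLE_URL, "you@example.edu")` would return the page content as bytes, which can then be decoded and parsed. When fetching many documents, pause between requests to avoid triggering the automatic lockouts described in the tips above.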

 

Web of Science

Web of Science's multiple databases can be mined for free or for a fee, depending on the dataset and its size. To do so, researchers need to use the APIs provided by the publisher.