Guides: Text Mining & Analysis @ Pitt: Open Sources

Open Sources

Open text data sources are freely available for use. Below is a list of open text data sources, organized by discipline.

General Sources

U.S. Census Bureau Data
Here you can find the public data compiled by the U.S. Census Bureau in a single platform. See the FAQ for questions regarding how to use the U.S. Census Bureau APIs and how to access the data that has not been transferred yet to this platform.
Data.gov
A wealth of open data from the U.S. government as well as tools and visualizations.
Open Data Pennsylvania
The Pennsylvania state government provides many datasets for the general public.
Western Pennsylvania Regional Data Center
Numerous datasets on the region of Western Pennsylvania, as well as tools and a community network for building and sharing data.
Digital Public Library of America
DPLA offers a single point of access to millions of items from libraries, archives, and museums around the United States. Data is available for bulk download in JSON files.
Google Books
Search full text of books in many languages. Download books in the public domain. The Advanced Search allows you to filter for "full-view".
Internet Archive & Open Library
Offers over 10,000,000 fully accessible books and texts. The collection includes texts, audio, moving images, and software, as well as archived web pages. Instructions for downloading in bulk are available.
Online Books Page
Lists over 2 million free books on the internet (includes Project Gutenberg, HathiTrust, Google Books, publisher and institutional archives, etc.). Provides a section on non-English language texts.
Project Gutenberg
Project Gutenberg was the first producer of free electronic books (ebooks). Its catalog includes over 53,000 free books and over 100,000 titles. Here is the Project's Terms of Use.
Crossref text and data mining
Crossref can be used by researchers to easily harvest full text documents from participating publishers regardless of their business model (e.g. open access, subscription). Provides step-by-step instructions.
Wikidata
Structured data from Wikipedia and other open knowledge bases, available via direct download or API. Wikidata: data access.
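As a rough illustration of the API route mentioned for Wikidata, the sketch below builds a request to the `wbgetentities` action of the Wikidata API and pulls item labels out of the JSON reply. The helper names (`build_entity_url`, `extract_labels`, `fetch_labels`) and the User-Agent string are hypothetical and chosen only for this example. Check Wikidata's data access documentation for current usage guidelines before running anything at scale.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Public endpoint of the Wikidata (MediaWiki) API.
WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_entity_url(entity_ids, language="en"):
    """Build a wbgetentities URL asking only for labels (hypothetical helper)."""
    params = {
        "action": "wbgetentities",
        "ids": "|".join(entity_ids),  # several IDs are separated by "|"
        "props": "labels",
        "languages": language,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def extract_labels(response, language="en"):
    """Map each entity ID in a parsed reply to its label, if one exists."""
    labels = {}
    for entity_id, entity in response.get("entities", {}).items():
        label = entity.get("labels", {}).get(language)
        if label is not None:
            labels[entity_id] = label["value"]
    return labels


def fetch_labels(entity_ids, language="en"):
    """Request labels over the network; the User-Agent value is a placeholder."""
    request = Request(
        build_entity_url(entity_ids, language),
        headers={"User-Agent": "text-mining-guide-example/0.1"},
    )
    with urlopen(request, timeout=30) as reply:
        return extract_labels(json.load(reply), language)
```

Calling `fetch_labels(["Q42"])` would request the English label for item Q42 (the Wikidata item for Douglas Adams). For anything beyond a handful of items, the direct downloads are likely the better fit.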

Humanities and Social Sciences Sources

Chronicling America: Historical American Newspapers
Collection of digitized historical newspapers from 1789 to 1924. OCR batch downloads available.
Culturomics Bookworm Viewer
Interface tool for queries in the Google Books corpus. Developed by the Culturomics folks at Harvard, it is limited to those digitized texts that have metadata (full title, publication date, publication place, etc.) on OpenLibrary.org. Users can run queries in highly selective corpora based on subject (books on world history, American books on science, etc.), though these corpora are much smaller than the full Google Books collection.
Early English Books Online
EEBO-TCP is a partnership with ProQuest and with more than 150 libraries to generate highly accurate, fully-searchable, SGML/XML-encoded texts corresponding to books from the Early English Books Online Database.
Europeana APIs
Europeana is a digital library with millions of books, paintings, films, museum objects and archival records that have been digitized by more than 2,000 institutions across Europe.
University of Oxford Text Archive
A repository of digital literary and linguistic resources for research and teaching in higher education.
WordHoard
Contains the entire canon of Early Greek epic in the original and in translation, as well as Chaucer, Shakespeare, and Spenser. Texts are annotated or tagged by morphological, lexical, prosodic, and narratological criteria. User interface allows non-technical users to explore the greatly increased query potential of textual data for computer-assisted study.

Health and Science Sources

arXiv.org
arXiv.org is a repository of electronic preprints of scientific papers in the fields of mathematics, physics, astronomy, computer science, quantitative biology, statistics, and quantitative finance, which can be accessed online. Produced by Cornell University.
BioMed Central
Over 315,000 full-text, peer-reviewed science, technology and medicine articles are available for text and data mining.
PLOS
Public Library of Science. Provides access to its peer-reviewed articles. Provides a specific Text Mining Collection.
PubMed Central Databases and Text Mining Tools
Multiple text mining tools to analyze not only scholarly publications, but also other types of biomedical resources, such as Electronic Health Records.

Law Sources

CaseLaw Access Project
CAP includes all official, book-published United States case law: every volume designated as an official report of decisions by a court within the United States.

University of Pittsburgh Library System
