This guide introduces text mining and analysis and gathers resources for finding text data, preparing it for analysis, choosing methods and tools to analyze it, and reading further about text mining and its various methods.
Open text data sources are freely available for use. Below is a list of open text data sources, organized by discipline.
U.S. Census Bureau Data
Here you can find the public data compiled by the U.S. Census Bureau in a single platform. See the FAQ for questions regarding how to use the U.S. Census Bureau APIs and how to access the data that has not been transferred yet to this platform.
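As a rough illustration of what a call to the Census Bureau APIs can look like, the sketch below builds a query URL and reshapes the kind of response the API returns (a JSON array whose first row holds the column names). The dataset path `2020/dec/pl`, the variable `P1_001N`, and the helper names are assumptions chosen for illustration, and the sample rows are made up; check the FAQ and the API documentation for the datasets and variables you actually need.

```python
import urllib.parse

CENSUS_BASE = "https://api.census.gov/data/"


def build_census_url(dataset, variables, geography):
    """Build a query URL; `dataset` is a path such as '2020/dec/pl' (assumed example)."""
    params = {"get": ",".join(variables), "for": geography}
    # Keep commas, colons and '*' readable instead of percent-encoding them.
    return CENSUS_BASE + dataset + "?" + urllib.parse.urlencode(params, safe=",:*")


def rows_to_records(rows):
    """Turn the API's list-of-lists (header row first) into a list of dicts."""
    header = rows[0]
    return [dict(zip(header, row)) for row in rows[1:]]


# Hypothetical response rows, shaped like the API output but not real data.
sample_rows = [
    ["NAME", "P1_001N", "state"],
    ["Example State", "12345", "00"],
]

url = build_census_url("2020/dec/pl", ["NAME", "P1_001N"], "state:*")
records = rows_to_records(sample_rows)
```

The URL can then be fetched with any HTTP client and the decoded JSON passed to `rows_to_records`.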
A wealth of open data from the U.S. government as well as tools and visualizations.
Online Books Page
Lists over 2 million free books on the internet (includes Project Gutenberg, HathiTrust, Google Books, publisher and institutional archives, etc.). Provides a section on non-English language texts.
Crossref text and data mining
Crossref can be used by researchers to easily harvest full text documents from participating publishers regardless of their business model (e.g. open access, subscription). Provides step-by-step instructions.
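One possible way to locate full-text links through Crossref is its public REST API, where a work's metadata may carry a `link` list whose entries are labelled with an `intended-application`. The sketch below is a minimal example under that assumption: the helper names are hypothetical, the sample record is invented, and whether a given link can actually be downloaded still depends on the publisher's license and your institution's access.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(query, rows=5):
    """Build a works query limited to records that list full-text links."""
    params = {"query": query, "filter": "has-full-text:true", "rows": rows}
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)


def text_mining_links(work):
    """Return the URLs that a work record marks as intended for text mining."""
    return [
        link["URL"]
        for link in work.get("link", [])
        if link.get("intended-application") == "text-mining"
    ]


def fetch_works(query, rows=5):
    """Query the API over the network and return the list of work records."""
    with urllib.request.urlopen(build_query_url(query, rows)) as resp:
        return json.load(resp)["message"]["items"]


# Hypothetical record shaped like one item of the API response (not real data).
sample_work = {
    "DOI": "10.0000/example",
    "link": [
        {"URL": "https://example.org/fulltext.xml",
         "content-type": "text/xml",
         "intended-application": "text-mining"},
        {"URL": "https://example.org/fulltext.pdf",
         "content-type": "application/pdf",
         "intended-application": "similarity-checking"},
    ],
}

links = text_mining_links(sample_work)
```

Calling `fetch_works("text mining")` and passing each returned record to `text_mining_links` would give a list of candidate URLs to request, subject to each publisher's terms.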
Culturomics Bookworm Viewer
Interface tool for queries in the Google Books corpus. Developed by the Culturomics team at Harvard, it is limited to those digitized texts that have metadata (full title, publication date, publication place, etc.) on OpenLibrary.org. Users can run queries in highly selective corpora based on subject (books on world history, American books on science, etc.), though these corpora are much smaller than the full Google Books collection.
Early English Books Online
EEBO-TCP is a partnership with ProQuest and more than 150 libraries to generate highly accurate, fully searchable, SGML/XML-encoded texts corresponding to books from the Early English Books Online database.
Europeana is a digital library with millions of books, paintings, films, museum objects and archival records that have been digitized by more than 2,000 institutions across Europe.
Contains the entire canon of Early Greek epic in the original and in translation, as well as Chaucer, Shakespeare, and Spenser. Texts are annotated or tagged by morphological, lexical, prosodic, and narratological criteria. User interface allows non-technical users to explore the greatly increased query potential of textual data for computer-assisted study.
Health and Science Sources
arXiv.org is a repository of electronic preprints of scientific papers in the fields of mathematics, physics, astronomy, computer science, quantitative biology, statistics, and quantitative finance, which can be accessed online. Produced by Cornell University.
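For text mining, arXiv metadata and abstracts can be retrieved programmatically through the arXiv API, which returns results as an Atom feed. The following is a minimal sketch of building a query and pulling titles and abstracts out of such a feed; the helper names are hypothetical and the sample feed is an invented placeholder, not real arXiv output.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(search_query, start=0, max_results=10):
    """Build an arXiv API query URL, e.g. for search_query='all:text mining'."""
    params = {"search_query": search_query, "start": start, "max_results": max_results}
    return ARXIV_API + "?" + urllib.parse.urlencode(params)


def parse_entries(atom_xml):
    """Extract title and abstract from each entry, collapsing line breaks and extra spaces."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        summary = entry.findtext("atom:summary", default="", namespaces=ATOM)
        entries.append({
            "title": " ".join(title.split()),
            "summary": " ".join(summary.split()),
        })
    return entries


# Invented placeholder feed with the same overall shape as an API response.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>A Hypothetical
      Paper Title</title>
    <summary>  Placeholder abstract text.  </summary>
  </entry>
</feed>"""

entries = parse_entries(SAMPLE_FEED)
```

In practice the feed would come from fetching `build_query_url(...)` with an HTTP client; the resulting abstracts can then be fed into whatever tokenization or analysis step follows.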
Over 315,000 full-text, peer-reviewed science, technology and medicine articles are available for text and data mining.