freely available on Amazon S3 in a Hadoop-friendly file format and licensed under a Creative Commons Attribution 3.0 Unported License. Knowledge of Amazon EMR, a managed cluster platform that simplifies running big data frameworks on AWS to process and analyze vast amounts of data, is required.
Comprised of content from 90+ academic libraries, it contains more than 13 million volumes. HathiTrust currently enables computational access (including text mining and topic modeling) to the 2.7 million public domain works and is looking toward access to works in copyright. A new set of tools that will allow easier interaction with HTRC content is in beta testing.
An online archive of high-resolution images of cultural heritage materials developed through the University of Pennsylvania. These public domain or Creative Commons licensed collections have machine-readable descriptions and technical metadata.
includes a collection of electronic literary and linguistic resources available for download that may be used as data sets. Downloadable formats supported include XML, HTML, and plain text.