Identifying and locating sources of existing data can be important for a variety of reasons, including: asking new questions or providing a new analysis of the data, comparing results from various studies, replicating and validating previous results, developing or testing computational models, and extending a study across time, geography, or other variables by incorporating data from multiple datasets.
The resources listed below can help you find relevant datasets for use in your research. Many also serve as repositories if you are interested in a place to deposit and share your own research data.
The listings focus mostly on publicly available open access data sources, rather than commercial databases. However, some resources may require subscriptions or involve other costs, and some may require registration or acceptance of data use agreements.
These online directories maintain lists of data sources and repositories across a wide range of disciplines.
re3data - A global registry of research data repositories covering a wide variety of academic subjects in the sciences, social sciences, and humanities.
Open Access Directory of Data Repositories - Lists of open access data repositories for a wide range of subject areas.
These repositories maintain data from a wide range of subject areas and are not limited to a particular discipline.
figshare - A repository for sharing all types of research output in any subject, including papers, figures, posters, and slides.
Amazon Web Services Public Data Sets * - Hosts a variety of large public datasets, such as Landsat, census, and genomic data. Creating an account may be required and charges may apply for computing time and data transfer.
D-Scholarship@Pitt - The University of Pittsburgh's institutional repository. It is just beginning to build its dataset collection.
* Charges may apply.
The following are examples of data repositories that focus on a particular subject area, discipline, or cluster of related disciplines within the broad categories of humanities, sciences, social sciences, and government. For more detailed listings of repositories related to specific disciplinary areas, check the ULS guide on Finding Data.
OLAC – Open Language Archives Community - An international partnership “creating a worldwide virtual library of language resources,” currently with 58 participating archives.
Mutopia Project - Free sheet music.
BIOLOGY / LIFE SCIENCES
DRYAD - General purpose repository for data underlying scientific and medical publications, historically with a concentration in life sciences.
Gene Expression Atlas - Information on gene expression patterns under different biological conditions, such as different cell types, organism parts, or diseases.
genenames.org (HUGO Gene Nomenclature Committee) - Curated repository of HGNC approved gene names and symbols, gene families, and links to related genomic, proteomic, and phenotypic information.
NCBI (National Center for Biotechnology Information) - Provides access to a variety of sources for biomedical and genomic data, including:
Conserved Domain Database (CDD) - Sequence alignments and profiles representing protein domains conserved in molecular evolution.
Gene - Gene data from a variety of species with related information, such as nomenclature, chromosome location, phenotypes, etc.
Database of Genotypes and Phenotypes (dbGaP) - Data and results from investigations of the interaction of genotypes and phenotypes in humans.
WormBase - Data on the genetics, genomics, and biology of C. elegans and some related nematodes.
UniProt (The Universal Protein Resource) - Collection of databases that provide a comprehensive source for protein sequence and annotation data, including a repository for metagenomics and environmental data.
eCrystals - Mostly open access source of fundamental and derived data from single crystal X-ray structure determinations from the University of Southampton and EPSRC UK National Crystallography Service.
PubChem - Database of chemical substances with descriptive and property information along with bioactivity screening data.
ZINC15 - Database of commercially available compounds with 3-D structure representations in a format ready for virtual screening for potential biological activity.
GTAP Database – Global Trade Analysis Project - Global database describing bilateral trade patterns, production, consumption and intermediate use of commodities and services.
National Agricultural Statistics Service - US agriculture data on production and supplies of food and fiber, prices paid and received by farmers, farm labor and wages, farm finances, chemical use, and changes in the demographics of U.S. producers. Includes state and county level data.
US Department of Labor, Bureau of Labor Statistics - Data on the US economy and society, including: Inflation & Prices, Employment, Unemployment, Pay & Benefits, Spending & Time Use, Productivity, Workplace Injuries, International, and Regional Resources.
US Federal Reserve System
FRASER® - Federal Reserve Archival System for Economic Research - Historical economic and banking data and policy documents. Digitized documents include various Federal Reserve publications, economic data publications, statistics, and Congressional documents.
FRED® - Federal Reserve Economic Data - Over 384,000 economic time series from 80 sources, including consumer price index, gross domestic product, unemployment, personal consumption expenditures, etc. Download data in spreadsheets or text.
ALFRED® - Archival FRED® - Versions of economic data that were available on specific dates in the past. Categories available include Money, Banking, & Finance; Population, Employment, & Labor Markets; Prices; International Data, and more.
GeoFRED® - Geographical Economic Data - Maps of data contained in FRED®. Create customized maps and download data.
U.S. Census Data - Covers the US, Puerto Rico, and island areas. Data on population, housing, demographics, economic characteristics, etc. from various censuses and surveys.
City of Pittsburgh Open Data - Housed at the Western Pennsylvania Regional Data Center. Topics include arts & culture, city facilities, environment, historic property, housing, public safety, recreation, transportation, and zoning.
Data.gov - Open data from the US government with a significant amount of geospatial data. Topics include: agriculture, business, climate, education, energy, health, local government, public safety, science & research, and more.
Pennsylvania State Data Center - Pennsylvania's official source of population and economic statistics.
Western Pennsylvania Regional Data Center - The WPRDC maintains Allegheny County and the City of Pittsburgh’s open data portal. Features data covering a variety of topics and geographies.
Many journals can be helpful tools in locating data, although they can play different roles as noted below.
Traditional Journals that Publish Data
These traditional "data journals" publish only articles that present data, whether experimental or computational, or that review experimental methods.
Journal of Physical and Chemical Reference Data - Publishes articles reporting critically evaluated reference data and property measurements.
Journal of Chemical and Engineering Data - Publishes both experimental and computational data.
Also sometimes known as "supplementary" information, these files are posted in the journal with their associated article when it is published. Files can contain data that support the content of the article, but are too extensive to include in the article itself or are not essential for every reader. Files may be in document form and not necessarily machine-readable. Although supporting information is often freely accessible even without a subscription to the journal, the publisher may still retain copyright in the files.
Data Journals or "Data Paper" Journals
These newer style "data journals" primarily publish articles that describe publicly available datasets and link to those datasets. They may also publish articles on data-related topics, such as describing or reviewing certain analytical or statistical methods. However, traditional research articles that actually analyze the data and draw conclusions from that analysis are generally outside the scope of these journals.
Biodiversity Data Journal - Community peer-reviewed and open-access. Promotes the publishing, dissemination and sharing of biodiversity-related data of any kind. Publishes data papers, general articles, software descriptions, species inventories, and more.
Earth System Science Data - An international interdisciplinary journal that provides a distinctive model for publishing papers about original research data sets and encouraging the reuse of high quality data. Includes methods and review articles and a "living data" process for handling datasets that undergo regular updating or extension.
IUCrData - Open-access and peer-reviewed. Provides descriptions of crystallographic datasets and datasets from related disciplines.
Scientific Data - Open-access and peer-reviewed. Its Data Descriptor articles describe data sets, the method of data collection and analyses relating to the quality of the data. They also link to one or more published sources of the data.
These journals publish a mixture of article types, including "data papers" that describe datasets along with traditional research articles and other formats.
International Journal of Robotics Research - Publishes peer-reviewed data papers and multimedia extensions in addition to articles.
Internet Archaeology - Open access and peer-reviewed. Publishes data papers as well as research articles, methodologies, reviews and more.
Nucleic Acids Research - Has published a special issue each January for more than 20 years that reports on databases related to bioinformatics generally, including nucleic acids, proteins, and genomics.
These are only a few examples of journals that can point you to useful data. For more complete listings, check these sites:
Sources of Dataset Peer Review (from the Edinburgh DataShare Wiki)
A Growing List of Data Journals (from Data@MLibrary)
Open Data Journals (from the FOSTER project)
If you’re reusing a dataset to inform your own work, you’ll want to make sure that you are providing proper recognition. Datasets are scholarly products and should be cited as such. If you are using a dataset that was deposited in a disciplinary data repository, you may find that the repository has a recommended citation standard.
ICPSR provides useful guidance on data citations and suggests that a citation for a dataset should include the following basic elements:
For general information about citing a dataset, see the following resources:
There are efforts among researchers, librarians, archivists, funders, publishers and others to develop and communicate a set of best practices around data citation. See: Data Citation Synthesis Group: Joint Declaration of Data Citation Principles. Martone M. (ed.) San Diego CA: FORCE11; 2014.