Pitt community: write to Digital Scholarship Services or use our AskUs form
Pitt health sciences researchers: contact Data Services, Health Sciences Library System
Dominic Bordelon, dbordelon@pitt.edu
"Data Sharing @ Pitt" by University of Pittsburgh Library System is licensed for reuse under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Consider the file formats of your data files: for example, Excel spreadsheets (XLSX), comma-separated values (CSV), or Word documents of interview transcripts. File formats vary not only in their content and information storage methods, but also in how tightly they are controlled as intellectual property. Some formats, such as CSV and XML (eXtensible Markup Language), are openly specified, free to implement, and readily accessible in a broad variety of software. Meanwhile, certain other formats, such as Stata’s .dta format, are proprietary and typically require the format creator’s software to use. As followers of the FAIR Principles (Accessibility and Interoperability, specifically), we would like to minimize the amount of data that we need to offer in proprietary formats and maximize the amount offered in open formats.
A certain class of formats is formally proprietary but functionally open: the Microsoft Office formats (such as DOCX and XLSX) and the Adobe Portable Document Format (PDF). While the companies maintain ownership of the IP, they have also shared documentation with the public developer community, which allows file reading and writing to be implemented in environments such as Python, R, the command line, or a graphical software product. Practically speaking, this means that reviewers and future readers will probably be able to access and analyze your Word documents, Excel spreadsheets, and PDFs. However, it is always a good idea to also export or convert these rich formats into a more basic, open format such as CSV or plain text (TXT). When you perform such conversions, you will lose rich formatting; this is an inevitable tradeoff.
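As an illustration of what "functionally open" means in practice, the sketch below reads a Word document using only Python's standard library and writes its text to a UTF-8 TXT file. It relies on the fact that a DOCX file is a ZIP archive whose main text lives in `word/document.xml`. The function names `docx_to_text` and `export_docx_as_txt` are hypothetical, and the approach assumes a simple document: tabs, manual line breaks, headers, footnotes, and all rich formatting are dropped.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_path):
    """Return the body text of a .docx file, one paragraph per line.

    Hypothetical helper for illustration; formatting is discarded.
    """
    with zipfile.ZipFile(docx_path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join every text run in the paragraph, ignoring bold, italics, etc.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def export_docx_as_txt(docx_path, txt_path):
    """Write the extracted text to a plain-text file encoded in UTF-8."""
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write(docx_to_text(docx_path))
```

For example, `export_docx_as_txt("transcript.docx", "transcript.txt")` would place a plain-text copy next to the original (both file names are placeholders). For spreadsheets, the "Save As" or "Export" command in Excel offers a comparable route to CSV.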
Offering data in more formats also hedges against risk to any one file: for example, if a data set’s Excel file gets corrupted due to “bit rot” (i.e., entropy in the data storage system) in some distant future, then the data may still be recoverable from the corresponding CSV file.
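Corruption of this kind is easier to act on if it can be detected. One common approach, not prescribed by this guide, is to record a checksum for each file when you deposit it and compare it again later. The sketch below uses Python's standard `hashlib` module; the helper names `sha256_of` and `is_unchanged` are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 checksum of a file as a hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_checksum):
    """Compare a file's current checksum with one recorded earlier."""
    return sha256_of(path) == recorded_checksum
```

If `is_unchanged` returns `False` for the Excel file but `True` for its CSV counterpart, the CSV is the copy to recover from.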
The table below lists some types of information and examples of associated open file formats. If you are sharing information listed in the left column, check whether it is already in a format listed in the right column.
Information type | Open file format(s) |
---|---|
Text | TXT (encoded in UTF-8 or ASCII) |
Tabular | CSV, TSV |
Structured and/or tagged document | XML, TEX, HTML |
Image | PNG, JPEG*, SVG |
Audio | FLAC, MP3*, WAV |
Video | H.264 codec in MP4, MKV |
Archive (collection of files) | ZIP, TAR, GZ |
* Asterisk (*) denotes inherently "lossy" formats, which are less desirable for preservation. Certain formats and codecs, such as TIFF and H.264, are available in both lossy and lossless variants.
PDF and Office Suite files are acceptable but ideally accompanied by alternatives such as TXT and CSV.
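The check against the table can be roughly automated by looking at file name extensions. The following is a minimal sketch, not an official tool: the function name `classify`, the three labels it returns, and the inclusion of `.jpg`, `.htm`, `.docx`, `.xlsx`, and `.pptx` are assumptions made for illustration. An extension cannot confirm what is inside a file (for instance, whether an MP4 actually uses the H.264 codec), so treat the result as a first pass.

```python
from pathlib import Path

# Extensions for the open formats listed in the table above.
# ".jpg" and ".htm" are assumed aliases for JPEG and HTML.
OPEN_EXTENSIONS = {
    ".txt", ".csv", ".tsv", ".xml", ".tex", ".html", ".htm",
    ".png", ".jpeg", ".jpg", ".svg", ".flac", ".mp3", ".wav",
    ".mp4", ".mkv", ".zip", ".tar", ".gz",
}

# PDF and Office files: acceptable, ideally accompanied by TXT or CSV.
ACCEPTABLE_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".pptx"}


def classify(filename):
    """Return 'open', 'acceptable', or 'review' based on the extension."""
    extension = Path(filename).suffix.lower()
    if extension in OPEN_EXTENSIONS:
        return "open"
    if extension in ACCEPTABLE_EXTENSIONS:
        return "acceptable"
    # Unknown, proprietary, or missing extension: look at it by hand.
    return "review"
```

Under these assumptions, `classify("survey.dta")` returns `"review"`, signaling that a Stata file would benefit from an open counterpart such as CSV.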
💡 A final tip about file formats: in File Explorer (Windows) or Finder (macOS), make sure to enable visibility of file name extensions! Your file listing should show “Document.docx” instead of “Document”, for example. Windows instructions; macOS instructions.